Processing a Large Collection of Documents

Using SAS Text Miner nodes to process a large collection of documents can require a lot of computing time and resources. If you have limited resources, it might be necessary to take one or more of the following actions:
  • Use a sample of the document collection.
  • Set some of the parse properties to No or None, such as Noun Groups or Find Entities.
  • Reduce the number of SVD dimensions or roll-up terms. If the SVD approach runs into memory problems, you can instead roll up a fixed number of terms; the remaining terms are then dropped automatically.
  • Limit parsing to high-information words by turning off all parts of speech other than nouns, proper nouns, noun groups, and verbs.
  • For best results, make sure that sentences are properly structured, with correct grammar, punctuation, and capitalization. Entity extraction in particular does not always generate reasonable results.
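
The first and third actions above (sampling the collection and keeping only a fixed number of terms) can be illustrated outside the product. The following is a minimal Python sketch, not SAS Text Miner code: the function names sample_documents and keep_top_terms are hypothetical, terms are split on whitespace, and ranking terms by document frequency is an assumption made for illustration rather than the weighting that SAS Text Miner applies when it rolls up terms.

```python
import random
from collections import Counter


def sample_documents(docs, fraction, seed=0):
    """Return a simple random sample of docs, drawn without replacement.

    Hypothetical helper: `fraction` is the share of documents to keep,
    and `seed` makes the sample reproducible between runs.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not docs:
        return []
    k = max(1, round(len(docs) * fraction))
    return random.Random(seed).sample(docs, k)


def keep_top_terms(docs, max_terms):
    """Keep only the max_terms terms that appear in the most documents.

    Assumption: terms are lowercase whitespace-separated tokens and are
    ranked by document frequency; ties are broken alphabetically so the
    result is deterministic. All other terms are dropped.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = {term for term, _ in ranked[:max_terms]}
    return [[t for t in doc.lower().split() if t in kept] for doc in docs]


if __name__ == "__main__":
    collection = [f"report number {i} about text mining" for i in range(1000)]
    subset = sample_documents(collection, fraction=0.05, seed=42)
    print(len(subset), "documents sampled")
    print(keep_top_terms(subset[:2], max_terms=3))
```

Fixing the seed keeps the sample stable across runs, which makes it easier to compare results when you later change parse or SVD settings on the same subset.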