Processing a Large Collection of Documents

Using the text mining nodes to process a large collection of documents can require a lot of computing time and resources. If you have limited resources, it might be necessary to take one or more of the following actions:
  • Use a sample of the document collection.
  • When using the Text Miner node, set some of the Parse properties to No, such as Find Entities, Noun Groups, and Terms in Single Document.
  • When using the Text Parsing node, set some of the Detect properties to No, such as Find Entities and Noun Groups.
  • In the Text Miner node, reduce the number of SVD dimensions or the number of roll-up terms. If the SVD approach runs into memory problems, you can roll up a specified number of terms instead; the remaining terms are automatically dropped.
  • Use the Ignore properties of the Text Parsing node to limit parsing to high-information words. You can do this by ignoring all parts of speech other than nouns, proper nouns, noun groups, and verbs.
  • Use the Parse properties of the Text Miner node in the same way, to ignore all parts of speech other than nouns, proper nouns, noun groups, and verbs.
  • For best results, use documents whose sentences are properly structured, with correct grammar, punctuation, and capitalization. Entity extraction does not always generate reasonable results.
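
Two of the actions above, sampling the collection and keeping only high-information parts of speech, can be illustrated outside the text mining nodes. The following is a minimal Python sketch, not the nodes' own implementation. The function names, the seed value, and the part-of-speech labels in HIGH_INFORMATION_TAGS are hypothetical and chosen only for illustration, and the terms are assumed to be already tagged.

```python
import random

# Hypothetical part-of-speech labels for illustration only; the text mining
# nodes use their own role names.
HIGH_INFORMATION_TAGS = {"Noun", "Proper Noun", "Noun Group", "Verb"}


def sample_documents(documents, fraction, seed=12345):
    """Return a reproducible simple random sample of a document list."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    if not documents:
        return []
    # Keep at least one document, and never more than are available.
    size = min(len(documents), max(1, round(len(documents) * fraction)))
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(documents, size)


def keep_high_information_terms(tagged_terms):
    """Drop every (term, tag) pair whose tag is not a high-information tag."""
    return [(term, tag) for term, tag in tagged_terms
            if tag in HIGH_INFORMATION_TAGS]


if __name__ == "__main__":
    collection = [f"document {i}" for i in range(1000)]
    subset = sample_documents(collection, fraction=0.1)
    print(len(subset), "documents sampled")

    tagged = [("the", "Det"), ("engine", "Noun"), ("failed", "Verb"),
              ("quickly", "Adv"), ("Acme", "Proper Noun")]
    print(keep_high_information_terms(tagged))
```

Sampling reduces the number of documents that must be parsed, while the part-of-speech filter reduces the number of distinct terms that each remaining document contributes. The two reductions are independent and can be combined.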