Using the Text Filter Node

This example shows how to filter out terms using the Text Filter node. It assumes that you have completed the example in Using the Text Parsing Node, and it builds on the process flow diagram created there.
  1. Select the Text Mining tab on the toolbar, and drag a Text Filter node into the diagram workspace.
  2. Connect the Text Parsing node to the Text Filter node.
    Process Flow Diagram
  3. In the diagram workspace, right-click the Text Filter node and select Run. Click Yes in the Confirmation dialog box.
  4. Click Results in the Run Status dialog box when the node finishes running.
  5. Select the Terms table. Sort the terms by frequency by clicking the Freq column heading.
    Terms table
    Assume that, for this text mining analysis, you know that “software” and “application” are used as synonyms in the documents that you want to analyze, and that you want to treat them as the same term.
  6. Close the Results window. Select the Text Filter node, and then click the Ellipses icon for the Filter Viewer property.
  7. In the Interactive Filter Viewer, sort the terms in the Terms table by frequency. Hold down the Ctrl key, select “software” and “application”, and then right-click “software” and select Treat as Synonyms from the drop-down menu.
    Terms table
  8. In the Create Equivalent Terms dialog box, select software as the term to represent both terms in the Terms table.
    Create Equivalent Terms dialog box
  9. Click OK in the Create Equivalent Terms dialog box. Notice that the term “software” now represents both terms in the Terms table. Expand the term “software”.
    Terms table
  10. Close the Interactive Filter Viewer. When prompted whether you would like to save your changes, select Yes.
  11. Right-click the Text Filter node, and select Run. Select Yes in the Confirmation dialog box. Select Results in the Run Status dialog box when the node has finished running.
  12. Select the Number of Documents by Frequency plot to see how both terms are now treated as the same.
    Number of Documents by Frequency plot
    You can also use options to change your view or to specify a subset of results to appear in a plot. For example, suppose that you want to refine this plot to show only terms that appear in more than 200 documents.
  13. Right-click the Number of Documents by Frequency plot, and select Data Options.
  14. Select the Where tab in the Data Options Dialog box. Select # Docs from the Column name drop-down menu. Select Greater than from the Operator drop-down menu. Type 200 in the Value text box.
    Data Options Dialog box
  15. Click Apply, and then click OK. The Number of Documents by Frequency plot resizes and includes only terms that occur in more than 200 documents.
    Number of Documents by Frequency plot
  16. Close the Results window. In addition to resizing or subsetting a plot to help focus your analysis, you can also directly search for terms using the Interactive Filter Viewer.
  17. Select the Text Filter node, and then click the Ellipses icon for the Filter Viewer property. In the Interactive Filter Viewer, type software in the Search text box, and click Apply.
    Interactive Filter Viewer
    The Documents table provides a snippet of text that includes the term that you are searching for. You can use information in the Documents table to help you understand the context in which a term is being used. To do so, examine the snippet result in addition to the full text and title of the document. For more information about the Interactive Filter Viewer, see the Interactive Filter Viewer topic in the SAS Text Miner Help.
    Searching for a term in the Interactive Filter Viewer raises an interesting problem. As shown above, a search for “software” is case insensitive. However, what if some instances of the term that you want to find are misspelled in the document collection? To handle this case, you can check spelling when filtering terms, optionally using a dictionary data set.
  18. Close the Interactive Filter Viewer, and select No when prompted to save your changes.
  19. (Optional) Select the Text Filter node, and set the Check Spelling property to Yes. When you rerun the Text Filter node, terms are checked for misspellings. You can also specify a data set to use in spell-checking by clicking the Ellipses icon for the Dictionary property and selecting a data set. For information about creating a dictionary data set, see the How to Create a Dictionary Data Set topic in the SAS Text Miner Help.
    Right-click the Text Filter node, and select Run. Select Yes in the Confirmation dialog box. When the node finishes running, select OK in the Run Status dialog box. Click the Ellipses icon for the Spell-Checking Results property to access a window in which you can view the data set that contains spelling corrections that were generated during spell-checking. For example, the term “softwae” is identified as a misspelling of the term “software.”
    Spell Checking Results
    You can see this relationship in the Terms table in the Interactive Filter Viewer. Click the Ellipses icon for the Filter Viewer property. Expand the term “software” in the Terms table to view its synonyms. The synonyms include “softwae,” which was identified as a misspelled term during spell-checking.
    Terms table
    Notice that the synonyms also include “application,” which you assigned as a synonym of “software” in steps 7-10 of this example, and “applicaion,” which was identified during spell-checking as a misspelling of “application.”
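
To make the effect of these steps concrete, the following Python sketch illustrates the general idea of folding synonyms and misspellings into one representative term, and then keeping only terms that occur in more than a given number of documents. It is an illustration only, not how SAS Text Miner is implemented. The function names (resolve, merge_equivalent_terms, terms_above), the toy document IDs, and the parent mapping are all hypothetical, and the sketch assumes that a merged term's document count is the number of distinct documents that contain any of its variants.

```python
from collections import defaultdict


def resolve(term, parent_of):
    """Follow parent links until reaching a term that has no parent."""
    seen = set()
    while term in parent_of and term not in seen:  # guard against cycles
        seen.add(term)
        term = parent_of[term]
    return term


def merge_equivalent_terms(term_docs, parent_of):
    """Union each term's document IDs into its representative term."""
    merged = defaultdict(set)
    for term, docs in term_docs.items():
        merged[resolve(term, parent_of)] |= docs
    return dict(merged)


def terms_above(merged, min_docs):
    """Keep terms that occur in more than min_docs documents."""
    return {t: len(d) for t, d in merged.items() if len(d) > min_docs}


# Hypothetical toy data: term -> IDs of the documents that contain it.
term_docs = {
    "software": {1, 2, 3},
    "application": {3, 4},
    "softwae": {5},
    "applicaion": {6},
    "data": {1, 6},
}
# Hypothetical mapping: synonym or misspelling -> term that represents it.
parent_of = {
    "application": "software",
    "softwae": "software",
    "applicaion": "application",
}

merged = merge_equivalent_terms(term_docs, parent_of)
frequent = terms_above(merged, min_docs=2)
```

In this toy data, document 3 contains both “software” and “application”, so it is counted once under the merged term. The threshold in terms_above plays the role of the Greater than 200 condition on # Docs in step 14, scaled down to fit the small example.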