Create a New Synonym Data Set

You can use the SAS Text Miner TMSPELL procedure to create a new synonym data set. The TMSPELL procedure evaluates all the terms, automatically identifies which terms are misspellings, and creates synonyms that map each misspelled term to its correctly spelled form.
To create a new synonym data set:
  1. Select the Utility tab and drag a SAS Code node into the diagram workspace. Connect the Text Miner — Symptom Text node to the SAS Code node. Right-click the SAS Code node, and select Rename. Type SAS Code — TMSPELL in the Node Name box. Click OK.
    Process flow diagram
  2. Select the arrow that connects the Text Miner — Symptom Text node to the SAS Code — TMSPELL node. Note the value of the Terms export Table property. You will use this value in the DATA= option in the next step.
    Note: The libref EMWS in the Terms export Table property depends on the diagram number within your SAS Enterprise Miner project. If your diagram is the first one created, then the libref will be EMWS, the second diagram will be EMWS1, the third will be EMWS2, and so on.
    Property panel
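    If you are unsure which libref your diagram uses, a query such as the following can help locate the terms table. This is a minimal sketch that assumes the EMWS librefs are already assigned in the session where the code runs (for example, inside a SAS Code node) and that the terms table name ends in TERMS, as text2_terms does in this example.

    ```sas
    /* List every table whose name ends in TERMS in any libref  */
    /* that starts with EMWS. DICTIONARY.TABLES stores libref   */
    /* and member names in uppercase.                           */
    proc sql;
       select libname, memname
          from dictionary.tables
          where libname like 'EMWS%'
            and memname like '%TERMS';
    quit;
    ```

    The libref and table name that the query returns for your diagram are the two parts of the value to use in the next step.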
  3. Select the SAS Code — TMSPELL node, and click the selector button for the Code Editor property in the Properties panel.
  4. Enter the following code in the Code Editor:
    proc tmspell
    	data=emws.text2_terms
    	out=mylib.vaerextsyns
    	dict=mylib.engdict
    	maxchildren=10
    	minparents=8
    	maxspedis=15
    ;
    run;
    
    Code Editor Dialog Box
  5. Click the Save button to save the changes.
  6. Click the Run button to run the SAS Code — TMSPELL node. Click Yes in the Confirmation dialog box.
  7. Click OK in the dialog box that indicates that the node has finished running.
  8. Close the Training Code — Code Node window.
  9. From the SAS Enterprise Miner window, select View, and then select Explorer. The Explorer window opens.
  10. Click Mylib, and then select Vaerextsyns.
    Note: If the Mylib library is already selected and you do not see the Vaerextsyns data set, you might need to click Get Details or refresh the Explorer window to see the Vaerextsyns data set.
  11. Double-click the Mylib.Vaerextsyns table to examine it.
    VAEREXTSYNS Data Set
    Here is a list of what the Vaerextsyns columns provide:
    • Term is the misspelled word.
    • Parent is a guess at the word that was meant.
    • Childndocs is the number of documents that contained that term.
    • # Documents is the number of documents that contained the parent.
    • Minsped is the minimum spelling distance between the term and its parent. Smaller values indicate that the two spellings are closer.
    • Dict indicates whether the term is a legitimate English word. Legitimate words can still be deemed misspellings, but only if they occur rarely and are very close in spelling to a frequent target term.
    For example, Observation 52 shows abdomin to be a misspelling of abdominal. Three documents contain abdomin, while 77 documents contain the parent, abdominal (this is not shown in the image). The term abdomin is not a legitimate English word. An example text that contains the misspelling is "20 mins later, upper !!abdomin!!". Double exclamation marks (!!) precede and follow the child term in the example text so that you can see the term in context.
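    As an alternative to browsing the table in the Explorer window, you can print part of it from a SAS session. The sketch below assumes that the Mylib libref is assigned and that the Term column is stored in a variable named TERM. The variable name is an assumption based on the column heading, so check it (for example, with PROC CONTENTS) before relying on the WHERE statement.

    ```sas
    /* Print the first 20 rows, showing column labels as headings. */
    proc print data=mylib.vaerextsyns(obs=20) label;
    run;

    /* Print only the row or rows for one child term.              */
    /* TERM is an assumed variable name, not confirmed by the      */
    /* documentation above.                                        */
    proc print data=mylib.vaerextsyns label;
       where term = 'abdomin';
    run;
    ```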
  12. Examine the Vaerextsyns table to see whether you disagree with some of the choices made. For this example, however, assume that the TMSPELL procedure has done a good enough job detecting misspellings.
    Note: The Vaerextsyns table can be edited using any SAS table editor. You cannot edit this table in the SAS Enterprise Miner GUI. You can change a parent for any misspellings that appear incorrect or delete a row if the Term column contains a valid term.
  13. Close the Mylib.Vaerextsyns table and the Explorer window.
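If you do decide to correct the synonym data set, the two kinds of edits described in the note to step 12 (deleting a row whose term is valid, and changing the parent of a misspelling) could be made with a DATA step along the following lines. This is only a sketch: TERM and PARENT are assumed variable names taken from the column headings, the quoted term values are placeholders rather than rows from the actual table, and vaerextsyns_edited is a hypothetical output name.

```sas
/* Write the edits to a copy so that the original TMSPELL      */
/* output is preserved. The output name is hypothetical.       */
data mylib.vaerextsyns_edited;
   set mylib.vaerextsyns;

   /* Remove a row whose Term is actually a valid word.        */
   /* 'validterm' is a placeholder value.                      */
   if term = 'validterm' then delete;

   /* Point a misspelling at a different parent.               */
   /* Both quoted values are placeholders.                     */
   if term = 'mispelt' then parent = 'misspelt';
run;
```

Because the edits go to a copy, any later step that reads the synonym data set would need to reference the edited table instead of Mylib.Vaerextsyns for the changes to take effect.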