The TEXTPARSE statement performs text analytics on the active in-memory table. You can separate the documents in the table into terms, derive topics based on weighted term frequencies, and project the active table onto the latent space defined by the discovered topics.
See: | For background information, see Text Analytics in SAS LASR Analytic Server. |
specifies the name of the variable that contains the text to analyze.
specifies the name of the variable that identifies the documents in the table uniquely. The values are typically a row number or other value that identifies the rows. The document ID is important to perform joins of the result tables.
Alias | DOCID= |
specifies how elements in the term × document matrix are weighted. Elements in the matrix are assigned the weight wi * g(fij), where wi is the term weight for the ith term, fij is the frequency of appearance of this term in document j, and g is the cell weighting function that this option selects.
Default | LOG |
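The wi * g(fij) weighting can be illustrated with a short Python sketch. It assumes that the LOG setting corresponds to g(f) = log2(f + 1), a common log weighting that this section does not spell out. The function names, counts, and term weights are made up for illustration.

```python
import math

def log_cell_weight(freq):
    """Assumed form of the LOG cell weight: g(f) = log2(f + 1)."""
    return math.log2(freq + 1)

def weight_matrix(counts, term_weights, g=log_cell_weight):
    """Return the matrix of w_i * g(f_ij) for raw counts f_ij.

    counts[i][j] is the frequency of term i in document j.
    term_weights[i] is the term weight w_i for term i.
    """
    return [
        [w * g(f) for f in row]
        for w, row in zip(term_weights, counts)
    ]

# Two terms, three documents (illustrative data only).
counts = [[1, 0, 3], [0, 7, 0]]
term_weights = [0.5, 1.0]
weighted = weight_matrix(counts, term_weights)
```

A term that does not appear in a document keeps a weight of zero, so the weighted matrix stays as sparse as the count matrix.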
determines whether the entity extractor should use the standard list of entities. When ENTITIES=STD, entity extraction is enabled and standard entities are used. Terms such as "George W. Bush" are then recognized as an entity and given the corresponding entity role and attribute. For this example, the entity role is PERSON and the attribute is Entity. Although the entity is treated as the single term, "george w bush," the individual tokens "george," "w," and "bush" are also included.
Default | NONE |
specifies not to round the weights that are aggregated during topic derivation. By default, the calculated weights are rounded to the nearest .001.
Alias | NOWTRND |
specifies one or more variables to transfer from the input data to the temporary table with the document projection. You can use _ALL_ for all variables, _NUMERIC_ for all numeric variables, and other valid variable list names. By default, only the document ID (ID=) is transferred to the projected document table so that it can be used to join with the active table.
specifies not to use the noun group extractor. By default, the server extracts noun groups and returns maximal groups and subgroups (which do not include groups that contain determiners or prepositions). If stemming is turned on, then noun group elements are also stemmed.
Alias | NONG |
specifies not to stem words. By default, words are stemmed and terms such as "advises" and "advising" are mapped to the parent term "advise."
Alias | NOSTEM |
specifies not to tag terms. By default, terms are tagged and the server identifies a term's part of speech based on context clues. The identified part of speech is provided in the Role variable of the TERMS table.
specifies the number of terms to use in labeling a topic. By default, the n = 5 terms with the largest weight are used in constructing a label for the topic.
Alias | NLABELS= |
Default | 5 |
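As a rough illustration of how a label can be built from the most highly weighted terms, the following Python sketch picks the top n terms for one topic. The function name, the separator, the tie-breaking rule, and the sample weights are assumptions; the server's actual label format is not described here.

```python
def topic_label(term_weights, num_labels=5, sep=", "):
    """Build a label from the num_labels terms with the largest weight.

    term_weights maps each term to its weight in one topic.
    Ties are broken alphabetically (an assumption of this sketch).
    """
    ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return sep.join(term for term, _ in ranked[:num_labels])

# Illustrative weights for a single topic.
weights = {
    "engine": 0.9, "brake": 0.7, "tire": 0.4,
    "wheel": 0.6, "oil": 0.5, "seat": 0.1,
}
label = topic_label(weights)
```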
specifies the minimum document frequency of terms. By default, n = 4 and implies that a term is not kept for analysis unless it occurs in at least four documents.
Default | 4 |
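The effect of the minimum document frequency can be sketched in Python as follows. The function name and the sample documents are hypothetical; the point is that a term counts once per document, regardless of how often it repeats within that document.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_doc_freq=4):
    """Return the terms that occur in at least min_doc_freq documents.

    docs is a list of token lists, one list per document.
    """
    doc_freq = Counter()
    for tokens in docs:
        # set() ensures each term is counted once per document.
        doc_freq.update(set(tokens))
    return {term for term, n in doc_freq.items() if n >= min_doc_freq}

docs = [
    ["engine", "stall", "engine"],
    ["engine", "noise"],
    ["engine", "stall"],
    ["engine", "brake"],
]
kept = filter_by_document_frequency(docs)
```

With the default of four, only "engine" survives in this sample; lowering the threshold to two would also keep "stall".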
saves the result table so that you can use it in other IMSTAT procedure statements like STORE, REPLAY, and FREE. The value for table-name must be unique within the scope of the procedure execution. The name of a table that has been freed with the FREE statement can be used again in subsequent SAVE= options.
specifies the results the server should store as temporary tables. By default, the server generates the Terms table, which contains terms, their parent-child relationships, and weights. If you specify the NUMTOPICS= option, the server also generates the Topics table. You can specify SELECT=(_ALL_) to generate all of the tables.
Table Name | Table Alias | Description |
---|---|---|
TERMS | TERM | Contains summary information about the terms in the document collection. |
TERMDOC | BAGOFWORDS, BOW, PARENT, PARENTS | Contains a compressed representation of the sparse term-by-document frequency matrix in transactional style. The matrix is represented as a set of (row, column, value) triples. |
V | SVDV | Contains the V matrix of the singular-value decomposition. |
U | SUDV | Contains the rotated U matrix of the singular-value decomposition. |
PROJECTION | DOCPRO, PROJ | Contains the projections of the columns of the term-by-document frequency matrix onto the columns of U. Because each column of the term-by-document frequency matrix corresponds to a document, the output forms a new representation of the input documents in a space with much lower dimensionality. |
TOPICS | | Contains the topics and a label constructed from the most highly weighted terms. This is typically a small table, as the number of topics is limited by k, the value of the singular-value decomposition or by the value specified in the NUMTOPICS= option. |
TERMTOPICS | TERMBYTOPICS | A sparse representation of the terms by topic using the term ID and topic ID. This table might be useful in joins involving terms or topics. |
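The transactional (row, column, value) layout described for the TERMDOC table can be illustrated with a small Python sketch. The zero-based IDs, the function names, and the sample matrix are assumptions made for illustration; they do not describe the actual column layout of the server's table.

```python
def to_triples(counts):
    """Convert a dense term-by-document count matrix into
    (term_id, doc_id, count) triples, skipping zero entries."""
    return [
        (i, j, f)
        for i, row in enumerate(counts)
        for j, f in enumerate(row)
        if f != 0
    ]

def to_dense(triples, n_terms, n_docs):
    """Rebuild the dense matrix from the triples."""
    dense = [[0] * n_docs for _ in range(n_terms)]
    for i, j, f in triples:
        dense[i][j] = f
    return dense

counts = [[2, 0, 0], [0, 0, 1], [0, 3, 0]]
triples = to_triples(counts)
```

Only the nonzero cells are stored, which is why this form is compact for the typically very sparse term-by-document matrix.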
specifies the name of the in-memory table that contains the terms that are to be kept for the analysis. These terms are displayed in the Terms result table with a keep status of "Y." The START= table must have a variable that is named Term and can also have a variable that is named Role.
Interaction | If you specify both the START= option and the STOP= option, the STOP= specification takes precedence. |
specifies the name of the in-memory table that contains the terms to exclude from the analysis. The STOP= table must contain a variable that is named Term and can also have a variable that is named Role.
Interaction | If you specify both the START= option and the STOP= option, the STOP= specification takes precedence. |
specifies how to perform the singular-value decomposition (SVD). The server carries out this decomposition whenever you request a result table that depends on topics, or if you request to save the V or U matrix of the decomposition. You can specify the following SVD options inside the parentheses:
specifies the number of dimensions to be extracted by SVD. This number is equal to the number of topics for topic generation. If you specify the TOPICS= (NUMTOPICS= ) option, then the value of k is automatically set to match the value given in the TOPICS= option.
Default | If you request topic generation and do not specify the K= or MAXK= option, then k = 10. |
Interaction | If you specify both the K= and MAXK= options, the K= option takes precedence. |
specifies the maximum value that the server should return as the recommended value of k. If the RESOLUTION= option is specified to recommend the value of k, then this option limits that value to at most m. The HPTMINE procedure attempts to calculate (as opposed to recommends) k dimensions when it performs the singular-value decomposition.
Interaction | If you specify both the K= and MAXK= options, the K= option takes precedence. |
specifies the recommended number of dimensions (resolution) for the singular value decomposition. If you specify this option, you must also specify the MAXK= option. A low-resolution singular value decomposition returns fewer dimensions than a high-resolution singular value decomposition. This option recommends the value of k (the number of topics) heuristically based on the value specified in the MAXK= option.
Alias | RES= |
specifies the maximum allowable tolerance for the singular value.
Default | The value of epsilon on the machine where the server is running. |
specifies the name of an in-memory table that contains user-defined synonyms to use in the analysis. The table specifies parent-child relationships that enable you to map child terms to a representative parent. The synonym relationship is indicated in the Terms result table and is also reflected in the term-by-document result table known as the Termdoc or Parent table.
Alias | SYN= |
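The parent-child mapping can be pictured with the following Python sketch. It assumes that counts for child terms are folded into the parent term, which is one plausible reading of how the relationship is reflected in the term-by-document table. The function name and the sample synonym list are hypothetical.

```python
from collections import Counter

def apply_synonyms(tokens, synonyms):
    """Count tokens after replacing each child term with its parent.

    synonyms maps a child term to its parent term. Terms without an
    entry are treated as their own parent.
    """
    return Counter(synonyms.get(token, token) for token in tokens)

# Hypothetical synonym list: child -> parent.
synonyms = {"auto": "car", "automobile": "car"}
tokens = ["car", "auto", "automobile", "truck"]
term_counts = apply_synonyms(tokens, synonyms)
```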
specifies how terms are weighted. TERMWGT=ENTROPY specifies that terms are weighted using the entropy formulation. If you specify TERMWGT=MI, then terms are weighted using the mutual information formulation. Specifying TERMWGT=NONE suppresses term weighting. See the documentation for the HPTMINE procedure for the details about computing term weights.
Default | ENTROPY |
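For orientation, the Python sketch below computes a widely used entropy term weight, w = 1 + sum over j of p_j * log2(p_j) / log2(n), where p_j is the share of the term's occurrences that fall in document j and n is the number of documents. Whether the server uses exactly this formulation is an assumption; the HPTMINE documentation remains the authoritative source.

```python
import math

def entropy_term_weight(freqs):
    """Entropy weight for one term (assumed log-entropy formulation).

    freqs holds the term's frequency in each of the n documents.
    The weight is 1.0 when the term is concentrated in one document
    and 0.0 when it is spread evenly over all documents.
    """
    n = len(freqs)
    total = sum(freqs)
    if n < 2 or total == 0:
        raise ValueError("need at least two documents and one occurrence")
    entropy_sum = sum(
        (f / total) * math.log2(f / total) for f in freqs if f > 0
    )
    return 1.0 + entropy_sum / math.log2(n)

concentrated = entropy_term_weight([4, 0, 0, 0])
spread = entropy_term_weight([1, 1, 1, 1])
```

Under this formulation, terms that appear uniformly across the collection carry little discriminating information and receive weights near zero.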
specifies the number of topics to generate. When you specify n, the server automatically produces a table of topics with up to n entries. You can also request the Topics table with the SELECT= option. Specifying TOPICS=n is equivalent to requesting topics based on a singular-value decomposition with n=k factors.
Alias | NUMTOPICS= |
Interaction | You can use the NUMLABELS= option to control the number of terms to use in labeling the topic. |
specifies a variable that contains information about the category that a document belongs to. If specified, the target variable is used in computing term weights. For example, it is used with TERMWGT=MI.
specifies either a quoted string that contains the SAS expression that defines the temporary variables or a file reference to an external file with the SAS statements.
Alias | TE= |
specifies the list of temporary variables for the request. Each temporary variable must be defined through SAS statements that you supply with the TEMPEXPRESS= option.
Alias | TN= |
ODS Table Name | Description | Option |
---|---|---|
TextParseSummary | Summary Information from parsing documents | Default |