IMSTAT Procedure (Analytics)

TEXTPARSE Statement

The TEXTPARSE statement performs text analytics on the active in-memory table. You can separate the documents in the table into terms, derive topics based on weighted term frequencies, and project the active table onto the latent space defined by the discovered topics.

See: For background information, see Text Analytics in SAS LASR Analytic Server.

Syntax

TEXTPARSE TXT=text-variable ID=document-ID <options>;

Required Arguments

TXT=text-variable

specifies the name of the variable that contains the text to analyze.

ID=document-ID

specifies the name of the variable that uniquely identifies the documents in the table. The values are typically row numbers or other values that identify the rows. The document ID is needed to join the result tables.

Alias DOCID=

TEXTPARSE Statement Options

CELLWGT= NONE | LOG

specifies how elements in the term × document matrix are weighted. Each element in the matrix is assigned the weight wi * g(fij), where wi is the term weight for the ith term and fij is the frequency with which this term appears in document j.

If CELLWGT=LOG, then g(fij) = log2(fij+1). The logarithmic function tempers the influence of very frequent terms.
Default LOG
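
The cell-weight formula can be checked by hand with a few lines of code. The following is a minimal Python sketch of the arithmetic only, not SAS code and not the server's implementation. The function name cell_weight is hypothetical, and the behavior shown for CELLWGT=NONE (using the raw frequency) is an assumption, because this section defines g only for CELLWGT=LOG.

```python
import math

def cell_weight(term_weight, frequency, cellwgt="LOG"):
    """Return wi * g(fij) for one cell of the term x document matrix."""
    if cellwgt == "LOG":
        g = math.log2(frequency + 1)      # g(fij) = log2(fij + 1), as documented
    elif cellwgt == "NONE":
        g = frequency                     # assumption: no transformation of fij
    else:
        raise ValueError("cellwgt must be 'LOG' or 'NONE'")
    return term_weight * g

# A term that appears 7 times gets g = log2(8) = 3 rather than 7,
# which is how the logarithm tempers very frequent terms.
print(cell_weight(1.0, 7))            # 3.0
print(cell_weight(0.5, 3))            # 0.5 * log2(4) = 1.0
```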

ENTITIES= NONE | STD

determines whether the entity extractor should use the standard list of entities. When ENTITIES=STD, entity extraction is enabled and standard entities are used. Terms such as "George W. Bush" are then recognized as an entity and given the corresponding entity role and attribute. For this example, the entity role is PERSON and the attribute is Entity. Although the entity is treated as a single term, "george w bush," the individual tokens "george," "w," and "bush" are also included.

Default NONE

EXACTWEIGHT

specifies not to round the weights that are aggregated during topic derivation. By default, the calculated weights are rounded to the nearest .001.

Alias NOWTRND

KEEP=(variable-list)

KEEP=variable-name

specifies one or more variables to transfer from the input data to the temporary table with the document projection. You can use _ALL_ for all variables, _NUMERIC_ for all numeric variables, and other valid variable list names. By default, only the document ID (ID=) is transferred to the projected document table so that it can be used to join with the active table.

NONOUNGROUPS

specifies not to use the noun group extractor. By default, the server extracts noun groups and returns maximal groups and subgroups (which do not include groups that contain determiners or prepositions). If stemming is turned on, then noun group elements are also stemmed.

Alias NONG

NOSTEMMING

specifies not to stem words. By default, words are stemmed and terms such as "advises" and "advising" are mapped to the parent term "advise."

Alias NOSTEM

NOTAGGING

specifies not to tag terms. By default, terms are tagged and the server identifies a term's part of speech based on context clues. The identified part of speech is provided in the Role variable of the TERMS table.

NUMLABELS=n

specifies the number of terms to use in labeling a topic. By default, the n = 5 terms with the largest weight are used in constructing a label for the topic.

Alias NLABELS=
Default 5
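
As an illustration of how the n most heavily weighted terms might be selected for a label, here is a minimal Python sketch. It is not the server's algorithm: the function name topic_label, the input dictionary, and the comma-separated label format are all assumptions made for the example.

```python
def topic_label(term_weights, numlabels=5):
    """Build a label from the numlabels terms with the largest weight.

    term_weights maps each term to its weight within one topic.
    Joining the terms with ", " is an assumed label format.
    """
    ranked = sorted(term_weights.items(), key=lambda item: item[1], reverse=True)
    return ", ".join(term for term, _ in ranked[:numlabels])

# Hypothetical weights for a single topic.
weights = {"engine": 0.9, "car": 0.8, "wheel": 0.7,
           "road": 0.6, "tire": 0.5, "fuel": 0.1}
print(topic_label(weights))               # default of 5 terms
print(topic_label(weights, numlabels=2))  # analogous to NUMLABELS=2
```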

REDUCEF=n

specifies the minimum document frequency of terms. By default, n = 4, which means that a term is not kept for analysis unless it occurs in at least four documents.

Default 4
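
The document-frequency threshold counts documents, not occurrences: a term that appears many times in one document still has a document frequency of 1. The following Python sketch illustrates that distinction. The function name terms_to_keep and the list-of-token-lists input are assumptions for the example and do not reflect how the server stores documents.

```python
from collections import Counter

def terms_to_keep(documents, reducef=4):
    """Return the terms that occur in at least reducef documents.

    documents is a list of token lists, one list per document.
    """
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))   # count each term once per document
    return {term for term, n in doc_freq.items() if n >= reducef}

docs = [["cat", "dog"], ["cat", "cat", "bird"], ["cat", "dog"],
        ["cat"], ["dog", "bird"]]
print(terms_to_keep(docs))             # only "cat" appears in 4 documents
print(terms_to_keep(docs, reducef=3))  # "cat" and "dog"
```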

SAVE=table-name

saves the result table so that you can use it in other IMSTAT procedure statements such as STORE, REPLAY, and FREE. The value for table-name must be unique within the scope of the procedure execution. The name of a table that has been freed with the FREE statement can be used again in subsequent SAVE= options.

SELECT <=> (list-of-temporary-tables)

specifies the results the server should store as temporary tables. By default, the server generates the Terms table, which contains terms, their parent-child relationships, and weights. If you specify the NUMTOPICS= option, the server also generates the Topics table. You can specify SELECT=(_ALL_) to generate all of the tables.

The possible values for the list specification are shown in the following table:
| Table Name | Table Alias | Description |
| --- | --- | --- |
| TERMS | TERM | Contains summary information about the terms in the document collection. |
| TERMDOC | BAGOFWORDS, BOW, PARENT, PARENTS | Contains a compressed representation of the sparse term-by-document frequency matrix in transactional style. The matrix is represented as a set of (row, column, value) triples. |
| V | SVDV | Contains the V matrix of the singular-value decomposition. |
| U | SUDV | Contains the rotated U matrix of the singular-value decomposition. |
| PROJECTION | DOCPRO, PROJ | Contains the projections of the columns of the term-by-document frequency matrix onto the columns of U. Because each column of the term-by-document frequency matrix corresponds to a document, the output forms a new representation of the input documents in a space with much lower dimensionality. |
| TOPICS | | Contains the topics and a label constructed from the most highly weighted terms. This is typically a small table, as the number of topics is limited by k, the value of the singular-value decomposition or by the value specified in the NUMTOPICS= option. |
| TERMTOPICS | TERMBYTOPICS | A sparse representation of the terms by topic using the term ID and topic ID. This table might be useful in joins involving terms or topics. |
For information about the tables, see Output Tables for the TEXTPARSE Statement.
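
The transactional (row, column, value) layout described for the TERMDOC table can be pictured with a small Python sketch. This only illustrates the general idea of storing the nonzero cells of a sparse matrix as triples. The function name to_triples and the zero-based indices are assumptions, and the actual column names and numbering in the server's table might differ.

```python
def to_triples(matrix):
    """Convert a dense term-by-document matrix to (row, column, value) triples.

    Rows correspond to terms and columns to documents; zero cells are dropped.
    """
    return [(i, j, value)
            for i, row in enumerate(matrix)
            for j, value in enumerate(row)
            if value != 0]

# Two terms, three documents; most cells are zero.
freq = [[2, 0, 1],
        [0, 0, 3]]
print(to_triples(freq))   # [(0, 0, 2), (0, 2, 1), (1, 2, 3)]
```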

START=table-name

specifies the name of the in-memory table that contains the terms that are to be kept for the analysis. These terms are displayed in the Terms result table with a keep status of "Y." The START= table must have a variable that is named Term and can also have a variable that is named Role.

Interaction If you specify both the START= option and the STOP= option, the STOP= specification takes precedence.

STOP=table-name

specifies the name of the in-memory table that contains the terms to exclude from the analysis. The STOP= table must contain a variable that is named Term and can also have a variable that is named Role.

Interaction If you specify both the START= option and the STOP= option, the STOP= specification takes precedence.

SVD(singular-value-decomposition-options)

specifies how to perform the singular-value decomposition (SVD). The server carries out this decomposition whenever you request a result table that depends on topics, or if you request to save the V or U matrix of the decomposition. You can specify the following SVD options inside the parentheses:

K=k

specifies the number of dimensions to be extracted by SVD. This number is equal to the number of topics for topic generation. If you specify the TOPICS= (or NUMTOPICS=) option, then the value of k is automatically set to match the value given in that option.

If the value of k is too large, then the server might process for an unnecessarily long time.
Default If you request topic generation and do not specify the K= or MAXK= option, then k = 10.
Interaction If you specify both the K= and MAXK= options, the K= option takes precedence.

MAXK=m

specifies the maximum value that the server should return as the recommended value of k. If the RESOLUTION= option is specified to recommend the value of k, then this option limits that value to at most m. The HPTMINE procedure attempts to calculate (as opposed to recommends) k dimensions when it performs the singular-value decomposition.

Interaction If you specify both the K= and MAXK= options, the K= option takes precedence.

RESOLUTION=LOW | MED | HIGH

specifies the recommended number of dimensions (resolution) for the singular value decomposition. If you specify this option, you must also specify the MAXK= option. A low-resolution singular value decomposition returns fewer dimensions than a high-resolution singular value decomposition. This option recommends the value of k (the number of topics) heuristically based on the value specified in the MAXK= option.

Assume that you specify the MAXK=n option and that the singular value decomposition with n dimensions accounts for t% of the total variance. If you specify RES=HIGH, the server always recommends the maximum number of dimensions. That is, k=n. If you specify RES=MED, the server recommends a value for k that explains (5/6) × t% of the total variance. If you specify RES=LOW, the server recommends a value for k that explains (2/3) × t% of the total variance.
Alias RES=
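
The RESOLUTION= heuristic can be made concrete with a short Python sketch. This is an illustration under stated assumptions, not the server's implementation: it assumes that the variance explained by each dimension is proportional to its squared singular value, and that the recommended k is the smallest number of leading dimensions whose cumulative variance reaches the target fraction. The function name recommend_k is hypothetical.

```python
def recommend_k(singular_values, resolution="HIGH"):
    """Recommend k from the MAXK=n singular values (largest first).

    Assumptions: variance is proportional to the squared singular value,
    and k is the smallest count that reaches the target share.
    """
    fraction = {"HIGH": 1.0, "MED": 5 / 6, "LOW": 2 / 3}[resolution]
    n = len(singular_values)
    if resolution == "HIGH":
        return n                          # RES=HIGH always recommends k = n
    variances = [s * s for s in singular_values]
    target = fraction * sum(variances)    # (5/6) or (2/3) of the t% explained by n dimensions
    running = 0.0
    for k, variance in enumerate(variances, start=1):
        running += variance
        if running >= target:
            return k
    return n

# Hypothetical singular values for MAXK=6.
sv = [3.0, 3.0, 2.0, 2.0, 1.0, 1.0]
for res in ("LOW", "MED", "HIGH"):
    print(res, recommend_k(sv, res))
```

With these made-up values, the squared singular values sum to 28, so RES=LOW targets about 18.7 and RES=MED about 23.3, which the sketch reaches with 3 and 4 dimensions respectively.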

TOL=ɛ

specifies the maximum allowable tolerance for the singular value.

Default The value of epsilon on the machine where the server is running.

SYNONYMS=table-name

specifies the name of an in-memory table that contains user-defined synonyms to use in the analysis. The table specifies parent-child relationships that enable you to map child terms to a representative parent. The synonym relationship is indicated in the Terms result table and is also reflected in the term-by-document result table known as the Termdoc or Parent table.

The specified table must have either the two variables Term and Parent, or the four variables Term, Parent, Termrole, and Parentrole. When stemming is enabled (the default), the relationships provided by the SYNONYMS= table take precedence over relationships that are identified through term stemming.
Alias SYN=
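
The precedence rule between user-defined synonyms and stemming can be sketched as a two-step lookup. The following Python sketch is illustrative only: the function name resolve_parent and the dictionary inputs are assumptions, roles are ignored, and the server's actual matching logic is not described in this section.

```python
def resolve_parent(term, synonyms, stems):
    """Return the parent term, letting user synonyms override stemming.

    synonyms maps Term -> Parent from a SYNONYMS=-style table.
    stems maps a term to the parent found by stemming.
    """
    if term in synonyms:
        return synonyms[term]          # user-defined relationship wins
    return stems.get(term, term)       # otherwise fall back to the stem, if any

# Hypothetical mappings.
synonyms = {"advising": "counsel"}
stems = {"advising": "advise", "advises": "advise"}
print(resolve_parent("advising", synonyms, stems))  # counsel
print(resolve_parent("advises", synonyms, stems))   # advise
print(resolve_parent("bush", synonyms, stems))      # bush
```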

TERMWGT=ENTROPY | MI | NONE

specifies how terms are weighted. TERMWGT=ENTROPY specifies that terms are weighted using the entropy formulation. If you specify TERMWGT=MI, then terms are weighted using the mutual information formulation. Specifying TERMWGT=NONE suppresses term weighting. See the documentation for the HPTMINE procedure for the details about computing term weights.

If you specify TERMWGT=MI, then you must specify a target variable with the TARGET= option.
Default ENTROPY

TOPICS=n

specifies the number of topics to generate. When you specify n, the server automatically produces a table of topics with up to n entries. You can also request the Topics table with the SELECT= option. Specifying TOPICS=n is equivalent to requesting topics based on a singular-value decomposition with n=k factors.

Alias NUMTOPICS=
Interaction You can use the NUMLABELS= option to control the number of terms to use in labeling the topic.

TARGET=target-variable

specifies a variable that contains information about the category that a document belongs to. If specified, the target variable is used in computing term weights. For example, it is used with TERMWGT=MI.

TEMPEXPRESS="SAS-expressions"

TEMPEXPRESS=file-reference

specifies either a quoted string that contains the SAS expressions that define the temporary variables or a file reference to an external file with the SAS statements.

Alias TE=

TEMPNAMES=variable-name

TEMPNAMES=(variable-list)

specifies the list of temporary variables for the request. Each temporary variable must be defined through SAS statements that you supply with the TEMPEXPRESS= option.

Alias TN=

Details

ODS Table Names

The TEXTPARSE statement generates the following ODS table.
| ODS Table Name | Description | Option |
| --- | --- | --- |
| TextParseSummary | Summary Information from parsing documents | Default |
The ODS table includes the names of the temporary tables that are requested with the SELECT= option.
For information about using the ODS table with the SAVE= option, see the Details section of the STORE statement.