IMSTAT Procedure (Analytics)

TEXTPARSE Statement

The TEXTPARSE statement performs text analytics on the active in-memory table. You can separate the documents in the table into terms, derive topics based on weighted term frequencies, and project the active table onto the latent space defined by the discovered topics.

Syntax

TEXTPARSE TXT=text-variable ID=document-ID <options>;

Required Arguments

TXT=text-variable

specifies the name of the variable that contains the text to analyze.

ID=document-ID

specifies the name of the variable that uniquely identifies the documents in the table. The values are typically a row number or another value that identifies the rows. The document ID is needed to join the result tables.

Alias

DOCID=

TEXTPARSE Statement Options

CELLWGT= NONE | LOG

specifies how elements in the term × document matrix are weighted. Each element is assigned the weight w_i * g(f_ij), where w_i is the term weight for the ith term and f_ij is the frequency of the ith term in document j.

If CELLWGT=LOG, then g(f_ij) = log₂(f_ij+1). The logarithmic function tempers the influence of very frequent terms.

Default

LOG
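
The weighting rule can be illustrated with a short Python sketch. This is not server code: the function name cell_weight is hypothetical, and treating CELLWGT=NONE as g(f_ij) = f_ij (no transformation) is an assumption, because only the LOG case is defined above.

```python
import math

def cell_weight(term_weight, frequency, cellwgt="LOG"):
    """Return w_i * g(f_ij) for one cell of the term-by-document matrix."""
    if cellwgt == "LOG":
        g = math.log2(frequency + 1)  # g(f_ij) = log2(f_ij + 1)
    elif cellwgt == "NONE":
        g = frequency  # assumption: raw frequency, no transformation
    else:
        raise ValueError("cellwgt must be 'LOG' or 'NONE'")
    return term_weight * g
```

With a term weight of 1, a frequency of 3 yields a cell weight of 2 and a frequency of 7 yields 3, which shows how the logarithm tempers very frequent terms.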

ENTITIES= NONE | STD

determines whether the entity extractor should use the standard list of entities. When ENTITIES=STD, entity extraction is enabled and standard entities are used. A term such as "George W. Bush" is then recognized as an entity and given the corresponding entity role and attribute. For this example, the entity role is PERSON and the attribute is Entity. Although the entity is treated as a single term, "george w bush," the individual tokens "george," "w," and "bush" are also included.

Default

NONE

EXACTWEIGHT

specifies not to round the weights that are aggregated during topic derivation. By default, the calculated weights are rounded to the nearest 0.001.

Alias

NOWTRND

KEEP=(variable-list)

KEEP=variable-name

specifies one or more variables to transfer from the input data to the temporary table with the document projection. You can use _ALL_ for all variables, _NUMERIC_ for all numeric variables, and other valid variable list names. By default, only the document ID (ID=) is transferred to the projected document table so that it can be used to join with the active table.

NONOUNGROUPS

specifies not to use the noun group extractor. By default, the server extracts noun groups and returns maximal groups and subgroups (which do not include groups that contain determiners or prepositions). If stemming is turned on, then noun group elements are also stemmed.

Alias

NONG

NOSTEMMING

specifies not to stem words. By default, words are stemmed and terms such as "advises" and "advising" are mapped to the parent term "advise."

Alias

NOSTEM

NOTAGGING

specifies not to tag terms. By default, terms are tagged and the server identifies a term's part of speech based on context clues. The identified part of speech is provided in the Role variable of the TERMS table.

NUMLABELS=n

specifies the number of terms to use in labeling a topic. By default, the n = 5 terms with the largest weight are used in constructing a label for the topic.

Alias	NLABELS=
Default	5
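
As a rough illustration of label construction, the following Python sketch picks the n terms with the largest weight for a topic. The function name topic_label and the comma-separated label format are assumptions; the format that the server actually uses for labels is not described here.

```python
def topic_label(term_weights, numlabels=5):
    """Build a label from the numlabels terms with the largest weight."""
    # Sort (term, weight) pairs by weight, largest first.
    ranked = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)
    # Assumption: the label is the top terms joined by commas.
    return ", ".join(term for term, _ in ranked[:numlabels])
```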

REDUCEF=n

specifies the minimum document frequency of terms. By default, n = 4, which means that a term is kept for analysis only if it occurs in at least four documents.

Default

4
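
The document-frequency filter can be sketched in Python as follows. This is only an illustration of the rule described above: the function name reduce_terms and the representation of each document as a list of terms are assumptions.

```python
from collections import Counter

def reduce_terms(documents, reducef=4):
    """Return the terms that occur in at least reducef documents."""
    doc_freq = Counter()
    for terms in documents:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(terms))
    return {term for term, n in doc_freq.items() if n >= reducef}
```

Note that the filter counts documents, not occurrences: a term that appears ten times in a single document still has a document frequency of 1.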

SAVE=table-name

saves the result table so that you can use it in other IMSTAT procedure statements like STORE, REPLAY, and FREE. The value for table-name must be unique within the scope of the procedure execution. The name of a table that has been freed with the FREE statement can be used again in subsequent SAVE= options.

SELECT <=> (list-of-temporary-tables)

specifies the results the server should store as temporary tables. By default, the server generates the Terms table, which contains terms, their parent-child relationships, and weights. If you specify the NUMTOPICS= option, the server also generates the Topics table. You can specify SELECT=(_ALL_) to generate all of the tables.

The possible values for the list specification are shown in the following table:

Table Name	Table Alias	Description
TERMS	TERM	Contains summary information about the terms in the document collection.
TERMDOC	BAGOFWORDS BOW PARENT PARENTS	Contains a compressed representation of the sparse term-by-document frequency matrix in transactional style. The matrix is represented as a set of (row, column, value) triples.
V	SVDV	Contains the V matrix of the singular-value decomposition.
U	SVDU	Contains the rotated U matrix of the singular-value decomposition.
PROJECTION	DOCPRO PROJ	Contains the projections of the columns of the term-by-document frequency matrix onto the columns of U. Because each column of the term-by-document frequency matrix corresponds to a document, the output forms a new representation of the input documents in a space with much lower dimensionality.
TOPICS		Contains the topics and a label constructed from the most highly weighted terms. This is typically a small table, because the number of topics is limited by k, the number of dimensions of the singular-value decomposition, or by the value specified in the NUMTOPICS= option.
TERMTOPICS	TERMBYTOPICS	A sparse representation of the terms by topic using the term ID and topic ID. This table might be useful in joins involving terms or topics.
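
To make the transactional (row, column, value) layout of the TERMDOC table concrete, here is a minimal Python sketch that builds such triples from tokenized documents. The function name term_doc_triples, the term-ID mapping, and the ordering of the triples are assumptions for illustration; the variable names and layout of the actual result table are not shown here.

```python
from collections import Counter

def term_doc_triples(documents, term_ids):
    """Return (term ID, document ID, frequency) triples for nonzero cells."""
    triples = []
    for doc_id, terms in documents.items():
        # Count only terms that have an ID (that is, terms kept for analysis).
        counts = Counter(t for t in terms if t in term_ids)
        for term, freq in sorted(counts.items()):
            triples.append((term_ids[term], doc_id, freq))
    return triples
```

Only nonzero cells are stored, which is what makes the representation compact for a sparse matrix.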

START=table-name

specifies the name of the in-memory table that contains the terms that are to be kept for the analysis. These terms are displayed in the Terms result table with a keep status of "Y." The START= table must have a variable that is named Term and can also have a variable that is named Role.

Interaction

If you specify both the START= and the STOP= options, the STOP= specification takes precedence.

STOP=table-name

specifies the name of the in-memory table that contains the terms to exclude from the analysis. The STOP= table must contain a variable that is named Term and can also have a variable that is named Role.

Interaction

If you specify both the START= and the STOP= options, the STOP= specification takes precedence.
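
The precedence rule can be pictured with a small Python sketch. It is a simplification under stated assumptions: the function name keep_term is hypothetical, the Role variable is ignored, and a term that is absent from a specified START= list is assumed to be dropped.

```python
def keep_term(term, start=None, stop=None):
    """Decide whether a term is kept, given optional start and stop lists."""
    if stop is not None and term in stop:
        return False  # STOP= takes precedence over START=
    if start is not None:
        return term in start  # assumption: only listed terms are kept
    return True  # neither list excludes the term
```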

SVD(singular-value-decomposition-options)

specifies how to perform the singular-value decomposition (SVD). The server carries out this decomposition whenever you request a result table that depends on topics, or if you request to save the V or U matrix of the decomposition. You can specify the following SVD options inside the parentheses:

K=k

specifies the number of dimensions to be extracted by SVD. This number is equal to the number of topics for topic generation. If you specify the TOPICS= (NUMTOPICS= ) option, then the value of k is automatically set to match the value given in the TOPICS= option.

If the value of k is too large, then the server might process for an unnecessarily long time.

Default	If you request topic generation and do not specify the K= or MAXK= option, then k = 10.
Interaction	If you specify both the K= and MAXK= options, the K= option takes precedence.

MAXK=m

specifies the maximum value that the server should return as the recommended value of k. If the RESOLUTION= option is specified to recommend the value of k, then this option limits that value to at most m. The HPTMINE procedure attempts to calculate (as opposed to recommend) k dimensions when it performs the singular-value decomposition.

Interaction

If you specify both the K= and MAXK= options, the K= option takes precedence.

RESOLUTION=LOW | MED | HIGH

specifies the recommended number of dimensions (resolution) for the singular value decomposition. If you specify this option, you must also specify the MAXK= option. A low-resolution singular value decomposition returns fewer dimensions than a high-resolution singular value decomposition. This option recommends the value of k (the number of topics) heuristically based on the value specified in the MAXK= option.

Assume that you specify MAXK=n and that the singular value decomposition with n dimensions accounts for t% of the total variance. If you specify RES=HIGH, the server always recommends the maximum number of dimensions. That is, k=n. If you specify RES=MED, the server recommends a value for k that explains (5/6) × t% of the total variance. If you specify RES=LOW, the server recommends a value for k that explains (2/3) × t% of the total variance.

Alias

RES=
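
The heuristic can be sketched in Python as follows. This is an illustration, not the server's algorithm: the function name recommend_k is hypothetical, the cumulative percentages of explained variance for 1 through n dimensions are taken as given input, and choosing the smallest k that reaches the target share is an assumption.

```python
def recommend_k(cumulative_pct, resolution="HIGH"):
    """Recommend k from cumulative % variance explained by 1..n dimensions."""
    n = len(cumulative_pct)  # n plays the role of MAXK=
    t = cumulative_pct[-1]   # t% explained by all n dimensions
    if resolution == "HIGH":
        return n             # always the maximum number of dimensions
    # Target share of t: 5/6 for MED, 2/3 for LOW.
    num, den = {"MED": (5, 6), "LOW": (2, 3)}[resolution]
    for k, pct in enumerate(cumulative_pct, start=1):
        if pct * den >= num * t:  # same as pct >= (num/den) * t
            return k              # assumption: smallest k reaching the target
    return n
```

For example, if six dimensions explain 30, 50, 61, 66, 70, and 72 percent cumulatively, then t = 72, the MED target is 60 percent, and the LOW target is 48 percent.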

TOL=ɛ

specifies the maximum allowable tolerance for the singular value.

Default

The value of epsilon on the machine where the server is running.

SYNONYMS=table-name

specifies the name of an in-memory table that contains user-defined synonyms to use in the analysis. The table specifies parent-child relationships that enable you to map child terms to a representative parent. The synonym relationship is indicated in the Terms result table and is also reflected in the term-by-document result table known as the Termdoc or Parent table.

The specified table must have either the two variables Term and Parent, or the four variables Term, Parent, Termrole, and Parentrole. When stemming is enabled (the default), the relationships provided by the SYNONYMS= table take precedence over relationships that are identified through term stemming.

Alias

SYN=
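
The precedence of user-defined synonyms over stemming can be sketched in Python. The function name parent_term and the dictionary representation of both mappings are assumptions, and term roles (Termrole, Parentrole) are ignored for brevity.

```python
def parent_term(term, synonyms, stems):
    """Map a term to its parent; user synonyms win over stemming."""
    if term in synonyms:
        return synonyms[term]      # SYNONYMS= relationship takes precedence
    return stems.get(term, term)   # otherwise use the stem, or the term itself
```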

TERMWGT=ENTROPY | MI | NONE

specifies how terms are weighted. TERMWGT=ENTROPY specifies that terms are weighted using the entropy formulation. If you specify TERMWGT=MI, then terms are weighted using the mutual information formulation. Specifying TERMWGT=NONE suppresses term weighting. See the documentation for the HPTMINE procedure for the details about computing term weights.

If you specify TERMWGT=MI, then you must specify a target variable with the TARGET= option.

Default

ENTROPY

TOPICS=n

specifies the number of topics to generate. When you specify n, the server automatically produces a table of topics with up to n entries. You can also request the Topics table with the SELECT= option. Specifying TOPICS=n is equivalent to requesting topics based on a singular-value decomposition with k = n factors.

Alias	NUMTOPICS=
Interaction	You can use the NUMLABELS= option to control the number of terms to use in labeling the topic.

TARGET=target-variable

specifies a variable that contains information about the category that a document belongs to. If specified, the target variable is used in computing term weights. For example, it is used with TERMWGT=MI.

TEMPEXPRESS="SAS-expressions"

TEMPEXPRESS=file-reference

specifies either a quoted string that contains the SAS expression that defines the temporary variables or a file reference to an external file with the SAS statements.

Alias

TE=

TEMPNAMES=variable-name

TEMPNAMES=(variable-list)

specifies the list of temporary variables for the request. Each temporary variable must be defined through SAS statements that you supply with the TEMPEXPRESS= option.

Alias

TN=