Pattern Logic Node

DataFlux Data Management Studio 2.7: User Guide

The Pattern Logic Node uses the categories previously assigned to input words by morphological analysis to generate possible parse solutions. At the end of the morph analysis step, each word (or substring) in the input has been categorized. However, it is likely that:

a single word may have more than one possible category, and
categories by themselves, even if unambiguous, do not give the final answer expected

The Pattern Logic Node solves both of these problems by considering each word and its categories in the context provided by the full string, and with the aid of a grammar that supplies information about the allowed structure of inputs.

Used in:

Within Locale Guess definitions, chopped words matched in one pattern logic node are not considered in subsequent pattern logic nodes. For example, suppose there are two pattern logic nodes and the chopped words are: this, is, an, example. If the first pattern logic node finds a pattern match for the words ‘this’ and ‘is’, these two words are not searched for in the second pattern logic node. Even if the second pattern logic node would otherwise match ‘this’ and ‘is’, that match never happens, because the first node has already matched those words.
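The rule that earlier nodes hide matched words from later nodes can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not DataFlux code: the function name run_pattern_nodes is hypothetical, and each node is reduced to the set of words it would match.

```python
def run_pattern_nodes(chopped_words, nodes):
    """Run nodes in order; words matched by a node are hidden from later nodes.

    `nodes` is a list of word sets, a hypothetical stand-in for real patterns.
    """
    remaining = list(chopped_words)
    matches = []  # (node index, words matched by that node)
    for index, node_words in enumerate(nodes):
        found = [w for w in remaining if w in node_words]
        if found:
            matches.append((index, found))
            # Matched words are removed before the next node runs.
            remaining = [w for w in remaining if w not in found]
    return matches, remaining


words = ["this", "is", "an", "example"]
first_node = {"this", "is"}
second_node = {"this", "is", "an"}  # would match 'this' and 'is' if it saw them
matches, remaining = run_pattern_nodes(words, [first_node, second_node])
```

In this sketch the second node only ever sees 'an' and 'example', so its potential match on 'this' and 'is' never occurs.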

Properties

All Definition Types

The following properties are common to all definitions that include Pattern Logic Nodes.

Grammar

Select a grammar to use. A grammar describes all the categories that words might have within a particular context, such as names or addresses. A grammar also describes the relationships between categories. Some categories are derived from a combination of others, and there may be many different ways to derive a particular category.

By default, only files from the locale and its ancestors appear in the drop-down. If you do not see the desired library, click Tools > Data Management Studio Options. From the column on the left, select QKB Definition Editor. Then, in the pane on the right under Library files, click Show files for all locales. QKB files from all locales will be included.

Click Open grammar to edit the selected file.

For example, in the United States, a phrase that represents a person's family name (the derived category) can commonly be seen in the following constructions:

single family-name word (for example, "John SMITH")

two family-name words (for example, "Helena BONHAM CARTER")

two family-name words with a hyphen (for example, "Mary JONES-SMITH")

multiple family-name words, with or without hyphens

single initial (for example, "John S.", with the last name abbreviated to protect a person's identity)

The list expands greatly as different languages, countries, cultures, and customs are taken into account.

In the simplest case, the structure of a grammar looks like a single tree. Some grammars may be made of multiple, unconnected trees.
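To make the idea of derived categories concrete, the following Python sketch models a tiny grammar as a mapping from each derived category to the alternative category sequences that can produce it. The category names (NAME, FAMILY_NAME, GIVEN, FNW, HYPHEN, INITIAL) and both functions are hypothetical, and the storage format of a real QKB grammar is not shown here.

```python
# Derived category -> list of alternative derivations (assumed names).
GRAMMAR = {
    "FAMILY_NAME": [
        ["FNW"],                   # single family-name word: SMITH
        ["FNW", "FNW"],            # two words: BONHAM CARTER
        ["FNW", "HYPHEN", "FNW"],  # hyphenated: JONES-SMITH
        ["INITIAL"],               # single initial: S.
    ],
    "NAME": [["GIVEN", "FAMILY_NAME"]],
}


def derives(category, tokens, grammar=GRAMMAR):
    """Return True if the basic-category sequence `tokens` derives `category`."""
    tokens = tuple(tokens)
    if category not in grammar:  # basic (non-derived) category
        return tokens == (category,)
    return any(_match(rule, tokens, grammar) for rule in grammar[category])


def _match(rule, tokens, grammar):
    """Try to cover `tokens` with the categories in `rule`, left to right."""
    if not rule:
        return not tokens
    head, rest = rule[0], rule[1:]
    # Each remaining category needs at least one token, which bounds the split.
    for i in range(1, len(tokens) - len(rest) + 1):
        if derives(head, tokens[:i], grammar) and _match(rest, tokens[i:], grammar):
            return True
    return False


hyphenated = derives("NAME", ["GIVEN", "FNW", "HYPHEN", "FNW"])  # Mary JONES-SMITH
two_given = derives("NAME", ["GIVEN", "GIVEN"])
```

The root of this toy grammar is NAME. A second, unconnected entry (for example, one describing addresses) would make it a multiple-tree grammar.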

Root Category

This is the category at the root of the desired sub-tree of the category hierarchy, as defined by the grammar. It may be the true root of a single-tree grammar, the root of one tree in a multiple-tree grammar, or any other category, depending on the intent.

Optimization Parameters

Parse resource limit - The maximum amount of computational resources that the parser should use. If the parser requires more resources for a particular input than are allowed by this parameter, the parser will abandon the attempt to parse that string. Choosing a high resource limit allows the parser to use more resources when parsing an input string, meaning there is less chance that parses will be abandoned for complex strings. However, a high resource limit also means that jobs may take longer to execute.

Solution tree depth - The maximum depth within a solution tree to which the parser will search when computing scores to compare solutions for a single input. Choosing a higher value allows the parser to search deeper into solution trees, meaning that rules at lower levels of a tree will impact the score for that tree.

Input length - Optimizes processing for different lengths of input string. A different method of generating solution trees is used for short strings than for long strings. If Auto is selected, the parser automatically determines which method to use. Auto is the recommended value for this parameter.
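The trade-off behind the parse resource limit can be illustrated with a toy search that counts its own steps and gives up once a budget is exceeded. This is a sketch under stated assumptions: the guide does not say how resources are measured, so the code assumes one unit per recursive call, and the names count_groupings and ParseAbandoned are hypothetical.

```python
class ParseAbandoned(Exception):
    """Raised when the toy parser exceeds its step budget."""


def count_groupings(words, resource_limit):
    """Count ways to group `words` into one- or two-word phrases.

    Returns ("completed", count), or ("Parse abandoned", None) when the
    search needs more steps than `resource_limit` allows.
    """
    steps = 0

    def explore(i):
        nonlocal steps
        steps += 1  # assumed cost model: one unit per call
        if steps > resource_limit:
            raise ParseAbandoned
        if i == len(words):
            return 1
        total = explore(i + 1)          # next phrase is one word
        if i + 2 <= len(words):
            total += explore(i + 2)     # next phrase is two words
        return total

    try:
        return "completed", explore(0)
    except ParseAbandoned:
        return "Parse abandoned", None


words = ["Helena", "Bonham", "Carter", "Smith"]
generous = count_groupings(words, resource_limit=100)
tight = count_groupings(words, resource_limit=5)
```

With the larger budget the search finishes. With the smaller one it is abandoned partway through, which is the behavior the parameter description attributes to complex input strings.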

Note: Changing parse optimization parameters is an advanced activity. It is recommended that you not change the values of these parameters in QKB definitions provided by SAS. If you wish to learn more about the effects of changing these parameters, you might want to attend a SAS QKB Customization training course. For information about training courses available from SAS, contact SAS Technical Support.

Identification Analysis and Locale Guess Definitions only

Sought category

This is the category of interest within the sub-tree whose root is the root category.

Stop searching all pattern logic nodes if a match occurs in this node

If this check box is selected, processing of patterns stops after this node if a pattern match is found. If there is a pattern match on one or more of the chopped words in the node when this check box is selected, then no subsequent pattern logic nodes are used.

Search for patterns in

Substrings - The pattern is matched against all input substrings
Full phrase - The pattern is matched only against the entire input string
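The difference between the two search modes can be sketched as the set of candidate spans each one produces. The sketch assumes that "substrings" means contiguous runs of chopped words; the function name and mode strings are hypothetical.

```python
def candidate_spans(words, mode):
    """Return the word spans a pattern would be tested against."""
    if mode == "full_phrase":
        return [tuple(words)]
    if mode == "substrings":
        # Every contiguous run of one or more words (assumed interpretation).
        return [
            tuple(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)
        ]
    raise ValueError(f"unknown mode: {mode}")


words = ["Mary", "Jones", "Smith"]
full = candidate_spans(words, "full_phrase")
subs = candidate_spans(words, "substrings")
```

For three words, the full-phrase mode yields one candidate and the substring mode yields six, so substring matching examines more of the input at the cost of more work.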

Identification Analysis Definition only

Identity

The identity to be assigned if the pattern matches.

Pattern weight

This number weights the final calculation in the pattern analysis.

Locale Guess Definition only

Likelihood

The likelihood of the input being in the definition's locale, if the pattern matches. Setting this property to "Never" means that an input that matches this pattern could never belong to the definition's locale.

Extraction Definition only

Category token mappings

This table displays an ordered list of categories that are sought to match with input substrings, using the specified grammar. The best category match in the list causes the specified token to be applied to the input string. The table can therefore include multiple rows for the same category, each with a different token value. In the table you can reposition rows vertically for easier reading by clicking the up and down arrows. To add and remove rows, use the plus and minus arrows.

Pattern weight

This number weights the final calculation in the pattern analysis.

Maximum matches per pattern

The maximum number of matches that will be processed for this pattern.

Search for patterns in

Substrings - The pattern is matched against all input substrings
Full phrase - The pattern is matched only against the entire input string

Output

Message

One of the following:

Changes applied - if the pattern was found and generated some solutions
Substring matched basic category - if no solutions were found
Parse abandoned - if computational resources were insufficient to complete the parsing operation (Parse Definition only)

Solution trees

If any solutions were found for a test string, a number of solution trees appear in the output pane. Depending on the definition, they will be organized and displayed in different ways.

each pattern up to and including the pattern of the currently selected node can generate its own set of solutions
solution display is cumulative (that is, solutions found by previous patterns are displayed along with solutions for the current node, if any)
for each pattern, the solutions are sorted from left to right in decreasing score order
the number of solutions shown can be configured under Tools > Options > Display

Solution tree structure

--+ ROOT CATEGORY (Likelihood)
  |
  |--o string
  |
  |--o SUBCATEGORY1 SUBCATEGORY2...

The top of the tree shows the root category and the likelihood associated with it. The first bullet (blue) shows the string. The second bullet (yellow) shows the subcategories that form the root category when combined according to the rules of the grammar. Following this are similar sub-trees for each of the subcategories. The structure then recurses into each category until only basic (non-derived) categories are part of the tree, or until the maximum solution tree depth specified by the node is reached.

If there are multiple solutions generated for a pattern, the one with the highest score is used to determine the final result.
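The interaction between solution tree depth and solution scoring can be sketched as follows. The scoring formula used by the parser is not documented here, so the code assumes, purely for illustration, that each rule contributes an additive score and that rules below the depth limit contribute nothing. The Solution class, its fields, and the score values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Solution:
    category: str
    rule_score: float = 0.0  # assumed score of the rule that derived this node
    children: list = field(default_factory=list)


def tree_score(node, max_depth, depth=1):
    """Sum rule scores down to `max_depth`; deeper rules are ignored."""
    if depth > max_depth:
        return 0.0
    return node.rule_score + sum(
        tree_score(child, max_depth, depth + 1) for child in node.children
    )


def best_solution(solutions, max_depth):
    """Rank solutions in decreasing score order and return the first."""
    ranked = sorted(solutions, key=lambda s: tree_score(s, max_depth), reverse=True)
    return ranked[0] if ranked else None


a = Solution("NAME", 1.0, [Solution("FAMILY_NAME", 0.5, [Solution("FNW", 0.25)])])
b = Solution("NAME", 1.0, [Solution("FAMILY_NAME", 0.75, [Solution("INITIAL", -0.5)])])

shallow_winner = best_solution([a, b], max_depth=2)
deep_winner = best_solution([a, b], max_depth=3)
```

With a depth of 2 the third-level rules are invisible and solution b scores higher. With a depth of 3 the lower-level rule changes the outcome in favor of solution a, which is the effect the Solution tree depth parameter describes.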

Parse Definition

The Parse Definition has only one pattern. Therefore, at most one set of solutions appears.

Identification Analysis Definition

Each pattern that generated solutions (up to and including the currently selected Node) has a row of solution trees. For viewing convenience, some information about the pattern is shown to the left of its solution tree row.

Extraction Definition

The Extraction Definition processes the input string in parallel across all Pattern Logic Nodes. If a node detects a word that matches its category, that word is extracted from the string. A second iteration then examines the new, shorter, substring. Iterations continue until all words have been extracted. In the Testing area, the left side of the output pane displays a substring tree. Each node in the tree represents an iteration. Iterations that expand will display the words that were extracted in that iteration. The same word or substring may appear more than once in the tree if multiple-token extraction is selected in the Extraction Definition Head Node.

Click a word to display its solution tree on the right side of the output pane. Included in the solution tree is the number of the Pattern Logic Node that generated the solution. Note that the test output for the Pattern Logic Summary Node combines the output from all Pattern Logic Nodes.
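The iteration described for the Extraction Definition can be approximated with the following Python sketch. It is not the product's algorithm. It assumes that, in each iteration, every node inspects the same remaining words and extracts at most its first match, and it adds a guard that stops when no node matches anything. Each node is reduced to a hypothetical set of words it would match.

```python
def extract_iteratively(words, nodes):
    """Repeatedly extract words matched by any node.

    `nodes` maps a node number to the set of words it matches (a stand-in
    for real category matching). Returns (iterations, leftover words), where
    each iteration is a list of (word, node number) pairs.
    """
    remaining = list(words)
    iterations = []
    while remaining:
        extracted = []
        for number, vocabulary in nodes.items():
            # Assumption: each node takes at most its first match per iteration.
            match = next((w for w in remaining if w in vocabulary), None)
            if match is not None and all(match != w for w, _ in extracted):
                extracted.append((match, number))
        if not extracted:
            break  # guard added for the sketch: nothing left to extract
        iterations.append(extracted)
        taken = {w for w, _ in extracted}
        remaining = [w for w in remaining if w not in taken]
    return iterations, remaining


nodes = {1: {"red", "blue"}, 2: {"large"}}
iterations, leftover = extract_iteratively(["large", "red", "blue", "shirt"], nodes)
```

Each entry in iterations corresponds to one node of the substring tree, and the node number stored with each word mirrors the Pattern Logic Node number shown in the solution tree.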

Locale Guess Definition

The left output pane in the Test: area contains a tree representing the pattern logic node that was selected. The tree contains a node for each substring of the test value that was extracted by a pattern. Next to the substring is the number of solutions found. When you click on the top node of the tree, the right output pane contains a summary of the confidence values. The summary includes the confidence values for the selected pattern logic node and all pattern logic nodes that appear before the selected one in the Flow diagram. If you click on a substring, the solution trees for that substring are displayed in the right pane. The absence of nodes in the pattern logic tree means that no patterns were found.

Note: Locale Guess definitions created using older versions of DataFlux Data Management Studio or DataFlux dfPower Studio do not have a grammar selected in the Pattern Logic node. For these definitions, a pattern can still be found due to a category match. When a pattern is found for these definitions, the number of solutions found will be zero. Clicking on a substring displays the message "Substring matched basic category" in the right output pane.


Doc ID: DMCust_12327.html