Customize - Chop Table

DataFlux Data Management Studio 2.5: User Guide

Before using the Customize Chop Table Editor, you should define all of the chopping characteristics for each value in the default character table.

A chop table is a collection of character-level rules used to create an ordered word list from an input string. Each character in the default character table has both a classification and an operation specified. You should build one chop table per parse definition. The Chop Table Editor allows you to build a chop table.

Table

Unicode Block - The Unicode Block drop-down list provides a list of character subsets. See Unicode Block for the complete list.

Character Name - The Character Name is the actual name of the character, for example, semicolon.

Character - The Character represents the actual appearance of the character. For example, if the Character Name is semicolon, the Character is ;.

Classification - The Classification drop-down list includes:

LETTER/SYMBOL - a letter or non-separating symbol
NUMBER - a numeric digit (0-9)
LEAD SEPARATOR - a delimiter attached to the beginning of a word (for example, the left parenthesis)
TRAIL SEPARATOR - a delimiter attached at the end of a word (for example, a period)
FULL SEPARATOR - a delimiting character (for example, space, dash, and comma)

Operation - The Operation drop-down list includes:

USE - use the character as-is in the word list and output tokens
TRIM - omit from word list; trim leading/trailing characters in output tokens
SUPPRESS - omit from the word list and output tokens
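The interaction between classifications and operations can be illustrated with a small Python sketch. It is not the product's implementation: the table entries, the names Cls, Op, CHOP_TABLE, and chop, and the exact handling of each separator type are assumptions made for illustration. In particular, the sketch builds only the word list, so TRIM and SUPPRESS behave the same here (both omit the character), and a FULL SEPARATOR marked USE is assumed to become a word of its own.

```python
from enum import Enum

class Cls(Enum):
    LETTER_SYMBOL = "LETTER/SYMBOL"
    NUMBER = "NUMBER"
    LEAD_SEPARATOR = "LEAD SEPARATOR"
    TRAIL_SEPARATOR = "TRAIL SEPARATOR"
    FULL_SEPARATOR = "FULL SEPARATOR"

class Op(Enum):
    USE = "USE"
    TRIM = "TRIM"
    SUPPRESS = "SUPPRESS"

# Hypothetical chop table entries; a real chop table covers every character.
CHOP_TABLE = {
    " ": (Cls.FULL_SEPARATOR, Op.SUPPRESS),
    ",": (Cls.FULL_SEPARATOR, Op.TRIM),
    "(": (Cls.LEAD_SEPARATOR, Op.USE),
    ")": (Cls.TRAIL_SEPARATOR, Op.USE),
    ".": (Cls.TRAIL_SEPARATOR, Op.USE),
}

def lookup(ch):
    """Return (classification, operation) for a character, with simple defaults."""
    if ch in CHOP_TABLE:
        return CHOP_TABLE[ch]
    if ch in "0123456789":
        return (Cls.NUMBER, Op.USE)
    return (Cls.LETTER_SYMBOL, Op.USE)

def chop(text):
    """Build an ordered word list from text using character-level rules."""
    words = []
    current = ""

    def flush():
        nonlocal current
        if current:
            words.append(current)
        current = ""

    for ch in text:
        cls, op = lookup(ch)
        keep = op is Op.USE  # TRIM and SUPPRESS both omit from the word list
        if cls is Cls.FULL_SEPARATOR:
            flush()                      # ends the current word
            if keep:
                words.append(ch)         # assumed: separator becomes its own word
        elif cls is Cls.LEAD_SEPARATOR:
            flush()                      # starts a new word
            if keep:
                current = ch
        elif cls is Cls.TRAIL_SEPARATOR:
            if keep:
                current += ch
            flush()                      # ends the word it is attached to
        elif keep:
            current += ch                # letters, symbols, and digits

    flush()
    return words

print(chop("Mr. Smith (Jr), 42"))  # ['Mr.', 'Smith', '(Jr)', '42']
```

In this sketch the period and parentheses stay attached to their words, while the space and comma split words without appearing in the list.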

Value - This is the ordinal value of the Unicode code point.

Hex Value - The Value (above) expressed in hexadecimal.
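As a quick check of how the Character Name, Character, Value, and Hex Value columns relate, the following Python sketch derives them for one character using the standard library. The function name describe and the four-digit, uppercase hex formatting are assumptions; the editor may format or capitalize these values differently.

```python
import unicodedata

def describe(ch):
    """Return the name, ordinal value, and hex value of one character."""
    value = ord(ch)  # ordinal value of the Unicode code point
    return {
        "character_name": unicodedata.name(ch, "<unnamed>"),
        "character": ch,
        "value": value,
        "hex_value": format(value, "04X"),  # assumed display format
    }

print(describe(";"))
# {'character_name': 'SEMICOLON', 'character': ';', 'value': 59, 'hex_value': '003B'}
```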

Rules

The rules-based chopping algorithm works by matching portions of an input string from left to right using search criteria and states. These are specified by rules, which you define on the Rules tab. The rules are processed from top to bottom, and the search criterion for each rule is one of the following:

A vocabulary of words
A single regular expression

The system first uses the search criterion of the first rule to match the input string at the current position. If the match fails, it proceeds to the next rule and attempts to match using the criterion for that rule, and so on. If no rules match, the input string position is advanced by one character and the algorithm is run again. This process is repeated until the end of the input string is reached.

If the system is able to successfully match a substring, it then attempts to validate the state. At any point in time, the system maintains a state of zero or more flags, which are just variables that either exist or not. Each rule has the ability to check for the existence of flags in the current state using a simple Boolean syntax. This is called the Prerequisite State Condition. If this validation fails, the rule fails to match and the next rule is checked, and so on. If it succeeds, then the following occurs:

The input string is advanced to the end of the successful match in the Search criterion.
A new state (Output State) consisting of zero or more flags is set. The new state replaces the old state.
The string is chopped before the current match, after the current match, at both points, or not at all.
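The matching loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's code: only regular-expression rules are modeled (no vocabularies), the Prerequisite State Condition is represented by a plain Python function over the set of current flags, the state is assumed to stay unchanged when no rule matches, and empty matches are skipped so the loop always advances. The names Rule, chop_with_rules, and RULES are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass
class Rule:
    pattern: str                                   # search criterion (regex only here)
    prerequisite: Callable[[FrozenSet[str]], bool]  # check on the current flags
    output_state: FrozenSet[str]                    # flags that replace the old state
    chop_mode: str                                  # "None", "Before", "After", "Both"

def chop_with_rules(text, rules, initial_flags=frozenset()):
    state = frozenset(initial_flags)
    cuts = set()
    pos = 0
    while pos < len(text):
        for rule in rules:                          # rules are tried top to bottom
            m = re.compile(rule.pattern).match(text, pos)
            if m is None or m.end() == pos:
                continue                            # no match (empty matches skipped)
            if not rule.prerequisite(state):
                continue                            # state validation failed
            if rule.chop_mode in ("Before", "Both"):
                cuts.add(m.start())
            if rule.chop_mode in ("After", "Both"):
                cuts.add(m.end())
            state = rule.output_state               # new state replaces the old one
            pos = m.end()                           # advance past the match
            break
        else:
            pos += 1                                # no rule matched: advance one character
    bounds = [0] + sorted(c for c in cuts if 0 < c < len(text)) + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]

# Hypothetical rules: chop around a hyphen only when it follows a number.
RULES = [
    Rule(r"[0-9]+", lambda flags: True, frozenset({"NUM"}), "None"),
    Rule(r"-", lambda flags: "NUM" in flags, frozenset(), "Both"),
]

print(chop_with_rules("A-1-2", RULES))  # ['A-1', '-', '2']
```

The first hyphen is left alone because the NUM flag is not yet set; the second one is chopped on both sides because the preceding rule set NUM. The first rule uses the None chop mode purely to change the state.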

The Initial Flags part of the Rules tab sets up an initial state before any rule processing is performed. This is useful for identifying a pre-existing state in order to force certain rules to match first.

The Initial Flags list has two controls. Click the Add icon to open the Add New Flag dialog, where you enter the name of the new initial flag. Each new flag is added to the list in alphabetical order. To delete an existing flag, select the flag and click the Delete icon.

Rules

This section displays each rule and its associated components, in order.

Method - The Method is either Vocab or Regex, depending on the type of criterion for the rule. This determines the format for the Search Criterion column.

Search Criterion - This section displays the search criterion. The exact format differs, depending on the method selected for the rule. If the method is Regex, then this field displays a regular expression. If the method is Vocab, then this field displays the name of the vocabulary selected, a comma, and the name of the category in the vocabulary.

Prerequisite State Condition - This displays a Boolean expression describing the flag configuration that the current state must have in order for the rule to match. It consists of flag names separated by operators. The expression is parsed left to right, with precedence given to sub-expressions in parentheses:

Operator	Description
|	OR operator
&	AND operator
!	NOT operator
()	Parenthetical operators, for precedence
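A minimal evaluator for expressions of this kind might look like the Python sketch below. It follows one reading of the description above: the | and & operators have equal precedence and are applied strictly left to right, ! applies to the operand that follows it, and parentheses group sub-expressions. That reading, the allowed characters in flag names, the treatment of an empty expression as true, and the names tokenize and evaluate are all assumptions, so verify the behavior in the editor before relying on it.

```python
import re

# Assumed token shapes: flag names made of letters, digits, and underscores.
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|[|&!()])")

def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def evaluate(expr, flags):
    """Return True if the set of flags satisfies the expression."""
    tokens = tokenize(expr)
    if not tokens:
        return True  # assumed: an empty condition always passes
    i = 0

    def operand():
        nonlocal i
        if i >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[i]
        i += 1
        if tok == "!":
            return not operand()
        if tok == "(":
            value = expression()
            if i >= len(tokens) or tokens[i] != ")":
                raise ValueError("missing closing parenthesis")
            i += 1
            return value
        if tok in ("|", "&", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok in flags  # a flag is true when it exists in the state

    def expression():
        nonlocal i
        value = operand()
        while i < len(tokens) and tokens[i] in ("|", "&"):
            op = tokens[i]
            i += 1
            rhs = operand()
            value = (value or rhs) if op == "|" else (value and rhs)
        return value

    result = expression()
    if i != len(tokens):
        raise ValueError(f"unexpected token {tokens[i]!r}")
    return result

print(evaluate("A & !B", {"A"}))       # True
print(evaluate("A | B & C", {"A"}))    # False: read as (A | B) & C
print(evaluate("A | (B & C)", {"A"}))  # True
```

The second and third calls show why parentheses matter under a strict left-to-right reading: without them, the trailing & C applies to the whole preceding result.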

Output State - This field displays a comma-separated list of flag names that become the current state should this rule be matched.

Chop Mode - This option tells the system what to do with the input string when the match is successful. The possibilities include:

Value	Description
None	No chop. This option is helpful if you want to change the state on a certain condition.
Before	Chop the string before the match.
After	Chop the string after the match.
Both	Chop the string before and after the match.

Notes - This is a comment field used to display informational messages about each rule.

Use the buttons on the right to create, edit, delete, and reorder rules:

Add new rule
Edit rule
Delete rule
Move rule up
Move rule down

When you click Add new rule, the Add Rule dialog opens.

Matching Method - In the Matching Method section, select Regular Expression or Vocabulary. When you select Regular Expression you can type the regex directly into the accompanying field. If you select Vocabulary, you must select one of the available vocabularies from the drop-down list. When you select a vocabulary, the Category drop-down list becomes active. All of the possible categories for the selected vocabulary appear. The All category is always present in this list. This category allows all words in the vocabulary to be used as possible matches.

Prerequisite State Condition - This is a Boolean text expression.

Output State - The Output State list can be populated with flag names just like Initial Flags.

Chopping Mode - The Chopping Mode drop-down list allows you to select one of the possible modes.

Notes - The Notes field allows you to add up to 128 characters of text. This is used for documentation purposes.

Testing

The test area is used to test input strings against the chopping configuration.

Input string

The Input string field accepts any form of text input.

Go - Click Go to run the string through the Chop Table Editor for a result. The result is displayed in the Result section.

Clear - Click Clear to clear both the Input string and Result fields.

Result

The Result section includes two columns, Phrase and Source.

Phrase - The Phrase column shows the chopped substring.

Source - The Source column shows whether the corresponding chopped phrase originated from the table or the rules by displaying Table or Rules. The first row in this table always shows None in the Source column because it represents the leading part of the input string, before any chop occurred. All subsequent rows have an explicit source specified.

Double-click any row in the Result table to see why the substring on that row was chopped.