What’s New in SAS Contextual Extraction Studio 5.2

Overview

New and enhanced features in SAS Contextual Extraction Studio include the following:
  • Added coreference operators facilitate rule-writing precision.
  • XML fields can be specified for matches.
  • Additional operators enable greater rule matching precision.
  • Case-insensitive matching and comments in rules are now enabled.

Coreference Operators Added

Coreference refers to pronoun resolution. A pronoun is matched to the antecedent that it refers to when you use these operators in your contextual extraction concept rules:
  • Use the coreference operator (_ref) to link a matched string with its canonical form.
  • Use _coref with CLASSIFIER definitions.
  • Use the forward (_F) and the preceding (_P) symbols to restrict coreference matches.
  • Assign a new concept name for a match on a term specified by the _ref operator.

XML Field Specified for Matching

To limit matches to specific XML fields, name those fields in your rules and apply the rules to input XML documents.
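
The following Python sketch illustrates the idea of field-restricted matching only. It is not SAS rule syntax, and the function name match_in_fields and the sample tags (title, body) are hypothetical. It assumes that a match counts only when the term appears in the text of one of the named fields.

```python
import xml.etree.ElementTree as ET

def match_in_fields(xml_text, term, fields):
    """Return the tags of the named fields whose text contains term.

    Hypothetical helper: matching is a case-sensitive substring test.
    """
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        if elem.tag in fields:
            # Join the field's own text with the text of any nested elements.
            text = "".join(elem.itertext())
            if term in text:
                hits.append(elem.tag)
    return hits

doc = "<doc><title>Merger announced</title><body>The merger closes in May.</body></doc>"
print(match_in_fields(doc, "merger", {"body"}))
```

Restricting the search to title returns no match for the lowercase term, because this sketch compares case-sensitively and that field contains only the capitalized form.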

Additional Operators for Precision

Additional operators enable greater rule matching precision. Use them as follows:
  • Specify a stemming symbol to enable SAS Contextual Extraction Studio to match all forms of a word, or only its noun forms or verb forms.
  • Specify the paragraph symbol (PARA) to restrict a match to terms that occur within the same paragraph.
  • Write a SENT_n operator into a rule to specify the maximum number of sentences where a match can occur.
  • Use a SENTSTART_n operator to specify the number of words at the beginning of a sentence where a match can occur.
  • Use a SENTEND_n operator to specify the number of words at the end of a sentence where a match can occur.
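
The sentence-based operators can be pictured with a short Python sketch. This is an illustration of the windowing idea, not SAS rule syntax and not the product's implementation. It assumes a naive sentence splitter, whitespace-delimited words, case-sensitive comparison, and, for SENT_n, that both terms must fall within n consecutive sentences. All function names are hypothetical.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def words(sentence):
    """Word tokens with punctuation dropped; case is preserved."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

def in_sentence_start(text, term, n):
    """SENTSTART_n idea: term is among the first n words of some sentence."""
    return any(term in words(s)[:n] for s in split_sentences(text))

def in_sentence_end(text, term, n):
    """SENTEND_n idea: term is among the last n words of some sentence."""
    if n <= 0:
        return False
    return any(term in words(s)[-n:] for s in split_sentences(text))

def within_sentences(text, term_a, term_b, n):
    """SENT_n idea (assumed): both terms occur within n consecutive sentences."""
    sents = [words(s) for s in split_sentences(text)]
    for i in range(len(sents)):
        window = [w for s in sents[i:i + n] for w in s]
        if term_a in window and term_b in window:
            return True
    return False

text = ("Revenue rose sharply in March. The board approved a dividend. "
        "Analysts expect more growth.")
print(in_sentence_start(text, "Revenue", 2))
print(within_sentences(text, "Revenue", "dividend", 2))
```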

Case-Insensitive Matching and Comments

Case-insensitive matching occurs when you select the Case Insensitive Matching check box in the Data tab for a contextual extraction concept. (By default, all matching is case sensitive.)
You can also add comments to your rules using the pound character (#).
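
As a rough illustration of both behaviors, the Python sketch below strips a trailing # comment from a line and compares a term with and without case sensitivity. It is not SAS rule syntax. The sample lines are plain terms, the function names are hypothetical, and the sketch assumes that # never appears inside the rule text itself, which the product may handle differently.

```python
def strip_comment(line):
    """Drop everything from the first '#' onward (assumes no '#' in the rule text)."""
    return line.split("#", 1)[0].rstrip()

def term_matches(term, text, case_insensitive=False):
    """Substring match; case sensitive unless case_insensitive is set."""
    if case_insensitive:
        return term.casefold() in text.casefold()
    return term in text

lines = [
    "# deal vocabulary",
    "acquisition  # also covers takeovers",
    "Merger",
]
# Keep only the lines that still have content after comments are removed.
terms = [t for t in (strip_comment(l) for l in lines) if t]
print(terms)
print(term_matches("Merger", "the merger closed", case_insensitive=True))
```

The default in term_matches mirrors the documented default: matching stays case sensitive unless the option is turned on.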