What's New in SAS Text Miner 3.1
Overview
SAS Text Miner 3.1 includes the following new features and
enhancements:
- new supported languages
- new encoding support
- new noun group extraction support for all
languages
- new entity types
- enhanced %TMFILTER macro features
- UNIX support
- enhanced features for synonym list
processing
- enhanced parsing
- new parsing procedure: DOCPARSE procedure
- new DOCSCORE DATA step scoring function
- improved performance
- eliminated XCMD requirement
- new quick find functionality in the Terms table of the
Interactive Results window
New Supported Languages
The new supported languages are Japanese, Korean,
Norwegian Bokmal, Simplified Chinese, and Traditional Chinese.
New Encoding Support
SAS Text Miner 3.1 supports UTF-8, an encoding of the Unicode
standard, as well as the standard encodings for Chinese, Japanese, and
Korean.
New Noun Group Extraction Support
SAS Text Miner 3.1 supports noun group extraction for all
supported languages: English, Danish, Dutch, Finnish, French, German, Italian,
Japanese, Korean, Norwegian Bokmal, Portuguese, Simplified Chinese, Spanish,
Swedish, and Traditional Chinese.
New Entity Types
SAS Text Miner 3.1 supports these new entity types:
LANGUAGE (German and Spanish), PEOPLES (German and Spanish), PUBLICATION
(German), TICKER (English), and VEHICLE (English).
Enhanced %TMFILTER Macro Features
New Parameters
SAS Text Miner 3.1 includes the following new parameters:
- EXT=<extension list> accepts a list of document extensions that are
separated by spaces. When this option is used, only documents in the DIR
directory that match one of the specified extensions are processed.
- MANUAL= indicates that the CFS service, which is normally started by the
%TMFILTER macro, was started manually. This option is useful in situations in
which the socket connection mechanism is unable to function properly.
- NUMBYTES= replaces NUMCHARS= and controls the length of the text
field in bytes. The change was needed because the %TMFILTER macro now
supports encodings in which a character can be more than one byte long.
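The difference between a length in characters and a length in bytes can be illustrated with a short Python sketch. This is not %TMFILTER code; the function name truncate_to_bytes and the sample string are hypothetical and serve only to show why a byte-based limit matters for multibyte encodings such as UTF-8.

```python
def truncate_to_bytes(text: str, num_bytes: int, encoding: str = "utf-8") -> str:
    """Cut text so that its encoded form fits in num_bytes (hypothetical helper)."""
    encoded = text.encode(encoding)[:num_bytes]
    # The slice may end in the middle of a multibyte character;
    # errors="ignore" drops that incomplete trailing sequence.
    return encoded.decode(encoding, errors="ignore")


sample = "naïve 日本語"
print(len(sample))                   # 9 characters
print(len(sample.encode("utf-8")))   # 16 bytes
print(truncate_to_bytes(sample, 8))  # keeps "naïve " and drops the partial character
```

In this sample, a field that is sized for 9 characters would need 16 bytes in UTF-8, which is why a byte-based length is the unambiguous setting.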
New Document Formats
The %TMFILTER macro supports new document formats such as
Microsoft Outlook and Outlook Express e-mail files and files that are created
with the Open Office suite.
Automatic Transcoding of Documents
The encoding of each document is detected automatically, and each document
is transcoded automatically to the session encoding. If a document cannot
be transcoded correctly, unsupported characters are removed so
that the %TMFILTER macro can process the transcoded document.
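The effect of removing characters that the session encoding cannot represent can be sketched in Python as follows. This is an illustration of the general idea only, not the %TMFILTER implementation: the function name transcode is hypothetical, the source encoding is assumed to be known rather than detected, and Latin-1 stands in for the session encoding.

```python
def transcode(raw: bytes, source_encoding: str, session_encoding: str) -> str:
    """Re-encode raw bytes, dropping characters the target cannot hold (sketch)."""
    text = raw.decode(source_encoding, errors="replace")
    # errors="ignore" silently removes characters that have no
    # representation in the session encoding.
    kept = text.encode(session_encoding, errors="ignore")
    return kept.decode(session_encoding)


raw = "Café 東京 costs €5".encode("utf-8")
print(transcode(raw, "utf-8", "latin-1"))
```

Here the accented character survives because Latin-1 can represent it, while the Japanese characters and the euro sign are removed.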
Additional %TMFILTER Macro Output
The %TMFILTER macro provides additional output:
- The SAS data set that the %TMFILTER macro generates contains the following
new variables:
- Extension - the extension of the original document.
- Created - the date and time that the original document was created.
- Accessed - the date and time that the original document was last
accessed.
- Modified - the date and time that the original document was last
modified.
- Size - the size of the document.
- The %TMFILTER macro generates summary statistics from
PROC FREQ that contain the frequencies of the languages that have been
identified and a two-way table of the omitted and truncated
variables.
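As a rough picture of what these summary statistics contain, the following Python sketch computes the same two kinds of tables from a handful of made-up rows. The rows and the field names language, omitted, and truncated are assumptions for illustration; the actual output comes from PROC FREQ.

```python
from collections import Counter

# Hypothetical rows standing in for the data set that %TMFILTER generates.
docs = [
    {"language": "English", "omitted": 0, "truncated": 0},
    {"language": "English", "omitted": 0, "truncated": 1},
    {"language": "German", "omitted": 1, "truncated": 0},
    {"language": "English", "omitted": 0, "truncated": 0},
]

# One-way frequency table of the identified languages.
language_freq = Counter(d["language"] for d in docs)

# Two-way table: counts for each (omitted, truncated) combination.
two_way = Counter((d["omitted"], d["truncated"]) for d in docs)

for language, count in sorted(language_freq.items()):
    print(language, count)
for cell, count in sorted(two_way.items()):
    print(cell, count)
```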
UNIX Support
SAS Text Miner 3.1 is supported on Solaris and AIX, except for the %TMFILTER macro, which runs on Windows only.
Enhanced Synonym List Processing
In SAS Text Miner 3.1, you can define synonyms with
different parts of speech. The synonym data set can have variables for both the
parent role and child role. Previously, only one role was provided. You can
define your synonym data set with the following variables:
- TERM, PARENT, and CATEGORY
- TERM, PARENT, TERMROLE, and
PARENTROLE
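The second layout can be pictured as a lookup that is keyed by both the term and its role. The Python sketch below is only a conceptual model: the rows, the role values, and the resolve function are hypothetical, and the variable names TERM, PARENT, TERMROLE, and PARENTROLE are taken from the list above.

```python
# Hypothetical synonym rows; the same term maps to different parents by role.
synonyms = [
    {"TERM": "drive", "TERMROLE": "Verb", "PARENT": "operate", "PARENTROLE": "Verb"},
    {"TERM": "drive", "TERMROLE": "Noun", "PARENT": "disk", "PARENTROLE": "Noun"},
]

# Index the rows by (term, role) so that the part of speech selects the parent.
lookup = {
    (row["TERM"].lower(), row["TERMROLE"]): (row["PARENT"], row["PARENTROLE"])
    for row in synonyms
}


def resolve(term: str, role: str) -> tuple:
    """Return (parent, parent role), or the term unchanged if no synonym applies."""
    return lookup.get((term.lower(), role), (term, role))


print(resolve("drive", "Verb"))
print(resolve("drive", "Noun"))
print(resolve("drive", "Adj"))
```

With a single role variable, the two senses of the same term could not be mapped to different parents.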
Enhanced Parsing
In SAS Text Miner 3.1, parsing has been enhanced as follows:
- Parents are automatically created for stemmed terms even if they never
occur in the training data. For example, suppose the term employees
occurs but employee never occurs in your training data set. SAS Text
Miner now automatically creates the parent term employee, so that
employee is recognized as equivalent to employees if it
occurs during scoring.
- Noun group elements are stemmed when the stemming
option is used. For example, the noun group amount of defects is
changed to amount of defect because the parent of defects is
defect.
- Terms composed of both alphabetic characters and
numeric digits are no longer dropped when the Numbers property is set to No.
In previous versions, they were dropped under this condition.
- Longer, more complex noun groups are found and the
smaller subgroups that compose them are also used.
- The parsing is now based on version 3.7.2 of Inxight's
LinguistX Platform and ThingFinder.
- All entity entries on the synonym list are now mapped
based on a direct string comparison between the entry in the synonym list and
the lower-cased occurrence in the text. Previous versions made the synonym
assignment only if the occurrence in the text had first been detected as an
entity.
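The first two enhancements in this list can be modeled with a toy example. The Python sketch below is an assumption-laden illustration, not the parsing engine: the stem map is hard-coded, and the names parent_of, stem_noun_group, and build_term_table are hypothetical.

```python
# Toy stem map; a real parser derives stems linguistically.
STEMS = {"employees": "employee", "defects": "defect"}


def parent_of(term: str) -> str:
    """Return the parent (stem) of a term, or the term itself if it has none."""
    return STEMS.get(term, term)


def stem_noun_group(group: str) -> str:
    """Stem each element of a noun group."""
    return " ".join(parent_of(word) for word in group.split())


def build_term_table(training_terms):
    """Map each term to its parent and register the parent even if it was unseen."""
    table = {}
    for term in training_terms:
        parent = parent_of(term)
        table[term] = parent
        table.setdefault(parent, parent)
    return table


print(stem_noun_group("amount of defects"))
table = build_term_table(["employees"])
print(table)
```

Because the parent employee is registered during training, a scoring document that contains only employee resolves to the same entry as employees.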
New DOCPARSE Procedure
A new parsing procedure, PROC DOCPARSE, parses text documents and organizes
the terms and their frequencies into data sets. The DOCPARSE procedure is
portable to multiple platforms, and it does not require XCMD.
New DOCSCORE Function
The new DOCSCORE function is called inside DATA step code. It takes a textual
variable (or a reference to a document that contains text) along with
information from the training run and generates a compressed term-document
frequency data set called OUT.
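One way to picture a compressed term-document frequency data set is as a list of (term, document, count) rows that stores only the nonzero counts. The Python sketch below assumes that layout for illustration; it does not show the DOCSCORE function's actual arguments or output format, and the names score_document and term_ids are hypothetical.

```python
from collections import Counter

# Hypothetical term numbering carried over from a training run.
term_ids = {"engine": 1, "failure": 2, "oil": 3}


def score_document(doc_id: int, text: str, term_ids: dict) -> list:
    """Return sorted (term id, document id, count) rows for known terms only."""
    counts = Counter(tok for tok in text.lower().split() if tok in term_ids)
    return sorted((term_ids[tok], doc_id, n) for tok, n in counts.items())


rows = score_document(7, "Engine failure after oil leak and second engine failure", term_ids)
print(rows)
```

Terms that do not appear in the training information, such as leak in this sample, produce no row, so the table stays small even for a large vocabulary.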
Improved Performance
In SAS Text Miner 3.1, parsing speed is improved. Documents are processed
faster than in previous Text Miner releases.
Eliminated XCMD Requirement
Previous versions of SAS Text Miner required that parsing be done
using an XCMD call on the SAS server. In SAS Text Miner 3.1, this is no longer
necessary when running the Text Miner node. However, the %TMFILTER macro still
requires XCMD, so it must be run in a SAS session that
permits XCMD calls.
New Quick Find Functionality in the Terms Table of the Interactive Results Window
Quick find enables users to scroll quickly to a specific spot in a sorted
column of the Terms table by typing a single character while the column is
active. Quick find can be used in the Term, Freq, #Docs, Weight, Role, and
Attribute columns.
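The behavior can be thought of as a binary search over the sorted column for the first value at or after the typed character. The Python sketch below models that idea for a text column; it is not the Interactive Results window's implementation, and the function name quick_find and the sample terms are hypothetical.

```python
import bisect


def quick_find(sorted_column, typed_char):
    """Return the index of the row to scroll to, or None for an empty column."""
    if not sorted_column:
        return None
    # Assumes the column is sorted without regard to case.
    keys = [value.lower() for value in sorted_column]
    index = bisect.bisect_left(keys, typed_char.lower())
    # If every value sorts before the typed character, stay on the last row.
    return min(index, len(keys) - 1)


terms = ["apple", "banana", "bolt", "cable", "defect"]
print(quick_find(terms, "b"))  # 1, the first term that starts with b
print(quick_find(terms, "z"))  # 4, the last row
```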
Contains LinguistX® from Inxight Software, Inc. Copyright ©
1996-2006. All rights reserved. www.inxight.com.
Contains ThingFinder™ Server from Inxight
Software, Inc. Copyright © 1996-2006. All rights reserved. www.inxight.com.