Functions and CALL Routines
Category: Character
Restriction: I18N Level 0
Syntax
COMPGED(string-1, string-2 <,cutoff> <,modifiers>)
cutoff is a numeric constant, variable, or expression. If the actual generalized edit distance is greater than the value of cutoff, the value that is returned is equal to the value of cutoff.
Tip: Using a small value of cutoff improves the efficiency of COMPGED if the values of string-1 and string-2 are long.
modifiers specifies a character string that can modify the action of the COMPGED function. You can use one or more of the following characters as a valid modifier:
Tip: COMPGED ignores blanks that are used as modifiers.
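The two optional arguments can be tried out in a short DATA step. The following is a minimal sketch, not a definitive usage pattern: the data set name arg_demo, the variable names, the cutoff values 150 and 1000, and the test strings are all chosen for illustration. It relies on the L modifier removing leading blanks, as described under Details.

```sas
data arg_demo;
   /* Same length for both strings, mirroring the fixed-width input
      that is used in the Examples section. */
   length s1 s2 $12;
   s2 = 'baboon';

   /* cutoff: the returned value never exceeds the cutoff. */
   s1 = 'axoon';
   full   = compged(s1, s2);         /* no cutoff */
   capped = compged(s1, s2, 150);    /* 150 is an arbitrary illustrative cutoff */

   /* modifiers: 'L' removes leading blanks before the comparison. */
   s1 = '   baboon';                 /* three leading blanks */
   raw      = compged(s1, s2);
   stripped = compged(s1, s2, 1000, 'L');  /* 1000 is an arbitrary large cutoff */

   put full= capped= raw= stripped=;
run;
```

The Examples section lists a distance of 300 for axoon and baboon, so capped would be expected to come back as 150. Because the L modifier removes the leading blanks, stripped would be expected to be smaller than raw.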
Details
The order in which the modifiers appear in the COMPGED function is relevant.
"LN" first removes leading blanks from each string and then removes quotation marks from n-literals.
"NL" first removes quotation marks from n-literals and then removes leading blanks from each string.
Generalized edit distance is a generalization of Levenshtein edit distance, which is a measure of dissimilarity between two strings. The Levenshtein edit distance is the number of deletions, insertions, or replacements of single characters that are required to transform string-1 into string-2.
The COMPGED function returns the generalized edit distance between string-1 and string-2. The generalized edit distance is the total cost of the minimum-cost sequence of operations for constructing string-1 from string-2.
The algorithm for computing the sum of the costs involves a pointer that points to a character in string-2 (the input string). An output string is constructed by a sequence of operations that might advance the pointer, add one or more characters to the output string, or both. Initially, the pointer points to the first character in the input string, and the output string is empty.
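The operations and their default costs, as they appear in the example output in the Examples section, are listed in the following table.

Operation      Default Cost
APPEND               50
BLANK                10
DELETE              100
DOUBLE               20
FDELETE             200
FINSERT             200
FREPLACE            200
INSERT              100
MATCH                 0
PUNCTUATION          30
REPLACE             100
SINGLE               20
SWAP                 20
TRUNCATE             10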
To set the cost of the string operations, you can use the CALL COMPCOST routine or use default costs. If you use the default costs, the values that are returned by COMPGED are approximately 100 times greater than the values that are returned by COMPLEV.
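The relationship between the two functions under default costs can be checked directly. The following is a minimal sketch; the data set name compare and the three test strings (borrowed from the Examples section) are chosen for illustration only.

```sas
data compare;
   /* Same length for both strings, mirroring the Examples section. */
   length string1 string2 $8;
   string2 = 'baboon';

   /* One insertion, one deletion, and one replacement relative to string2. */
   do string1 = 'baXboon', 'baoon', 'baXoon';
      lev = complev(string1, string2);   /* Levenshtein edit distance */
      ged = compged(string1, string2);   /* generalized edit distance, default costs */
      put string1= lev= ged=;
   end;
run;
```

Each of these strings differs from baboon by a single edit, and the Examples section lists a generalized edit distance of 100 for each, which is consistent with the factor of roughly 100 described above.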
The rationale for determining the generalized edit distance is based on the number and types of typographical errors that can occur. COMPGED assigns a cost to each error and determines the minimum sum of these costs that could be incurred. Some types of errors can be more serious than others. For example, inserting an extra letter at the beginning of a string might be more serious than omitting a letter from the end of a string. For another example, if you type a word or phrase that exists in string-2 and introduce a typographical error, you might produce string-1 instead of string-2.
Generalized edit distance is not necessarily symmetric. That is, the value that is returned by COMPGED(string1, string2) is not always equal to the value that is returned by COMPGED(string2, string1). To make the generalized edit distance symmetric, use the CALL COMPCOST routine to assign equal costs to the operations within each of the following pairs: INSERT and DELETE; FINSERT and FDELETE; APPEND and TRUNCATE; DOUBLE and SINGLE.
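As one illustration, APPEND and TRUNCATE are inverse operations whose default costs differ (50 versus 10 in the example output), so swapping the arguments can change the result. The following sketch compares both directions and then equalizes that pair. The data set and variable names are hypothetical, the cost of 50 is arbitrary, and the operation strings 'append=' and 'truncate=' are assumed to match the operation names that CALL COMPCOST accepts; check the CALL COMPCOST documentation before relying on them.

```sas
data symmetry_demo;
   length a b $8;
   a = 'baboonX';
   b = 'baboon';

   /* Default costs: the two directions need not agree. */
   forward  = compged(a, b);
   backward = compged(b, a);
   put forward= backward=;

   /* Assumption: assign APPEND and TRUNCATE the same cost
      (50 is an arbitrary choice). */
   call compcost('append=', 50, 'truncate=', 50);
   forward_eq  = compged(a, b);
   backward_eq = compged(b, a);
   put forward_eq= backward_eq=;
run;
```

With the costs equalized, the two directions would be expected to agree for this pair of strings.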
Comparisons
You can compute the Levenshtein edit distance by using the COMPLEV function. You can compute the generalized edit distance by using the CALL COMPCOST routine and the COMPGED function. Computing generalized edit distance requires considerably more computer time than does computing Levenshtein edit distance. However, generalized edit distance usually provides a more useful measure than Levenshtein edit distance for applications such as fuzzy file merging and text mining.
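A fuzzy merge of the kind mentioned above might be sketched with PROC SQL as follows. This is only an illustration under stated assumptions: the tables customers and leads, their name columns, the cutoff of 200, and the acceptance threshold of 100 are all hypothetical and would need tuning for real data.

```sas
/* Hypothetical input tables: customers(name) and leads(name). */
proc sql;
   create table fuzzy_matches as
   select c.name as customer_name,
          ld.name as lead_name,
          /* The cutoff of 200 caps the returned value for clearly
             dissimilar pairs. */
          compged(c.name, ld.name, 200) as ged
   from customers as c, leads as ld
   /* Keep only pairs within an illustrative threshold of 100. */
   where calculated ged <= 100;
quit;
```

Because every row of one table is compared with every row of the other, a join like this can become expensive on large tables; the cutoff argument is one way to limit the cost of each comparison.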
Examples
The following example uses the default costs to calculate the generalized edit distance.
options nodate pageno=1 linesize=70 pagesize=60;

data test;
   infile datalines missover;
   input String1 $char8. +1 String2 $char8. +1 Operation $40.;
   GED=compged(string1, string2);
   datalines;
baboon   baboon   match
baXboon  baboon   insert
baoon    baboon   delete
baXoon   baboon   replace
baboonX  baboon   append
baboo    baboon   truncate
babboon  baboon   double
babon    baboon   single
baobon   baboon   swap
bab oon  baboon   blank
bab,oon  baboon   punctuation
bXaoon   baboon   insert+delete
bXaYoon  baboon   insert+replace
bXoon    baboon   delete+replace
Xbaboon  baboon   finsert
aboon    baboon   trick question: swap+delete
Xaboon   baboon   freplace
axoon    baboon   fdelete+replace
axoo     baboon   fdelete+replace+truncate
axon     baboon   fdelete+replace+single
baby     baboon   replace+truncate*2
balloon  baboon   replace+insert
;

proc print data=test label;
   label GED='Generalized Edit Distance';
   var String1 String2 GED Operation;
run;
The following output shows the results.
Generalized Edit Distance Based on Operation
                         The SAS System                         1

                     Generalized
                         Edit
Obs  String1  String2  Distance   Operation

  1  baboon   baboon          0   match
  2  baXboon  baboon        100   insert
  3  baoon    baboon        100   delete
  4  baXoon   baboon        100   replace
  5  baboonX  baboon         50   append
  6  baboo    baboon         10   truncate
  7  babboon  baboon         20   double
  8  babon    baboon         20   single
  9  baobon   baboon         20   swap
 10  bab oon  baboon         10   blank
 11  bab,oon  baboon         30   punctuation
 12  bXaoon   baboon        200   insert+delete
 13  bXaYoon  baboon        200   insert+replace
 14  bXoon    baboon        200   delete+replace
 15  Xbaboon  baboon        200   finsert
 16  aboon    baboon        200   trick question: swap+delete
 17  Xaboon   baboon        200   freplace
 18  axoon    baboon        300   fdelete+replace
 19  axoo     baboon        310   fdelete+replace+truncate
 20  axon     baboon        320   fdelete+replace+single
 21  baby     baboon        120   replace+truncate*2
 22  balloon  baboon        200   replace+insert
Copyright © 2011 by SAS Institute Inc., Cary, NC, USA. All rights reserved.