Previous Page | Next Page

Functions and CALL Routines

COMPGED Function



Returns the generalized edit distance between two strings.
Category: Character
Restriction: I18N Level 0

Syntax
Arguments
Details
The Order in Which Modifiers Appear
Definition of Generalized Edit Distance
Computing the Generalized Edit Distance
Examples of Errors
Making the Generalized Edit Distance Symmetric
Comparisons
Examples
See Also

Syntax

COMPGED(string-1, string-2 <,cutoff> <,modifiers>)


Arguments

string-1

specifies a character constant, variable, or expression.

string-2

specifies a character constant, variable, or expression.

cutoff

is a numeric constant, variable, or expression. If the actual generalized edit distance is greater than the value of cutoff, the value that is returned is equal to the value of cutoff.

Tip: Using a small value of cutoff improves the efficiency of COMPGED if the values of string-1 and string-2 are long.
modifiers

specifies a character string that can modify the action of the COMPGED function. You can use one or more of the following characters as a valid modifier:

i or I

ignores the case in string-1 and string-2.

l or L

removes leading blanks in string-1 and string-2 before comparing the values.

n or N

removes quotation marks from any argument that is an n-literal and ignores the case of string-1 and string-2.

: (colon)

truncates the longer of string-1 or string-2 to the length of the shorter string, or to one, whichever is greater.

Tip: COMPGED ignores blanks that are used as modifiers.

Details


The Order in Which Modifiers Appear

The order in which the modifiers appear in the COMPGED function is relevant.


Definition of Generalized Edit Distance

Generalized edit distance is a generalization of Levenshtein edit distance, which is a measure of dissimilarity between two strings. The Levenshtein edit distance is the number of deletions, insertions, or replacements of single characters that are required to transform string-1 into string-2.


Computing the Generalized Edit Distance

The COMPGED function returns the generalized edit distance between string-1 and string-2. The generalized edit distance is the minimum-cost sequence of operations for constructing string-1 from string-2.

The algorithm for computing the sum of the costs involves a pointer that points to a character in string-2 (the input string). An output string is constructed by a sequence of operations that might advance the pointer, add one or more characters to the output string, or both. Initially, the pointer points to the first character in the input string, and the output string is empty.

The operations and their costs are described in the following table.

Operation Default Cost in Units Description of Operation
APPEND 50 When the output string is longer than the input string, add any one character to the end of the output string without moving the pointer.
BLANK 10 Do any of the following:
  • Add one space character to the end of the output string without moving the pointer.

  • When the character at the pointer is a space character, advance the pointer by one position without changing the output string.

  • When the character at the pointer is a space character, add one space character to the end of the output string, and advance the pointer by one position.

If the cost for BLANK is set to zero by the COMPCOST function, the COMPGED function removes all space characters from both strings before doing the comparison.

DELETE 100 Advance the pointer by one position without changing the output string.
DOUBLE 20 Add the character at the pointer to the end of the output string without moving the pointer.
FDELETE 200 When the output string is empty, advance the pointer by one position without changing the output string.
FINSERT 200 When the pointer is in position one, add any one character to the end of the output string without moving the pointer.
FREPLACE 200 When the pointer is in position one and the output string is empty, add any one character to the end of the output string, and advance the pointer by one position.
INSERT 100 Add any one character to the end of the output string without moving the pointer.
MATCH 0 Copy the character at the pointer from the input string to the end of the output string, and advance the pointer by one position.
PUNCTUATION 30 Do any of the following:
  • Add one punctuation character to the end of the output string without moving the pointer.

  • When the character at the pointer is a punctuation character, advance the pointer by one position without changing the output string.

  • When the character at the pointer is a punctuation character, add one punctuation character to the end of the output string, and advance the pointer by one position.

If the cost for PUNCTUATION is set to zero by the COMPCOST function, the COMPGED function removes all punctuation characters from both strings before doing the comparison.

REPLACE 100 Add any one character to the end of the output string, and advance the pointer by one position.
SINGLE 20 When the character at the pointer is the same as the character that follows in the input string, advance the pointer by one position without changing the output string.
SWAP 20 Copy the character that follows the pointer from the input string to the output string. Then copy the character at the pointer from the input string to the output string. Advance the pointer two positions.
TRUNCATE 10 When the output string is shorter than the input string, advance the pointer by one position without changing the output string.

To set the cost of the string operations, you can use the CALL COMPCOST routine or use default costs. If you use the default costs, the values that are returned by COMPGED are approximately 100 times greater than the values that are returned by COMPLEV.


Examples of Errors

The rationale for determining the generalized edit distance is based on the number and types of typographical errors that can occur. COMPGED assigns a cost to each error and determines the minimum sum of these costs that could be incurred. Some types of errors can be more serious than others. For example, inserting an extra letter at the beginning of a string might be more serious than omitting a letter from the end of a string. For another example, if you type a word or phrase that exists in string-2 and introduce a typographical error, you might produce string-1 instead of string-2.


Making the Generalized Edit Distance Symmetric

Generalized edit distance is not necessarily symmetric. That is, the value that is returned by COMPGED(string1, string2) is not always equal to the value that is returned by COMPGED(string2, string1) . To make the generalized edit distance symmetric, use the CALL COMPCOST routine to assign equal costs to the operations within each of the following pairs:


Comparisons

You can compute the Levenshtein edit distance by using the COMPLEV function. You can compute the generalized edit distance by using the CALL COMPCOST routine and the COMPGED function. Computing generalized edit distance requires considerably more computer time than does computing Levenshtein edit distance. But generalized edit distance usually provides a more useful measure than Levenshtein edit distance for applications such as fuzzy file merging and text mining.


Examples

The following example uses the default costs to calculate the generalized edit distance.

options nodate pageno=1 linesize=70 pagesize=60;

data test;
   infile datalines missover;
   input String1 $char8. +1 String2 $char8. +1 Operation $40.;
   GED=compged(string1, string2);
   datalines;
baboon   baboon   match
baXboon  baboon   insert
baoon    baboon   delete
baXoon   baboon   replace
baboonX  baboon   append
baboo    baboon   truncate
babboon  baboon   double
babon    baboon   single
baobon   baboon   swap
bab oon  baboon   blank
bab,oon  baboon   punctuation
bXaoon   baboon   insert+delete
bXaYoon  baboon   insert+replace
bXoon    baboon   delete+replace
Xbaboon  baboon   finsert
aboon    baboon   trick question: swap+delete
Xaboon   baboon   freplace
axoon    baboon   fdelete+replace
axoo     baboon   fdelete+replace+truncate
axon     baboon   fdelete+replace+single
baby     baboon   replace+truncate*2
balloon  baboon   replace+insert
;

proc print data=test label;
   label GED='Generalized Edit Distance';
   var String1 String2 GED Operation;
run;

The following output shows the results.

Generalized Edit Distance Based on Operation

                            The SAS System                           1

                           Generalized
                               Edit
 Obs   String1   String2     Distance    Operation

   1   baboon    baboon          0       match                      
   2   baXboon   baboon        100       insert                     
   3   baoon     baboon        100       delete                     
   4   baXoon    baboon        100       replace                    
   5   baboonX   baboon         50       append                     
   6   baboo     baboon         10       truncate                   
   7   babboon   baboon         20       double                     
   8   babon     baboon         20       single                     
   9   baobon    baboon         20       swap                       
  10   bab oon   baboon         10       blank                      
  11   bab,oon   baboon         30       punctuation                
  12   bXaoon    baboon        200       insert+delete              
  13   bXaYoon   baboon        200       insert+replace             
  14   bXoon     baboon        200       delete+replace             
  15   Xbaboon   baboon        200       finsert                    
  16   aboon     baboon        200       trick question: swap+delete
  17   Xaboon    baboon        200       freplace                   
  18   axoon     baboon        300       fdelete+replace            
  19   axoo      baboon        310       fdelete+replace+truncate   
  20   axon      baboon        320       fdelete+replace+single     
  21   baby      baboon        120       replace+truncate*2         
  22   balloon   baboon        200       replace+insert             

See Also

Functions:

COMPARE Function

CALL COMPCOST Routine

COMPLEV Function

Previous Page | Next Page | Top of Page