Returns the generalized edit distance between two strings.

Category: Character

Restriction: I18N Level 0 functions are designed for use with Single Byte Character Sets (SBCS) only.

string-1 specifies a character constant, variable, or expression.

string-2 specifies a character constant, variable, or expression.

cutoff is a numeric constant, variable, or expression. If the actual generalized edit distance is greater than the value of cutoff, the value that is returned is equal to the value of cutoff.

modifiers specifies a character string that can modify the action of the COMPGED function. You can use one or more of the following characters as a valid modifier:

| Modifier | Action |
|---|---|
| i or I | ignores the case in string-1 and string-2. |
| l or L | removes leading blanks in string-1 and string-2 before comparing the values. |
| n or N | removes quotation marks from any argument that is an n-literal and ignores the case of string-1 and string-2. |
| : (colon) | truncates the longer of string-1 or string-2 to the length of the shorter string, or to one, whichever is greater. |
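The following is a minimal sketch of how the cutoff and modifiers arguments might be passed. The variable names (s1, s2, ged_plain, and so on), the sample strings, and the cutoff value of 50 are assumptions chosen only for illustration. The results are written to the log rather than stated here.

```sas
data _null_;
   /* Hypothetical sample values, chosen only for illustration */
   s1 = 'Baboon';
   s2 = '  baboon';

   /* No cutoff and no modifiers */
   ged_plain = compged(s1, s2);

   /* Cutoff only: the returned value does not exceed 50 */
   ged_capped = compged(s1, s2, 50);

   /* Cutoff plus modifiers: ignore case (i) and remove leading blanks (l) */
   ged_mod = compged(s1, s2, 50, 'il');

   /* Write the three values to the SAS log */
   put ged_plain= ged_capped= ged_mod=;
run;
```

Comparing ged_plain with ged_mod in the log shows how much of the distance comes from the case difference and the leading blanks rather than from the letters themselves.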

The COMPGED function returns the generalized edit distance between string-1 and string-2. The generalized edit distance is the total cost of the minimum-cost sequence of operations for constructing string-1 from string-2.

The algorithm for computing
the sum of the costs involves a pointer that points to a character
in string-2 (the input string).
An output string is constructed by a sequence of operations that might
advance the pointer, add one or more characters to the output string,
or both. Initially, the pointer points to the first character in the
input string, and the output string is empty.

The rationale for determining
the generalized edit distance is based on the number and types of
typographical errors that can occur. COMPGED assigns a cost to each
error and determines the minimum sum of these costs that could be
incurred. Some types of errors can be more serious than others. For
example, inserting an extra letter at the beginning of a string might
be more serious than omitting a letter from the end of a string. For
another example, if you type a word or phrase that exists in string-2 and introduce a typographical error,
you might produce string-1 instead
of string-2.

Generalized edit distance is not necessarily symmetric. That is, the value that is returned by `COMPGED(string1, string2)` is not always equal to the value that is returned by `COMPGED(string2, string1)`. To make the generalized edit distance symmetric, use the CALL COMPCOST routine to assign equal costs to the operations within each of the following pairs:

- INSERT, DELETE
- FINSERT, FDELETE
- APPEND, TRUNCATE
- DOUBLE, SINGLE

You can compute the Levenshtein edit distance by using the COMPLEV function. You can compute the generalized edit distance by using the CALL COMPCOST routine and the COMPGED function. Computing generalized edit distance requires considerably more computer time than computing Levenshtein edit distance. However, generalized edit distance usually provides a more useful measure than Levenshtein edit distance for applications such as fuzzy file merging and text mining.
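As a hedged sketch of the two points above, the step below assigns equal costs within operation pairs by using CALL COMPCOST, compares the strings in both directions, and computes the Levenshtein distance with COMPLEV for contrast. The operation names passed to CALL COMPCOST (INSERT/DELETE, FINSERT/FDELETE, APPEND/TRUNCATE, DOUBLE/SINGLE) follow that routine's documented operation keywords. The cost values, variable names, and sample strings are assumptions for illustration, not recommended settings.

```sas
data _null_;
   /* Assumed cost values: what matters is that both members of a pair match */
   if _n_ = 1 then
      call compcost('insert',  100, 'delete',   100,
                    'finsert', 200, 'fdelete',  200,
                    'append',   50, 'truncate',  50,
                    'double',   20, 'single',    20);

   /* Hypothetical sample strings */
   s1 = 'baboonX';
   s2 = 'baboon';

   /* Generalized edit distance in both directions, using the costs set above */
   ged_12 = compged(s1, s2);
   ged_21 = compged(s2, s1);

   /* Levenshtein edit distance; CALL COMPCOST does not apply to COMPLEV */
   lev = complev(s1, s2);

   /* Write the values to the SAS log */
   put ged_12= ged_21= lev=;
run;
```

Removing the CALL COMPCOST statement and rerunning the step is one way to see whether the default costs give different values for ged_12 and ged_21.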

```sas
data test;
   infile datalines missover;
   input String1 $char8. +1 String2 $char8. +1 Operation $40.;
   GED=compged(string1, string2);
   datalines;
baboon   baboon   match
baXboon  baboon   insert
baoon    baboon   delete
baXoon   baboon   replace
baboonX  baboon   append
baboo    baboon   truncate
babboon  baboon   double
babon    baboon   single
baobon   baboon   swap
bab oon  baboon   blank
bab,oon  baboon   punctuation
bXaoon   baboon   insert+delete
bXaYoon  baboon   insert+replace
bXoon    baboon   delete+replace
Xbaboon  baboon   finsert
aboon    baboon   trick question: swap+delete
Xaboon   baboon   freplace
axoon    baboon   fdelete+replace
axoo     baboon   fdelete+replace+truncate
axon     baboon   fdelete+replace+single
baby     baboon   replace+truncate*2
balloon  baboon   replace+insert
;

proc print data=test label;
   label GED='Generalized Edit Distance';
   var String1 String2 GED Operation;
run;
```

Copyright © SAS Institute Inc. All rights reserved.