The task of logically comparing character values to determine if they are equal can be performed with a straight-forward IF statement:
if varA = varB then ...
However, when the values are expected to be unequal there may still exist a need to detect whether the values are similar enough to represent the same value.This can be done by using several DATA Step functions to quantify the difference between the two values. Once done, the application can set or establish the threshold difference for what it will consider pseudo-equality. The three DATA Step functions discussed in this SAS Note are SPEDIS, COMPGED, and COMPLEV.
SPEDISdistance = spedis(query, keyword);
The SPEDIS function compares two strings and returns a numeric value that represents the "distance" between the two strings; the smaller the distance the greater likelihood that the strings represent the same value. A cost is assigned to each operation or change required to convert the function's second parameter (keyword) to the same value as the first parameter (query). The sum of each cost is divided by the length of the query to compute the distance. For instance, in the following code segment, the 'y' in the value for keyword must be replaced with an "i" in order for it to equal the value of query.
query = 'Smith'; keyword = 'Smyth'; distance = spedis(query, keyword);
The cost of replacing a letter in the middle of the string is 100 and the length of query is 5. The distance returned by the SPEDIS function would be 20.
In this code segment, the second 't' and the final 'e' in the keyword value must be removed in order for the keyword value to equal the query value.
query = 'Point'; keyword = 'Pointte'; distance = spedis(query, keyword);
The cost of deleting one of a double letter is 25 and the deletion of a letter from the end costs 50 for a sum cost of 75. The distance returned by SPEDIS would be 75 / 5, or 15.
Note that reversing the two parameters to the SPEDIS function would not necessarily yield the same distance. The cost of each operation is documented with the SPEDIS function in the SAS Language Reference: Dictionary.
COMPGEDgeneralized_edit_distance = compged(string1, string2, cutoff, modifiers);
The COMPGED function is similar to the SPEDIS function in that certain costs are assigned to the operations needed to contruct string1 from string2. The value returned is the sum of these operations and is not divided by the length of a parameter as SPEDIS does. The COMPGED function allows you to provide a cutoff value which represents the maximum value returned. It also allows you to provide modifiers which instruct the function to do things such as ignore case and remove leading or trailing blanks.
The COMPGED function is a generalization of the Levenshtein edit distance (see COMPLEV function below.)
The default cost of each operation as well as additional information on the cutoff and modifier parameters are documented with the COMPGED function in the SAS Language Reference: Dictionary. Note that if a cost other than the default is desired, the CALL COMPCOST function can be used to assign a different cost to an operation. CALL COMPCOST is also documented in the SAS Language Reference: Dictionary.
COMPLEVLev_edit_distance = complev(string1, string2, cutoff,modifiers);
The COMPLEV function returns the Levenshtein edit distance which represents the number of insertions, deletions, or replacements of single characters that are required to convert one string to the other. As with the COMPGED function, it allows you to provide a cutoff value which represents the maximum value returned. It also allows you to provide modifiers which instruct the function to do things such as ignore case and remove leading or trailing blanks.
Here is a link to more information on the Levenshtein edit distance:
http://en.wikipedia.org/wiki/Levenshtein_distance
SAS Data Quality ServerIn addition to the Data step functions mentioned previously, the SAS Data Quality Server product provides the ability the analyze, clean, and standardize data. The software uses parsing, standardization, and matching algorithms to analyze your data and help you identify and reduce inconsistencies, inaccuracies, and redundancies. This product is often used in conjunction with products such as DataFlux Corporation's dfPower Studio, which enables you to customize the Quality Knowledge Base that the SAS Data Quality Server uses to store it data-cleansing guidelines.
Product Family | Product | System | SAS Release | |
Reported | Fixed* | |||
SAS System | Base SAS | z/OS | ||
OpenVMS VAX | ||||
Microsoft® Windows® for 64-Bit Itanium-based Systems | ||||
Microsoft Windows Server 2003 Datacenter 64-bit Edition | ||||
Microsoft Windows Server 2003 Enterprise 64-bit Edition | ||||
Microsoft Windows XP 64-bit Edition | ||||
Microsoft® Windows® for x64 | ||||
OS/2 | ||||
Microsoft Windows 95/98 | ||||
Microsoft Windows 2000 Advanced Server | ||||
Microsoft Windows 2000 Datacenter Server | ||||
Microsoft Windows 2000 Server | ||||
Microsoft Windows 2000 Professional | ||||
Microsoft Windows NT Workstation | ||||
Microsoft Windows Server 2003 Datacenter Edition | ||||
Microsoft Windows Server 2003 Enterprise Edition | ||||
Microsoft Windows Server 2003 Standard Edition | ||||
Microsoft Windows XP Professional | ||||
Windows Millennium Edition (Me) | ||||
Windows Vista | ||||
64-bit Enabled AIX | ||||
64-bit Enabled HP-UX | ||||
64-bit Enabled Solaris | ||||
ABI+ for Intel Architecture | ||||
AIX | ||||
HP-UX | ||||
HP-UX IPF | ||||
IRIX | ||||
Linux | ||||
Linux for x64 | ||||
Linux on Itanium | ||||
OpenVMS Alpha | ||||
OpenVMS on HP Integrity | ||||
Solaris | ||||
Solaris for x64 | ||||
Tru64 UNIX |
Type: | Usage Note |
Priority: |
Date Modified: | 2008-09-29 11:10:17 |
Date Created: | 2008-09-18 09:14:16 |