The GEOCODE Procedure
Output Data Sets
By default, the GEOCODE procedure produces an output data set that contains all of the variables from the input address data set and the X, Y, and _MATCHED_ variables. You can also choose to add variables from the lookup data set to the output data set by using the ATTRIBUTEVAR= option. For example, if you are using SASHELP.ZIPCODE as the lookup data set, then you could assign the county name (COUNTYNM) to each matched observation in the output data set.
The default name for the output data set is DATAn, where n is the smallest integer that makes the name unique. For example, if the DATA1 data set already exists, then the default name for the output data set is DATA2.
The label of the output data set contains the text "geocoded date", where date is the date on which the output was created. This text is appended to the label of the input data set, if one exists.
For the STREET geocoding method, additional variables are included in the output data set. See Output Variables for Street Geocoding.
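The ATTRIBUTEVAR= usage described above can be sketched as follows. This is a minimal illustration, not a definitive invocation: WORK.CUSTOMERS and WORK.CUSTOMERS_GEO are hypothetical data set names, the input is assumed to contain a ZIP variable, and the METHOD= and ADDRESSZIPVAR= spellings are assumptions that should be checked against the PROC GEOCODE syntax for your release.

```sas
/* Sketch only: WORK.CUSTOMERS and WORK.CUSTOMERS_GEO are hypothetical names. */
proc geocode
   method=zip                  /* ZIP code geocoding (assumed option spelling) */
   data=work.customers         /* input addresses, assumed to contain ZIP */
   out=work.customers_geo      /* named explicitly instead of the default DATAn */
   lookup=sashelp.zipcode      /* the default lookup data set, shown for clarity */
   addresszipvar=zip           /* ZIP variable in the input data set (assumed) */
   attributevar=(countynm);    /* copy the county name to matched observations */
run;
```

Naming the output with OUT= avoids having to work out which DATAn name the procedure chose.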
The SASHELP.ZIPCODE Data Set
The default lookup data set for ZIP code geocoding and CITY geocoding is SASHELP.ZIPCODE. This data set is provided with Base SAS, and is updated for each SAS release.
You can download updated versions of the SASHELP.ZIPCODE data set from the SAS Maps Online Web site: www.sas.com/mapsonline.
SASHELP.ZIPCODE contains the following variables:
Name | Label
ZIP | The 5-digit ZIP code
Y | Latitude (decimal degrees) of the center of the ZIP code. 0.0 for APO/FPO.
X | Longitude (decimal degrees) of the center of the ZIP code. 0.0 for APO/FPO.
ZIP_CLASS | ZIP code classification: M=APO/FPO; P=Post office box; U=Unique ZIP code used for large organizations, businesses, or buildings; Blank=Standard/non-unique
CITY | Name of the city or organization
STATE | Two-digit number (FIPS code) for the state or territory
STATECODE | Two-character postal code for the state or territory name
STATENAME | Full name of the state or territory
COUNTY | FIPS county code. Blank for APO/FPO addresses.
COUNTYNM | Name of county or parish. Blank for APO/FPO addresses.
MSA | Metropolitan Statistical Area code by common population; no MSA for rural areas
AREACODE | Area code for the ZIP code. Blank for APO/FPO addresses.
AREACODES | Multiple area codes for the ZIP code. Blank for APO/FPO addresses.
ALIAS_CITY | Alternate names for the city. Names are separated by "||".
TIMEZONE | Time zone for the ZIP code. Blank for APO/FPO addresses.
GMTOFFSET | Difference (hours) between GMT and the time zone for the ZIP code
DST | Whether the ZIP code observes Daylight Saving Time: Y=Yes, N=No
PONAME | USPS Post Office name
Alternate ZIP Code and ZIP+4 Lookup Data Sets
While the SASHELP.ZIPCODE data set is the default lookup data set for the ZIP and CITY geocoding methods, data from other sources can be used as long as it is read into a SAS data set.
For ZIP code geocoding, any lookup data set must contain the following variables:
Default Name | Description
ZIP | Five-digit ZIP code
X | Longitude of the center coordinate
Y | Latitude of the center coordinate
For CITY geocoding, these additional variables are required:
Default Name | Description
CITY | Name of the city
STATECODE | Two-character postal code for the state or province name
Note: If you use an alternative ZIP code lookup data set, then the variable data types should match those of the SASHELP.ZIPCODE data set.
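One way to prepare an alternative lookup data set is to rename a vendor's variables to the default names and then compare variable types against SASHELP.ZIPCODE. The sketch below assumes a hypothetical source data set MYLIB.VENDOR_ZIPS with hypothetical variables ZIPCODE, LONGITUDE, LATITUDE, CITY, and STATECODE; none of these names comes from SAS.

```sas
/* Sketch only: MYLIB.VENDOR_ZIPS and its variable names are hypothetical. */
data work.zip_lookup;
   /* Rename the vendor variables to the default lookup names. */
   set mylib.vendor_zips(rename=(zipcode=zip longitude=x latitude=y));
   /* CITY and STATECODE are needed only for CITY geocoding. */
   keep zip x y city statecode;
run;

/* List variable types so that they can be compared by eye. */
proc contents data=sashelp.zipcode(keep=zip x y city statecode);
run;

proc contents data=work.zip_lookup;
run;
```

If a type differs (for example, a character ZIP where SASHELP.ZIPCODE stores a numeric one), convert it in the DATA step before geocoding.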
When you use ZIP+4 geocoding, you must specify an alternative lookup data set because the SASHELP.ZIPCODE data set does not contain any ZIP+4 values. This data set must contain the following variables:
Default Name | Description
ZIP | Five-digit ZIP code
PLUS4 | Four-digit ZIP+4 extension
X | Longitude of the central coordinate
Y | Latitude of the central coordinate
You can specify different names for the variables by using options in the PROC GEOCODE statement. For example, the LOOKUPPLUS4VAR= option specifies the name of the ZIP+4 extension variable in the lookup data set.
The ZIP and PLUS4 variables can contain either character data or numeric data. The data type must match the type of the corresponding variable in your input data set.
Note: Character value matching in the GEOCODE procedure is not case sensitive, so character values in your input and lookup data sets do not need to match in case.
Additional attribute variables can also be in the alternate lookup data set even if they are not used to find matches. You can add these variables to the output data set by using the ATTRIBUTEVAR= option in the PROC GEOCODE statement.
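A ZIP+4 run that uses a nondefault variable name and an extra attribute variable might look like the sketch below. Everything specific here is an assumption for illustration: LOOKUP.ZIP4, WORK.CUSTOMERS, the lookup variable ZIP4EXT, and the attribute variable TRACT are hypothetical names, and the option spellings (METHOD=, ADDRESSZIPVAR=, ADDRESSPLUS4VAR=, LOOKUPPLUS4VAR=) should be confirmed in the PROC GEOCODE syntax for your release.

```sas
/* Sketch only: all data set and variable names are hypothetical. */
proc geocode
   method=plus4                 /* ZIP+4 geocoding (assumed option spelling) */
   data=work.customers          /* input with ZIP and PLUS4 variables */
   out=work.customers_plus4
   lookup=lookup.zip4           /* alternative lookup data set with ZIP+4 values */
   addresszipvar=zip            /* ZIP variable in the input data set */
   addressplus4var=plus4        /* ZIP+4 extension variable in the input data set */
   lookupplus4var=zip4ext       /* lookup variable that is not named PLUS4 */
   attributevar=(tract);        /* extra lookup variable copied to the output */
run;
```

The ZIP and ZIP+4 variables must have the same type (character or numeric) in both the input and the lookup data sets.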
You can obtain a lookup data set for ZIP+4 geocoding from the SAS Maps Online Web site at www.sas.com/mapsonline. On the Downloads page, select Geocoding to access the downloads that are related to geocoding.
An alternative source for ZIP+4 lookup data is the Geo*Data product from Melissa Data. You can use the %GCDMEL9 autocall macro to convert Geo*Data files to SAS data sets. For more information, see %GCDMEL9 Autocall Macro.
U.S. Military ZIP Codes
ZIP codes for U.S. military post offices are provided in the ZIPMIL data set in the SASHELP library. You can combine this data set with the ZIPCODE data set to support military ZIP codes.
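A simple way to combine the two data sets is to concatenate them in a DATA step. This is a minimal sketch; the output name WORK.ZIPCODE_ALL is hypothetical, and the two data sets are assumed to share compatible variable names and types.

```sas
/* Sketch only: WORK.ZIPCODE_ALL is a hypothetical output name. */
data work.zipcode_all;
   /* Append the military ZIP codes to the standard ZIP codes. */
   set sashelp.zipcode sashelp.zipmil;
run;
```

The combined data set can then be supplied as the lookup data set. A DATA step does not copy indexes, so re-create any index that you rely on.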
Data Sets for RANGE Geocoding
Note: RANGE geocoding is available in SAS 9.2 Phase 2 and later.
RANGE geocoding requires both a lookup data set and a range data set. The range data set identifies ranges of IP addresses. The lookup data set contains geographic coordinates. Both data sets must contain a key variable that links each IP range to its location.
The lookup data set must contain the following variables:
a key variable that corresponds to a key variable in the range data set.
an X variable that contains the longitude value of the center coordinate. The default variable name is X.
a Y variable that contains the latitude value of the center coordinate. The default variable name is Y.
The range data set must contain the following variables:
a variable that specifies the beginning value of a range of IP addresses
a variable that specifies the ending value of a range of IP addresses
a key variable that corresponds to a key variable in the lookup data set
You can obtain lookup and range data from third-party vendors. One vendor is MaxMind, Inc. at www.maxmind.com . You can use the %MAXMIND autocall macro to convert comma-separated value (CSV) files from MaxMind into SAS data sets. For more information, see %MAXMIND Autocall Macro.
%GCDMEL9 Autocall Macro
The %GCDMEL9 autocall macro enables you to directly import Geo*Data files from Melissa Data as SAS data sets. Geo*Data files contain third-party ZIP+4 lookup data for use with PLUS4 geocoding.
Geo*Data files are available for each state. The files are provided as text files within compressed (ZIP) archives. Melissa Data also provides the PKUNZIP utility to extract the text files.
The %GCDMEL9 macro uses the following macro variables:
DATASETNAME specifies the name of the output data set.
DATASETPATH specifies the location where the output data set is created.
DATASETLABEL (optional) specifies a label for the output data set.
LIBNAME specifies the name for a new library that is assigned for the location that you specified in the DATASETPATH macro variable.
UNZIPPEDPATH specifies the location of the extracted Geo*Data files that you want to import. The %GCDMEL9 macro attempts to read all of the text (.txt) files in this directory.
specifies the path where temporary files are written. The default path is the path for the WORK library.
In this example, a Geo*Data file for the state of Delaware (DE.txt) is extracted to C:\Mydata. The lookup data set is created in the directory C:\Geocode and assigned the libref ZIP4. The resulting data set is named ZIP4.DELAWARE.
The following code imports the data:
/* Define macro variables */
%let UNZIPPEDPATH=C:\Mydata;
%let DATASETPATH=C:\Geocode;
%let DATASETNAME=Delaware;
%let LIBNAME=ZIP4;
%let DATASETLABEL=ZIP+4 lookup data for Delaware;

/* Submit autocall macro */
%GCDMEL9;
%MAXMIND Autocall Macro
The %MAXMIND autocall macro enables you to convert IP geocoding data from MaxMind, Inc. into SAS data sets. The %MAXMIND autocall macro supports MaxMind's IP data in comma-separated value (CSV) format.
Note: This feature is for SAS 9.2 Phase 2 and later.
The %MAXMIND macro uses the following macro variables:
CSVPATH specifies the path where the MaxMind CSV files are located. You must extract the files from the ZIP archive before using the %MAXMIND autocall macro.
IPDATAPATH specifies the path where the output SAS data sets are created. You must have write permissions for this path.
CSVBLOCKSFILE specifies the filename for the CSV file that contains IP address range values. The file that you specify must contain the startIpNum and endIpNum variables.
CSVLOCATIONFILE specifies the filename for the CSV file that contains longitude and latitude values.
CSVCOUNTRYFILE specifies the name of the optional MaxMind CSV file that contains country names.
specifies the path where temporary files are written. The default path is the path for the WORK library.
The %MAXMIND macro creates the CITYBLOCKS and CITYLOCATION data sets in the path that you specified for the IPDATAPATH variable. The libref IPDATA is created automatically for this path.
In this example, data from MaxMind is located in C:\Mydata. The output SAS data sets are created in the directory C:\Geocode.
The following code imports the data:
%let CSVPATH=C:\Mydata;
%let IPDATAPATH=C:\Geocode;
%let CSVBLOCKSFILE=GeoLiteCity-Blocks.csv;
%let CSVLOCATIONFILE=GeoLiteCity-Location.csv;
%let CSVCOUNTRYFILE=GeoIPCountryWhois.csv;
%maxmind;
The imported data sets are IPDATA.CITYBLOCKS and IPDATA.CITYLOCATION.
Optimizing Performance
Geocoding often requires very large lookup data sets, which can affect the performance of the GEOCODE procedure. You can optimize your geocoding performance by performing the following actions:
Index your lookup data sets by using the appropriate variables.
Load the lookup data sets into memory by using the SASFILE statement.
If you use alternative lookup data sets, then indexing your lookup data sets can improve performance. You should create an index by using the variables that are appropriate for your geocoding method.
Note: The SASHELP.ZIPCODE data set and the ZIP4 data set from SAS Maps Online are optimized for use with the GEOCODE procedure. Additionally, data sets that you convert by using the %GCDMEL9 and %MAXMIND autocall macros are indexed automatically. No modifications are needed for any of these data sets.
Note: The STREET geocoding data sets that are provided by SAS are already indexed for the GEOCODE procedure.
For ZIP+4 geocoding, you should create a simple index on the ZIP variable and a composite index on the ZIP and PLUS4 variables.
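The ZIP+4 indexes can be created with the INDEX CREATE statement in PROC DATASETS. The sketch below assumes a hypothetical lookup data set LOOKUP.ZIP4 with variables ZIP and PLUS4; the composite index name ZIP_PLUS4 is also an arbitrary choice.

```sas
/* Sketch only: LOOKUP.ZIP4 and the index name ZIP_PLUS4 are hypothetical. */
proc datasets library=lookup nolist;
   modify zip4;
   index create zip;                     /* simple index on ZIP */
   index create zip_plus4=(zip plus4);   /* index on ZIP and PLUS4 together */
run;
quit;
```

Data sets that are already optimized or indexed, such as those noted above, do not need this step.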
For RANGE geocoding, you should sort your lookup data set by the key variable, and then create a simple index on the key variable. You should sort the range data set by the beginning IP address variable, and then create two simple indexes, one on the beginning IP address variable and one on the ending IP address variable.
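The sort-then-index sequence for RANGE geocoding can be sketched as follows. The data sets WORK.IP_LOCATIONS and WORK.IP_RANGES and the variables LOCID, STARTIPNUM, and ENDIPNUM are hypothetical names chosen for illustration; substitute the key and range variables in your own data.

```sas
/* Sketch only: data set and variable names are hypothetical. */

/* Sort first. Sorting a data set in place after indexing would drop the index. */
proc sort data=work.ip_locations;
   by locid;                  /* key variable in the lookup data set */
run;

proc sort data=work.ip_ranges;
   by startipnum;             /* beginning IP address variable */
run;

proc datasets library=work nolist;
   modify ip_locations;
   index create locid;        /* simple index on the key variable */
run;
   modify ip_ranges;
   index create startipnum;   /* simple index on the beginning address */
   index create endipnum;     /* simple index on the ending address */
run;
quit;
```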
For more information, see Understanding SAS Indexes in the SAS Language Reference: Concepts.
You can load your lookup data sets into memory by using the SASFILE statement. Loading data into memory reduces I/O processing and can improve the speed of your geocoding operation. Test your geocoding operations with the lookup data sets loaded into memory to determine whether sufficient memory is available and whether performance actually improves.
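A typical pattern is to load the lookup data set, run the procedure, and then release the memory. In this sketch, LOOKUP.ZIP4 and WORK.CUSTOMERS are hypothetical data sets, and the PROC GEOCODE option spellings are assumptions to verify against the syntax for your release.

```sas
/* Sketch only: LOOKUP.ZIP4 and WORK.CUSTOMERS are hypothetical names. */

/* Read the whole lookup data set into memory and keep it open. */
sasfile lookup.zip4 load;

proc geocode
   method=plus4
   data=work.customers
   out=work.customers_plus4
   lookup=lookup.zip4;
run;

/* Release the memory when geocoding is finished. */
sasfile lookup.zip4 close;
```

Comparing the elapsed time in the SAS log with and without the SASFILE statements shows whether the load is worthwhile for your data.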
For more information, see SASFILE statement in the SAS Language Reference: Dictionary.
Copyright © 2010 by SAS Institute Inc., Cary, NC, USA. All rights reserved.