How Batch Geocoding Works

Data Requirements for Geocoding

To achieve the most accurate geocoding, ensure that the address data set to be geocoded contains name, address, city, state, ZIP code, and ZIP+4 variables. At least the address and city variables are required.

Created Data Sets

The geocoding facility first reads the chains, nodes, and details data sets for the map specified in the %GCBATCH macro. Then it creates new data sets for the sorted and summarized versions in the SAS library that was specified with the GLIB macro variable. Names for the geocoding data sets are generated from the specified map's chains data set name. For example, if your chains data set is GMAPS.USAC and you specify GLIB=GEOLIB in the %GCBATCH macro, then the geocoding facility creates the following data sets:

GEOLIB.USAS

contains sorted chains.

GEOLIB.USAM

contains matchable street data summarized from the chains data set and sorted by state, ZIP code, street name, and city.

GEOLIB.USAP

contains point coordinates along the street segment taken from the map's nodes and details data sets.

These summary data sets are created automatically before the first address-matching process begins. After the data sets are created, they are regenerated only when the map's chains data set is updated or when NEWDATA=YES is specified in the %GCBATCH macro.

When choosing the SAS library to use for these created data sets, keep in mind that they can be quite large, depending on the area of the base map. If you use the WORK library, then the data sets are deleted at the end of the current SAS session and must be regenerated if you want to perform geocoding again in a future SAS session.
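The naming pattern in the GMAPS.USAC example can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and the rule it applies (replace the final letter of the chains member name with S, M, or P) is inferred from that single example rather than taken from a documented specification.

```python
from typing import Dict


def geocoding_data_set_names(chains: str, glib: str) -> Dict[str, str]:
    """Derive the created data set names from a chains data set name.

    Assumption: the last letter of the chains member name is replaced
    by S (sorted chains), M (matchable streets), or P (points), as the
    GMAPS.USAC example suggests.
    """
    # Keep only the member name, dropping any libref prefix.
    member = chains.upper().rsplit(".", 1)[-1]
    stem = member[:-1]
    lib = glib.upper()
    return {
        "sorted_chains": f"{lib}.{stem}S",
        "matchable_streets": f"{lib}.{stem}M",
        "points": f"{lib}.{stem}P",
    }


names = geocoding_data_set_names("GMAPS.USAC", "GEOLIB")
print(names["sorted_chains"], names["matchable_streets"], names["points"])
```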

Reference Data Sets

Additional data sets used in geocoding are supplied by SAS:

SASHELP.GCTYPE

contains the official street abbreviations used by the U.S. Postal Service and in TIGER data from the U.S. Census Bureau. These values are used to standardize your address observations before geocoding.

SASHELP.PLFISP

contains place names, state codes, and FIPS place codes for U.S. locations. The places primarily represent cities and towns, but the data set also includes some national parks, industrial parks, military installations, and so on.

SASHELP.ZIPCODE

contains U.S. ZIP codes, FIPS state and city codes, city names, and post office names. The data set also contains the latitude and longitude for the centroid of each ZIP code area. If an address is not matched in the primary geocoding data sets, this data set is searched for a matching ZIP code. Updates for this data are available from the SAS Maps Online area at http://support.sas.com.

Match Addresses

The geocoding facility uses these data sets to match the addresses in the address data set. As it is processing the address data set, the geocoding facility provides a progress indicator. For every 10% of the addresses that are geocoded, a message is written to the SAS log.

When a match is found, the coordinates of the address location are added to the address data set, along with any other composite values for the specified address. For example, if the spatial data has a composite named TRACT that contains census tract numbers, you can use the geocoding process to add a TRACT variable to your address data set. The resulting geocoded address data set can be used as attribute data for the map, or it can be imported to add point data to the map by using a generic import.

If an address cannot be matched to the spatial data but the address includes a ZIP code, then the X and Y coordinates of the ZIP code area's centroid are returned instead of the exact coordinates of the address. The centroid coordinates are read from the SASHELP.ZIPCODE data set.
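The fallback order described here can be sketched in Python. The function name, its arguments, and the centroid table are hypothetical, and the coordinates are placeholders rather than values from SASHELP.ZIPCODE. The sketch shows only the order of preference, not how the geocoding facility is implemented.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]


def locate(street_xy: Optional[Point],
           zip_code: Optional[str],
           zip_centroids: Dict[str, Point]) -> Optional[Point]:
    """Prefer the street-level match, then fall back to the ZIP centroid."""
    if street_xy is not None:
        return street_xy                 # exact address location
    if zip_code is not None and zip_code in zip_centroids:
        return zip_centroids[zip_code]   # centroid of the ZIP code area
    return None                          # no coordinates available


# Placeholder centroid table (made-up values, for illustration only).
centroids = {"12345": (-73.9, 42.8)}

print(locate(None, "12345", centroids))
```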

For matching purposes, the geocoding process converts the address components to uppercase and attempts to convert direction and street type values to standard forms. The standardized versions of the address components are also added to the address data set. The M_ADDR, M_CITY, M_STATE, M_ZIP, and M_ZIP4 variables that are added to the address data set reflect the address values that were actually matched during the geocoding process. If a matching observation was found in the sorted chains data set, that row number is placed in the M_OBS variable.
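As a rough illustration of this standardization step, the following Python sketch converts an address to uppercase and replaces direction and street type words with abbreviated forms. The lookup tables are small hypothetical samples; the real standard forms come from SASHELP.GCTYPE. The token-by-token replacement is deliberately naive and is not the facility's actual algorithm.

```python
# Hypothetical sample tables; the real values come from SASHELP.GCTYPE.
STREET_TYPES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "DRIVE": "DR"}
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}


def standardize(address: str) -> str:
    """Uppercase the address and abbreviate direction and type words."""
    # Drop periods and commas so that "W." and "W" compare equal.
    tokens = address.upper().replace(".", "").replace(",", "").split()
    result = []
    for token in tokens:
        token = DIRECTIONS.get(token, token)
        token = STREET_TYPES.get(token, token)
        result.append(token)
    return " ".join(result)


print(standardize("123 North Main Street"))
```

Because every token is replaced independently, a street that is actually named "North" would also be abbreviated. A production implementation would need to consider the position of each token.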

Address Match Scoring

Not all address matches are equal. The geocoding process attempts to match different elements of each specified address. When multiple address elements match, the resulting X/Y location is more certain. The geocoding process adds _SCORE_, _STATUS_, and _NOTES_ variables to the address data set to indicate which elements were matched. These variable values can also indicate whether there was a problem with a specific part of the address.

The _SCORE_ variable's value is a numeric rating of the certainty of the address match. A higher score indicates a better match. The score is calculated by adding points for matching various components of the address.

A score of 100 indicates that a match was found for all of the components of the address. A score of 100 is possible only if the address in the data set includes values for all components and the geocoding lookup data contains variables for all components. For example, if the address in the data set does not have a ZIP+4 value or if the lookup data set does not have a PLUS4 type variable, then the highest possible score is 95.

_SCORE_ Values for Address Elements

Address Element Matched	Points Added to _SCORE_
Street number	40
Street name	20
Street type	5
Street direction	5
City	5
State	5
Five-digit ZIP code	15
First three digits of ZIP code	5
ZIP+4 code	5
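
The additive scoring can be sketched in Python as follows. The element keys and the function name are hypothetical. The sketch also makes one assumption that the table does not state: the five-digit and three-digit ZIP code entries are treated as alternatives, because counting both would bring the total to 105 rather than the maximum of 100 described earlier.

```python
from typing import Iterable

# Points per matched element, taken from the table above.
SCORE_POINTS = {
    "street_number": 40,
    "street_name": 20,
    "street_type": 5,
    "street_direction": 5,
    "city": 5,
    "state": 5,
    "zip5": 15,
    "zip3": 5,
    "zip4": 5,
}


def match_score(matched: Iterable[str]) -> int:
    """Sum the points for the matched address elements."""
    elements = set(matched)
    # Assumption: a five-digit ZIP match supersedes the three-digit one.
    if "zip5" in elements:
        elements.discard("zip3")
    return sum(SCORE_POINTS[name] for name in elements)


full = set(SCORE_POINTS) - {"zip3"}
print(match_score(full))             # every component matched
print(match_score(full - {"zip4"}))  # no ZIP+4 value available
```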

The _STATUS_ variable provides a general indication of the match result:

_STATUS_ Values for Match Results

_STATUS_ Value	Description
found	Street name and ZIP code or city and state match found. X/Y interpolated along street. _SCORE_ indicates how many elements were matched.
ZIP Match	Street name not found in lookup data. ZIP code was found in SASHELP.ZIPCODE. X/Y for ZIP center is within the lookup data extents.
ZIP Match OffMap	Street name not found in lookup data. ZIP code was found in SASHELP.ZIPCODE. X/Y for ZIP center is outside lookup data extents.
City/State Match	Street name not found in lookup data. City and state elements found in SASHELP.ZIPCODE. Multiple city and state matches were averaged for X/Y.
City not found	Address had missing ZIP code value. City is not in SASHELP.ZIPCODE. X/Y values are missing.
Unknown Address	No part of address was matched. X/Y values are missing.

The _NOTES_ variable provides additional details about which address elements were matched or invalid:

_NOTES_ Values for Match Results

_NOTES_ Value	Description
ZC	Five-digit ZIP code matched.
ZC3	First three digits of ZIP code matched.
AD	Street name matched.
TY	Street type matched.
DP	Street direction prefix matched.
DS	Street direction suffix matched.
NM	House number matched.
ST	State matched.
CT	City matched.
CT3	Used with ZC3. Street matched only the first three digits of the ZIP code in the lookup data, and either the city value was missing in the address or the city and state pair in the lookup data differed.
NOADD	Street address is invalid.
NOZC	Address ZIP code is missing.
NOCT	Address city name is invalid.
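
When reviewing geocoded output, it can help to expand the _NOTES_ codes into readable text. The Python sketch below does this under the assumption that a _NOTES_ value holds several codes separated by blanks; the source does not state the delimiter, and the function name is hypothetical.

```python
from typing import List

# Descriptions taken from the _NOTES_ table above.
NOTES_MEANINGS = {
    "ZC": "Five-digit ZIP code matched",
    "ZC3": "First three digits of ZIP code matched",
    "AD": "Street name matched",
    "TY": "Street type matched",
    "DP": "Street direction prefix matched",
    "DS": "Street direction suffix matched",
    "NM": "House number matched",
    "ST": "State matched",
    "CT": "City matched",
    "CT3": "Used with ZC3; city missing or city and state pair differed",
    "NOADD": "Street address is invalid",
    "NOZC": "Address ZIP code is missing",
    "NOCT": "Address city name is invalid",
}


def describe_notes(notes: str) -> List[str]:
    """Expand each code in a _NOTES_ value into its description."""
    # Assumption: codes are separated by whitespace.
    return [NOTES_MEANINGS.get(code, "Unrecognized code: " + code)
            for code in notes.split()]


for line in describe_notes("AD TY NM ZC"):
    print(line)
```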