
Using DATA Step Component Objects

Using the Hash Object


Why Use the Hash Object?

The hash object provides an efficient, convenient mechanism for quick data storage and retrieval. The hash object stores and retrieves data based on lookup keys.

To use the DATA step Component Object Interface, follow these steps:

  1. Declare the hash object.

  2. Create an instance of (instantiate) the hash object.

  3. Initialize lookup keys and data.

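After you declare and instantiate a hash object, you can perform many tasks, including storing and retrieving data, maintaining key summaries, replacing and removing data, and saving the hash object data in a data set.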

For example, suppose that you have a large data set that contains numeric lab results, each associated with a unique patient number and a weight, and a small data set that contains patient numbers (a subset of those in the large data set). You can load the large data set into a hash object, using the unique patient number as the key and the weight values as the data. You can then iterate over the small data set, use each patient number to look up that patient in the hash object, and output the data for patients whose weight is over a certain value to a different data set.

Depending on the number of lookup keys and the size of the data set, the hash object lookup can be significantly faster than a standard format lookup.
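The patient lookup described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the data set names WORK.LABS, WORK.PATIENTS, and WORK.HEAVY, the variable names PATIENT_ID and WEIGHT, the sample values, and the cutoff of 200 are all assumptions made for the example.

```sas
/* Hypothetical large data set: one row per patient */
data work.labs;
   input patient_id weight;
   datalines;
101 180
102 225
103 240
104 150
;
run;

/* Hypothetical small data set: a subset of the patient numbers */
data work.patients;
   input patient_id;
   datalines;
102
104
;
run;

data work.heavy;
   length patient_id weight 8;
   if _n_ = 1 then do;
      /* Load the large data set once, keyed by patient number */
      declare hash labs(dataset: "work.labs");
      rc = labs.defineKey('patient_id');
      rc = labs.defineData('weight');
      rc = labs.defineDone();
      /* Avoid an uninitialized variable note for WEIGHT */
      call missing(weight);
   end;

   /* Iterate over the small data set and look up each patient */
   set work.patients;
   rc = labs.find();
   /* 200 is an arbitrary cutoff chosen for this sketch */
   if rc = 0 and weight > 200 then output;
   drop rc;
run;
```

The FIND method fills WEIGHT only when it returns 0, so the return code is tested before the weight is compared.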


Declaring and Instantiating a Hash Object

You declare a hash object using the DECLARE statement. After you declare the new hash object, use the _NEW_ operator to instantiate the object.

declare hash myhash;
myhash = _new_ hash();

The DECLARE statement tells the compiler that the object reference MYHASH is of type hash. At this point, you have declared only the object reference MYHASH. It has the potential to hold a component object of type hash. You should declare the hash object only once. The _NEW_ operator creates an instance of the hash object and assigns it to the object reference MYHASH.

As an alternative to the two-step process of using the DECLARE statement and the _NEW_ operator to declare and instantiate a component object, you can use the DECLARE statement to declare and instantiate the component object in one step.

declare hash myhash();

The above statement is equivalent to the following code:

declare hash myhash;
myhash = _new_ hash();

For more information about the DECLARE statement and the _NEW_ operator, see SAS Language Reference: Dictionary.


Initializing Hash Object Data Using a Constructor

When you create a hash object, you might want to provide initialization data. A constructor is a method that you can use to instantiate a hash object and initialize the hash object data.

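The hash object constructor can have either of the following formats:

declare hash object_name(argument_tag-1: value-1 <, ...argument_tag-n: value-n>);
object_name = _new_ hash(argument_tag-1: value-1 <, ...argument_tag-n: value-n>);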

These are the valid hash object argument tags:

dataset: 'dataset_name'

is the name of a SAS data set to load into the hash object.

The name of the SAS data set can be a literal or a character variable. The data set name must be enclosed in single or double quotation marks. Macro variables must be in double quotation marks.

When declaring a hash object, you can use SAS data set options in the dataset argument tag. For a list of SAS data set options, see Data Set Options by Category in SAS Language Reference: Dictionary. For more information on using data set options when declaring a hash object, see the DECLARE Statement in SAS Language Reference: Dictionary.

Note:   If the data set contains duplicate keys, by default, the first instance is stored in the hash object; subsequent instances are ignored. To store the last instance in the hash object, use the DUPLICATE argument tag with the 'replace' option. To write an error to the SAS log when a duplicate key is found, use the DUPLICATE argument tag with the 'error' option.

duplicate: 'option'

determines whether to ignore duplicate keys when loading a data set into the hash object. The default is to store the first key and ignore all subsequent duplicates. Options can be one of the following values:

'replace' | 'r'

stores the last duplicate key record.

'error' | 'e'

reports an error to the log if a duplicate key is found.

The following example using the REPLACE option stores brown for the key 620 and blue for the key 531. If you use the default, green would be stored for 620 and yellow would be stored for 531.

data table;
  input key data $;
  datalines;
  531 yellow
  620 green
  531 blue
  908 orange
  620 brown
  143 purple
;
run;

data _null_;
length key 8 data $ 8;
if _n_ = 1 then do;
    declare hash myhash(dataset: "table", duplicate: "r");
    rc = myhash.definekey('key');
    rc = myhash.definedata('data');
    myhash.definedone();
    call missing(key,data);
 end;

rc = myhash.output(dataset:"otable");
run;

proc print data=otable;
run;

hashexp: n

is the hash object's internal table size, where the size of the hash table is 2^n.

The value of HASHEXP is used as a power-of-two exponent to create the hash table size. For example, a value of 4 for HASHEXP equates to a hash table size of 2^4, or 16. The maximum value for HASHEXP is 20, which equates to a hash table size of 2^20, or 1,048,576.

The hash table size is not equal to the number of items that can be stored. Think of the hash table as an array of containers. A hash table size of 16 would have 16 containers. Each container can hold an infinite number of items. The efficiency of the hash tables lies in the ability of the hash function to map items to and retrieve items from the containers.

In order to maximize the efficiency of the hash object lookup routines, you should set the hash table size according to the amount of data in the hash object. Try different HASHEXP values until you get the best result. For example, if the hash object contains one million items, a hash table size of 16 (HASHEXP = 4) would not be very efficient. A hash table size of 512 or 1024 (HASHEXP = 9 or 10) would result in better performance.
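As a rough sketch of how HASHEXP is specified, the following step requests a table of 2^10 = 1024 containers and stores far more items than that. The key variable K and the item count of 100,000 are assumptions chosen for illustration; timing different HASHEXP values against your own data is the only reliable way to pick one.

```sas
data _null_;
   length k 8;

   /* HASHEXP of 10 requests 2**10 = 1024 containers */
   declare hash h(hashexp: 10);
   rc = h.defineKey('k');
   rc = h.defineDone();

   /* The number of items can far exceed the number of containers */
   do k = 1 to 100000;
      rc = h.add();
   end;

   /* NUM_ITEMS reports how many items are stored, not the table size */
   n = h.num_items;
   put 'Items stored: ' n;
run;
```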

Default: 8, which equates to a hash table size of 2^8 or 256.

ordered: 'option'

specifies whether or how the data is returned in key-value order if you use the hash object with a hash iterator object or if you use the hash object OUTPUT method.

option can be one of the following values:

'ascending' | 'a'

Data is returned in ascending key-value order. Specifying 'ascending' is the same as specifying 'yes'.

'descending' | 'd'

Data is returned in descending key-value order.

'YES' | 'Y'

Data is returned in ascending key-value order. Specifying 'yes' is the same as specifying 'ascending'.

'NO' | 'N'

Data is returned in an undefined order.

Default: NO

You can enclose the argument value in double quotation marks, instead of single.
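The effect of the ORDERED argument tag can be seen with the OUTPUT method. The sketch below inserts three keys out of order and writes them to a data set; the variable names, the sample values, and the data set name WORK.SORTED are assumptions made for this example.

```sas
data _null_;
   length key 8 data $ 8;

   /* 'a' requests ascending key order for OUTPUT and for iterators */
   declare hash h(ordered: 'a');
   rc = h.defineKey('key');
   /* List KEY as data too, so that it appears in the output data set */
   rc = h.defineData('key', 'data');
   rc = h.defineDone();

   /* Insert the keys out of order */
   key = 620; data = 'green';  rc = h.add();
   key = 143; data = 'purple'; rc = h.add();
   key = 531; data = 'yellow'; rc = h.add();

   /* The rows are written in ascending order of KEY */
   rc = h.output(dataset: 'work.sorted');
run;

proc print data=work.sorted;
run;
```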

multidata: 'option'

specifies whether multiple data items are allowed for each key.

option can be one of the following values:

'YES' | 'Y'

Multiple data items are allowed for each key.

'NO' | 'N'

Only one data item is allowed for each key.

Default: NO
See: Non-Unique Key and Data Pairs.

You can enclose the argument value in double quotation marks, instead of single.

suminc: 'variable-name'

maintains a summary count of hash object keys. The SUMINC argument tag is given the name of a DATA step variable that holds the sum increment, that is, how much to add to the key summary for each reference to the key. The SUMINC value can be greater than, less than, or equal to 0. Each time the key is referenced, the key summary changes by the current value of that DATA step variable. For example, the following statement uses the variable COUNT as the sum increment:

dcl hash myhash(suminc: 'count');

For more information, see Maintaining Key Summaries.

For more information on the DECLARE statement and the _NEW_ operator, see the SAS Language Reference: Dictionary.


Defining Keys and Data

The hash object uses lookup keys to store and retrieve data. The keys and the data are DATA step variables that you use to initialize the hash object by using dot notation method calls. A key is defined by passing the key variable name to the DEFINEKEY method. Data is defined by passing the data variable name to the DEFINEDATA method. After you have defined all key and data variables, the DEFINEDONE method is called. Keys and data can consist of any number of character or numeric DATA step variables.

For example, the following code initializes a character key and a character data variable.

length d $20;
length k $20;

if _N_ = 1 then do;
   declare hash h();
   rc = h.defineKey('k');
   rc = h.defineData('d');
   rc = h.defineDone();
end;

You can have multiple key and data variables, but the entire key must be unique unless you use the MULTIDATA argument tag. You can store more than one data variable with a particular key. For example, you could modify the previous example to store auxiliary numeric values with the character key and data. In this example, each key and each data item consists of a character value and a numeric value.

length d1 8;
length d2 $20;
length k1 $20;
length k2 8;

if _N_ = 1 then do;
   declare hash h();
   rc = h.defineKey('k1', 'k2');
   rc = h.defineData('d1', 'd2');
   rc = h.defineDone();
end;

For more information about the DEFINEDATA, DEFINEDONE, and the DEFINEKEY methods, see the SAS Language Reference: Dictionary.

Note:   The hash object does not assign values to key variables (for example, h.find(key:'abc')), and the SAS compiler cannot detect the data variable assignments that are performed by the hash object and the hash iterator. Therefore, if no assignment to a key or data variable appears in the program, SAS issues a note stating that the variable is uninitialized. To avoid receiving these notes, you can assign an initial value (typically a missing value) to each key and data variable, call the CALL MISSING routine with all of the key and data variables as arguments, or set the NONOTES system option.

Non-Unique Key and Data Pairs

By default, all of the keys in a hash object are unique. This means one set of data variables exists for each key. In some situations, you might want to have duplicate keys in the hash object, that is, associate more than one set of data variables with a key.

For example, assume that the key is a patient ID and the data is a visit date. If the patient were to visit multiple times, multiple visit dates would be associated with the patient ID. When you create a hash object with the MULTIDATA:"YES" argument tag, multiple sets of the data variables are associated with the key.

If the data set contains duplicate keys, by default, the first instance is stored in the hash object and subsequent instances are ignored. To store the last instance in the hash object, use the DUPLICATE argument tag with the 'replace' option. To write an error to the SAS log when a duplicate key is found, use the DUPLICATE argument tag with the 'error' option.

However, the hash object allows storage of multiple values for each key if you use the MULTIDATA argument tag in the DECLARE statement or _NEW_ operator. The hash object keeps the multiple values in a list that is associated with the key. This list can be traversed and manipulated by using several methods such as HAS_NEXT or FIND_NEXT.

To traverse a multiple data item list, you must know the current list item. Start by calling the FIND method for a given key. The FIND method sets the current list item. Then to determine whether the key has multiple data values, call the HAS_NEXT method. After you have determined that the key has another data value, you can retrieve that value with the FIND_NEXT method. The FIND_NEXT method sets the current list item to the next item in the list and sets the corresponding data variable or variables for that item.

In addition to moving forward through the list for a given key, you can loop backwards through the list by using the HAS_PREV and FIND_PREV methods in a similar manner.
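The forward traversal described above can be sketched as follows, using the patient visit scenario. The variable names PATIENT_ID, VISIT_DATE, and HAS_MORE and the sample dates are assumptions made for illustration; a backward traversal would use HAS_PREV and FIND_PREV in the same pattern.

```sas
data _null_;
   length patient_id visit_date has_more 8;
   format visit_date date9.;
   has_more = 0;

   /* MULTIDATA: 'y' allows several data items for one key */
   declare hash visits(multidata: 'y');
   rc = visits.defineKey('patient_id');
   rc = visits.defineData('visit_date');
   rc = visits.defineDone();

   /* Store three visits for the same patient */
   patient_id = 1001;
   visit_date = '05JAN2009'd; rc = visits.add();
   visit_date = '19FEB2009'd; rc = visits.add();
   visit_date = '02APR2009'd; rc = visits.add();

   /* FIND sets the current list item for the key */
   rc = visits.find();
   if rc = 0 then do;
      put patient_id= visit_date=;
      /* HAS_NEXT reports whether another item follows the current one */
      rc = visits.has_next(result: has_more);
      do while (has_more ne 0);
         /* FIND_NEXT advances the list and fills VISIT_DATE */
         rc = visits.find_next();
         put patient_id= visit_date=;
         rc = visits.has_next(result: has_more);
      end;
   end;
run;
```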

Note:   For SAS 9.2 Phase 2 and later, the items in a multiple data item list are maintained in the order in which you insert them.

For more information about these and other methods associated with non-unique key and data pairs, see Hash and Hash Iterator Object Language Elements in SAS Language Reference: Dictionary.


Storing and Retrieving Data

After you initialize the hash object's key and data variables, you can store data in the hash object using the ADD method, or you can use the dataset argument tag to load a data set into the hash object. If you use the dataset argument tag, and if the data set contains more than one observation with the same value of the key, by default, SAS keeps the first observation in the hash table and ignores subsequent observations. To store the last instance in the hash object or to send an error to the log if there is a duplicate key, use the DUPLICATE argument tag. To allow duplicate values for each key, use the MULTIDATA argument tag.

You can then use the FIND method to search and retrieve data from the hash object if one data value exists for each key. Use the FIND_NEXT and FIND_PREV methods to search and retrieve data if multiple data items exist for each key.

For more information about the ADD, FIND, FIND_NEXT, and FIND_PREV methods, see the SAS Language Reference: Dictionary.

You can consolidate a FIND method and ADD method using the REF method. In the following example, you can reduce the amount of code from this:

rc = h.find();
if (rc ne 0) then
   rc = h.add();

to a single method call:

rc = h.ref();

For more information about the REF Method, see the SAS Language Reference: Dictionary.

Note:   You can also use the hash iterator object to retrieve the hash object data, one data item at a time, in forward and reverse order. For more information, see Using the Hash Iterator Object.


Example 1: Using the ADD and FIND Methods to Store and Retrieve Data

The following example uses the ADD method to store the data in the hash object and associate the data with the key. The FIND method is then used to retrieve the data that is associated with the key value Homer.

data _null_;
length d $20;
length k $20;

/* Declare the hash object and key and data variables */
if _N_ = 1 then do;
   declare hash h();
   rc = h.defineKey('k');
   rc = h.defineData('d');
   rc = h.defineDone();
end;

/* Define constant value for key and data */
k = 'Homer';
d = 'Odyssey';
/* Use the ADD method to add the key and data to the hash object */
rc = h.add();
if (rc ne 0) then
   put 'Add failed.';

/* Define constant value for key and data */
k = 'Joyce';
d = 'Ulysses';
/* Use the ADD method to add the key and data to the hash object */
rc = h.add();
if (rc ne 0) then
   put 'Add failed.';

k = 'Homer';
/* Use the FIND method to retrieve the data associated with 'Homer' key */
rc = h.find();
if (rc = 0) then
   put d=;
else
   put 'Key Homer not found.';
run;

The FIND method assigns the data value Odyssey, which is associated with the key value Homer, to the variable D.


Example 2: Loading a Data Set and Using the FIND Method to Retrieve Data

Assume the data set SMALL contains two numeric variables K (key) and S (data) and another data set, LARGE, contains a corresponding key variable K. The following code loads the SMALL data set into the hash object, and then searches the hash object for key matches on the variable K from the LARGE data set.

data match;
   length k 8;
   length s 8;
   if _N_ = 1 then do;
      /* load SMALL data set into the hash object */
      declare hash h(dataset: "work.small");
      /* define SMALL data set variable K as key and S as value */
      h.defineKey('k');
      h.defineData('s');
      h.defineDone();
      /* avoid uninitialized variable notes */
      call missing(k, s);
   end;

/* use the SET statement to iterate over the LARGE data set using */
/* keys in the LARGE data set to match keys in the hash object */
set large;
rc = h.find();
if (rc = 0) then output;
run;

The dataset argument tag specifies the SMALL data set whose keys and data will be read and loaded by the hash object during the DEFINEDONE method. The FIND method is then used to retrieve the data.


Maintaining Key Summaries

You can maintain a summary count for a hash object key by using the SUMINC argument tag. This argument tag instructs the hash object to allocate internal storage in each record for a summary value, which is updated each time that the record is used by a FIND, CHECK, or REF method. The SUMINC argument tag is given a DATA step variable, which holds the sum increment, that is, how much to add to the key summary for each reference to the key. The SUMINC value can be greater than, less than, or equal to 0.

The SUMINC value is also used to initialize the summary on an ADD method. Each time the ADD method is called, the summary for the added key is set to the current SUMINC value.

In the following example, COUNT is set to 1 before the ADD method is called, so the ADD method initializes the summary for K=99 to 1. Then, each time a new COUNT value is given, the subsequent FIND method adds that value to the key summary. In this example, one data value exists for each key. The SUM method retrieves the current value of the key summary and the value is stored in the DATA step variable TOTAL. If multiple items exist for each key, the SUMDUP method would retrieve the current value of the key summary.

data _null_;
 length k count 8;
 length total 8;
 dcl hash myhash(suminc: 'count');
 myhash.defineKey('k');
 myhash.defineDone();

 k = 99;
 count = 1;
 myhash.add();


/* COUNT is given the value 2.5 and the */ 
/* FIND sets the summary to 3.5*/ 
 count = 2.5;
 myhash.find();

/* COUNT is given the value 3 and the */
/* FIND sets the summary to 6.5. */
 count = 3;
 myhash.find();


/* The COUNT of -1 sets the summary to 5.5. */
 count = -1;
 myhash.find();

/* The SUM method gives the current value of */
/* the key summary to the variable TOTAL. */
 myhash.sum(sum: total);

/* The PUT statement prints total=5.5 in the log. */
 put total=;
 run;

In this example, a summary is maintained for each key value K=99 and K=100:

 k = 99;
 count = 1;
 myhash.add();
 /* key=99 summary is now 1 */

 k = 100;
 myhash.add();
 /* key=100 summary is now 1 */

 k = 99;
 myhash.find();
 /* key=99 summary is now 2 */

 count = 2;
 myhash.find();
 /* key=99 summary is now 4 */

 k = 100;
 myhash.find();
 /* key=100 summary is now 3 */

 myhash.sum(sum: total);
 put 'total for key 100 = ' total;

 k = 99;
 myhash.sum(sum:total);
 put 'total for key 99 = ' total;

The first PUT statement prints the summary for K=100:

total for key 100 = 3

And the second PUT statement prints the summary for K=99:

total for key 99 = 4

You can use key summaries in conjunction with the dataset argument tag. As the data set is read into the hash object using the DEFINEDONE method, all key summaries are set to the SUMINC value and all subsequent FIND, CHECK, or ADD methods change the corresponding key summaries.

declare hash myhash(suminc: "keycount", dataset: "work.mydata");

You can use key summaries for counting the number of occurrences of given keys. In the following example, the data set MYDATA is loaded into a hash object, and key summaries keep count of the number of times each key occurs in the data set KEYS. (The SUMINC variable COUNT has no value when MYDATA is loaded, so each key summary starts at the default initial value of zero.)

data mydata;
 input key;
datalines;
1
2
3
4
5
;
run;


data keys;
 input key;
datalines;
1
2
1
3
5
2
3
2
4
1
5
1
;
run;

data count;
 length total key 8;
 keep key total;

 declare hash myhash(suminc: "count", dataset:"mydata");
 myhash.defineKey('key');
 myhash.defineDone();
 count = 1;

 do while (not done);
   set keys end=done;
   rc = myhash.find();
 end;

 done = 0;
 do while (not done);
   set mydata end=done;
   rc = myhash.sum(sum: total);
   output;
 end;
 stop;
run;

Here is the output for the resulting data set.

  
        Obs    total    key
          1       4       1
          2       3       2
          3       2       3
          4       1       4
          5       2       5

For more information about the SUM method and the SUMDUP method, see the SAS Language Reference: Dictionary.


Replacing and Removing Data in the Hash Object

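You can remove or replace data that is stored in the hash object by using the REMOVE, REMOVEDUP, REPLACE, and REPLACEDUP methods. The REMOVE and REPLACE methods act on all data items that are associated with a key. The REMOVEDUP and REPLACEDUP methods act only on the current data item when multiple data items exist for a key.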

In the following example, the REPLACE method replaces the data Odyssey with Iliad, and the REMOVE method deletes the entire data entry associated with the Joyce key from the hash object.

data _null_;
length d $20;
length k $20;

/* Declare the hash object and key and data variables */
if _N_ = 1 then do;
   declare hash h();
   rc = h.defineKey('k');
   rc = h.defineData('d');
   rc = h.defineDone();
end;

/* Define constant value for key and data */
k = 'Joyce';
d = 'Ulysses';
/* Use the ADD method to add the key and data to the hash object */
rc = h.add();
if (rc ne 0) then
   put 'Add failed.';

/* Define constant value for key and data */
k = 'Homer';
d = 'Odyssey';
/* Use the ADD method to add the key and data to the hash object */
rc = h.add();
if (rc ne 0) then
   put 'Add failed.';

/* Use the REPLACE method to replace 'Odyssey' with 'Iliad' */
k = 'Homer';
d = 'Iliad';
rc = h.replace();
if (rc = 0) then
   put d=;
else
   put 'Replace not successful.';

/* Use the REMOVE method to remove the 'Joyce' key and data */
k = 'Joyce';
rc = h.remove();
if (rc = 0) then
   put k 'removed from hash object';
else
   put 'Deletion not successful.';

run;

The following lines are written to the SAS log.

d=Iliad
Joyce removed from hash object

Note:   If an associated hash iterator is pointing to the key, the REMOVE method will not remove the key or data from the hash object. An error message is issued to the log.

For more information on the REMOVE, REMOVEDUP, REPLACE, and REPLACEDUP methods, see the SAS Language Reference: Dictionary.


Saving Hash Object Data in a Data Set

You can create a data set that contains the data in a specified hash object by using the OUTPUT method. In the following example, two keys and data are added to the hash object and then output to the WORK.OUT data set.

data test;
length d1 8;
length d2 $20;
length k1 $20;
length k2 8;

/* Declare the hash object and two key and data variables */
if _N_ = 1 then do;
   declare hash h();
   rc = h.defineKey('k1', 'k2');
   rc = h.defineData('d1', 'd2');
   rc = h.defineDone();
end;

/* Define constant value for key and data */
k1 = 'Joyce';
k2 = 1001;
d1 = 3;
d2 = 'Ulysses';
rc = h.add();

/* Define constant value for key and data */
k1 = 'Homer';
k2 = 1002;
d1 = 5;
d2 = 'Odyssey';
rc = h.add();

/* Use the OUTPUT method to save the hash object data to the OUT data set */
rc = h.output(dataset: "work.out");
run;

proc print data=work.out;
run;

The following output shows the report that PROC PRINT generates.

Data Set Created from the Hash Object

                                The SAS System                               1

                             Obs    d1      d2

                              1      5    Odyssey
                              2      3    Ulysses

Note that the hash object keys are not stored as part of the output data set. If you want to include the keys in the output data set, you must define the keys as data in the DEFINEDATA method. In the previous example, the DEFINEDATA method would be written this way:

rc = h.defineData('k1', 'k2', 'd1', 'd2');

For more information on the OUTPUT Method, see SAS Language Reference: Dictionary.


Comparing Hash Objects

You can compare one hash object to another by using the EQUALS method. In the following example, two hash objects are compared. Note that the EQUALS method has two argument tags. The HASH argument tag is the name of the second hash object, and the RESULT argument tag is a numeric variable name that will hold the result of the comparison (1 if equal and zero if not equal).

length eq k 8;

declare hash myhash1();
myhash1.defineKey('k');
myhash1.defineDone();

declare hash myhash2();
myhash2.defineKey('k');
myhash2.defineDone();

rc = myhash1.equals(hash: 'myhash2', result: eq);
 

For more information about the EQUALS Method, see SAS Language Reference: Dictionary.
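To see the EQUALS result change, the comparison can be placed in a complete DATA step. The following is a sketch only: the hash object names H1 and H2 and the key values are assumptions, and the comparison is run once while both objects hold the same keys and again after an extra key is added to one of them.

```sas
data _null_;
   length eq k 8;

   declare hash h1();
   rc = h1.defineKey('k');
   rc = h1.defineDone();

   declare hash h2();
   rc = h2.defineKey('k');
   rc = h2.defineDone();

   /* Give both hash objects the same three keys */
   do k = 1 to 3;
      rc = h1.add();
      rc = h2.add();
   end;

   /* Compare while the two objects hold the same keys */
   rc = h1.equals(hash: 'h2', result: eq);
   put 'Same contents: ' eq=;

   /* Add a key to H2 only, then compare again */
   k = 4;
   rc = h2.add();
   rc = h1.equals(hash: 'h2', result: eq);
   put 'After extra key in H2: ' eq=;
run;
```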


Using Hash Object Attributes

You can use the DATA Step Component Interface to retrieve information from a hash object using an attribute. Use the following syntax for an attribute:

attribute_value=obj.attribute_name;

There are two attributes available to use with hash objects. NUM_ITEMS returns the number of items in a hash object and ITEM_SIZE returns the size (in bytes) of an item. The following example retrieves the number of items in a hash object:

n = myhash.num_items;

The following example retrieves the size of an item in a hash object:

s = myhash.item_size;

You can obtain an idea of how much memory the hash object is using with the ITEM_SIZE and NUM_ITEMS attributes. The ITEM_SIZE attribute does not reflect the initial overhead that the hash object requires, nor does it take into account any necessary internal alignments. Therefore, the use of ITEM_SIZE will not provide exact memory usage, but it will give a good approximation.
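One way to form that approximation is to multiply the two attributes. The sketch below assumes a hash object with a numeric key K and a character data variable D and stores 1,000 items; the variable APPROX_BYTES is a name invented for this example, and the result excludes the overhead and alignment noted above.

```sas
data _null_;
   length k 8 d $20;

   declare hash h();
   rc = h.defineKey('k');
   rc = h.defineData('d');
   rc = h.defineDone();

   /* Store 1,000 items with the same data value */
   d = 'sample';
   do k = 1 to 1000;
      rc = h.add();
   end;

   n = h.num_items;        /* number of items in the hash object */
   s = h.item_size;        /* size of one item, in bytes */
   approx_bytes = n * s;   /* rough estimate only */
   put n= s= approx_bytes=;
run;
```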

For more information about the NUM_ITEMS and ITEM_SIZE attributes, see the SAS Language Reference: Dictionary.
