Using the Hash Object

Why Use the Hash Object?

The hash object provides an efficient, convenient mechanism for quick data storage and retrieval. The hash object stores and retrieves data based on lookup keys.
To use the DATA step Component Object Interface, follow these steps:
  1. Declare the hash object.
  2. Create an instance of (instantiate) the hash object.
  3. Initialize lookup keys and data.
After you declare and instantiate a hash object, you can perform many tasks, including these:
  • Store and retrieve data.
  • Maintain key summaries.
  • Replace and remove data.
  • Compare hash objects.
  • Output a data set that contains the data in the hash object.
For example, suppose that you have a large data set that contains numeric lab results, each associated with a unique patient number and a weight, and a small data set that contains patient numbers (a subset of those in the large data set). You can load the large data set into a hash object, using the unique patient number as the key and the weight values as the data. You can then iterate over the small data set, using each patient number to look up that patient in the hash object, and write the patients whose weight is over a certain value to a different data set.
Depending on the number of lookup keys and the size of the data set, the hash object lookup can be significantly faster than a standard format lookup.
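The patient lookup described above might be sketched as follows. The data set names (WORK.LABS, WORK.SUBSET, WORK.HEAVY), the variable names, the sample values, and the weight threshold of 80 are all assumptions made for illustration.

```sas
/* Hypothetical "large" data set: one weight per unique patient number */
data work.labs;
   input patient_id weight;
datalines;
1001 82.5
1002 64.0
1003 95.2
1004 71.3
;
run;

/* Hypothetical "small" data set: a subset of the patient numbers */
data work.subset;
   input patient_id;
datalines;
1001
1003
1004
;
run;

data work.heavy;
   length patient_id weight 8;
   drop rc;
   if _N_ = 1 then do;
      /* Load the large data set: patient number is the key, weight is the data */
      declare hash wt(dataset: 'work.labs');
      rc = wt.defineKey('patient_id');
      rc = wt.defineData('weight');
      rc = wt.defineDone();
      /* Avoid uninitialized variable notes */
      call missing(patient_id, weight);
   end;

   /* Iterate over the small data set and look up each patient */
   set work.subset;
   rc = wt.find();
   /* Keep only patients that were found and exceed the assumed threshold */
   if rc = 0 and weight > 80 then output;
run;
```

The hash object is declared inside the `_N_ = 1` block so that the large data set is loaded only once, on the first iteration of the DATA step.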

Declaring and Instantiating a Hash Object

You declare a hash object using the DECLARE statement. After you declare the new hash object, use the _NEW_ operator to instantiate the object. For example:
declare hash myhash;
myhash = _new_ hash();
The DECLARE statement tells the compiler that the object reference MYHASH is of type hash. At this point, you have declared only the object reference MYHASH. It has the potential to hold a component object of type hash. You should declare the hash object only once. The _NEW_ operator creates an instance of the hash object and assigns it to the object reference MYHASH.
There is an alternative to the two-step process of using the DECLARE statement and the _NEW_ operator to declare and instantiate a component object. You can use the DECLARE statement to declare and instantiate the component object in one step.
declare hash myhash();
The above statement is equivalent to the following code:
declare hash myhash;
myhash = _new_ hash();
For more information, see DECLARE Statement, Hash and Hash Iterator Objects in SAS Component Objects: Reference and the _NEW_ Operator, Hash or Hash Iterator Object in SAS Component Objects: Reference.

Initializing Hash Object Data Using a Constructor

When you create a hash object, you might want to provide initialization data. A constructor is a method that you can use to instantiate a hash object and initialize the hash object data.
The hash object constructor can have either of the following formats:
  • declare hash object_name(argument_tag-1: value-1
        <, ...argument_tag-n: value-n>);
  • object_name = _new_ hash(argument_tag-1: value-1
        <, ...argument_tag-n: value-n>);
For more information, see the DECLARE Statement, Hash and Hash Iterator Objects in SAS Component Objects: Reference and the _NEW_ Operator, Hash or Hash Iterator Object in SAS Component Objects: Reference.
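As an illustration of the two constructor formats, the following sketch passes the ORDERED and HASHEXP argument tags to each. The variable names, the key values, and the output data set names are hypothetical.

```sas
data _null_;
   length k 8 d $20;

   /* Format 1: declare and construct in a single DECLARE statement. */
   /* ORDERED: 'a' requests ascending key order; HASHEXP: 4 requests */
   /* 2**4 = 16 hash buckets.                                        */
   declare hash h1(ordered: 'a', hashexp: 4);
   rc = h1.defineKey('k');
   rc = h1.defineData('k', 'd');
   rc = h1.defineDone();

   /* Format 2: declare the object reference, then construct with _NEW_. */
   declare hash h2;
   h2 = _new_ hash(ordered: 'd');
   rc = h2.defineKey('k');
   rc = h2.defineData('k', 'd');
   rc = h2.defineDone();

   /* Add the same three items to both hash objects */
   do k = 3, 1, 2;
      d = cats('value', k);
      rc = h1.add();
      rc = h2.add();
   end;

   /* Hypothetical output data set names */
   rc = h1.output(dataset: 'work.ascending');
   rc = h2.output(dataset: 'work.descending');
run;
```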

Defining Keys and Data

The hash object uses lookup keys to store and retrieve data. The keys and the data are DATA step variables that you use to initialize the hash object by using dot notation method calls. You define a key by passing the key variable name to the DEFINEKEY method, and you define data by passing the data variable name to the DEFINEDATA method. After you have defined all key and data variables, call the DEFINEDONE method. Keys and data can consist of any number of character or numeric DATA step variables.
For example, the following code initializes a character key and a character data variable.
length d $20;
length k $20;

if _N_ = 1 then do;
   declare hash h();
   rc = h.defineKey('k');
   rc = h.defineData('d');
   rc = h.defineDone();
end;
You can have multiple key and data variables, but by default the entire (composite) key must be unique. You can also store more than one data variable with a particular key. For example, you could modify the previous example to store auxiliary numeric values with the character key and data. In this example, each key and each data item consists of a character value and a numeric value.
length d1 8;
length d2 $20;
length k1 $20;
length k2 8;

if _N_ = 1 then do;
   declare hash h();
   rc = h.defineKey('k1', 'k2');
   rc = h.defineData('d1', 'd2');
   rc = h.defineDone();
end;
For more information, see the DEFINEDATA Method in SAS Component Objects: Reference, DEFINEDONE Method in SAS Component Objects: Reference, and the DEFINEKEY Method in SAS Component Objects: Reference.
Note: The hash object does not assign values to key variables (for example, h.find(key:'abc')), and the SAS compiler cannot detect the data variable assignments that are performed by the hash object and the hash iterator. Therefore, if no assignment to a key or data variable appears in the program, SAS issues a note stating that the variable is uninitialized. To avoid receiving these notes, you can perform one of the following actions:
  • Set the NONOTES system option.
  • Provide an initial assignment statement (typically to a missing value) for each key and data variable.
  • Use the CALL MISSING routine with all the key and data variables as parameters. Here is an example.
    length d $20;
    length k $20;
    
    if _N_ = 1 then do;
       declare hash h();
       rc = h.defineKey('k');
       rc = h.defineData('d');
       rc = h.defineDone();
       call missing(k, d);
    end;

Non-Unique Key and Data Pairs

By default, all of the keys in a hash object are unique. This means one set of data variables exists for each key. In some situations, you might want to have duplicate keys in the hash object, that is, associate more than one set of data variables with a key.
For example, assume that the key is a patient ID and the data is a visit date. If the patient were to visit multiple times, multiple visit dates would be associated with the patient ID. When you create a hash object with the MULTIDATA: "YES" argument tag, multiple sets of the data variables are associated with the key.
If the data set contains duplicate keys, by default, the first instance is stored in the hash object and subsequent instances are ignored. To store the last instance in the hash object, use the DUPLICATE argument tag with the value 'REPLACE'. With the value 'ERROR', the DUPLICATE argument tag instead writes an error to the SAS log if there is a duplicate key.
However, the hash object allows storage of multiple values for each key if you use the MULTIDATA argument tag in the DECLARE statement or _NEW_ operator. The hash object keeps the multiple values in a list that is associated with the key. This list can be traversed and manipulated by using several methods such as HAS_NEXT or FIND_NEXT.
To traverse a multiple data item list, you must know the current list item. Start by calling the FIND method for a given key. The FIND method sets the current list item. Then to determine whether the key has multiple data values, call the HAS_NEXT method. After you have determined that the key has another data value, you can retrieve that value with the FIND_NEXT method. The FIND_NEXT method sets the current list item to the next item in the list and sets the corresponding data variable or variables for that item.
In addition to moving forward through the list for a given key, you can loop backwards through the list by using the HAS_PREV and FIND_PREV methods in a similar manner.
Note: For SAS 9.2 Phase 2 and later, the items in a multiple data item list are maintained in the order in which you insert them.
For more information about these and other methods associated with non-unique key and data pairs, see Dictionary of Hash and Hash Iterator Object Language Elements in SAS Component Objects: Reference.
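The traversal described above might be sketched as follows, using the patient ID and visit date scenario. The variable names, the sample values, and the result variable MORE are assumptions made for illustration.

```sas
data _null_;
   length patient_id visit_date more 8;
   format visit_date date9.;

   /* MULTIDATA: 'y' allows several data items for each key */
   declare hash visits(multidata: 'y');
   rc = visits.defineKey('patient_id');
   rc = visits.defineData('visit_date');
   rc = visits.defineDone();

   /* Hypothetical sample data: patient 101 has two visits */
   patient_id = 101; visit_date = '05JAN2024'd; rc = visits.add();
   patient_id = 101; visit_date = '19FEB2024'd; rc = visits.add();
   patient_id = 202; visit_date = '02MAR2024'd; rc = visits.add();

   /* FIND sets the current list item for the given key */
   patient_id = 101;
   rc = visits.find();
   if rc = 0 then do;
      put patient_id= visit_date=;
      /* HAS_NEXT reports whether another data item exists for the key */
      rc = visits.has_next(result: more);
      do while (more ne 0);
         /* FIND_NEXT moves to the next item and sets VISIT_DATE */
         rc = visits.find_next();
         put patient_id= visit_date=;
         rc = visits.has_next(result: more);
      end;
   end;
run;
```

Because this DATA step has no SET or INPUT statement, it executes only once, so the declaration does not need to be guarded with `_N_ = 1`.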

Storing and Retrieving Data

How to Store and Retrieve Data

After you initialize the hash object's key and data variables, you can store data in the hash object using the ADD method, or you can use the dataset argument tag to load a data set into the hash object. If you use the dataset argument tag, and if the data set contains more than one observation with the same value of the key, by default, SAS keeps the first observation in the hash table and ignores subsequent observations. To store the last instance in the hash object or to send an error to the log if there is a duplicate key, use the DUPLICATE argument tag. To allow duplicate values for each key, use the MULTIDATA argument tag.
You can then use the FIND method to search and retrieve data from the hash object if one data value exists for each key. Use the FIND_NEXT and FIND_PREV methods to search and retrieve data if multiple data items exist for each key.
For more information, see ADD Method in SAS Component Objects: Reference, FIND Method in SAS Component Objects: Reference, FIND_NEXT Method in SAS Component Objects: Reference, and the FIND_PREV Method in SAS Component Objects: Reference.
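The following sketch shows one way to combine the dataset and DUPLICATE argument tags. The contents of the WORK.SMALL data set are hypothetical, and the key value 1 deliberately appears twice.

```sas
/* Hypothetical sample data: the key K=1 appears in two observations */
data work.small;
   input k s;
datalines;
1 10
2 20
1 30
;
run;

data _null_;
   length k s 8;

   /* DUPLICATE: 'r' stores the last observation for a repeated key */
   /* instead of the first one, which is the default behavior.      */
   declare hash h(dataset: 'work.small', duplicate: 'r');
   rc = h.defineKey('k');
   rc = h.defineData('s');
   rc = h.defineDone();
   call missing(k, s);

   /* Retrieve the value that was kept for the duplicated key */
   k = 1;
   rc = h.find();
   if rc = 0 then put k= s=;
run;
```

Changing the tag value to 'e' would instead report the duplicate key as an error in the log.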
You can consolidate a FIND method and ADD method using the REF method. In the following example, you can reduce the amount of code from this:
rc = h.find();
if (rc ne 0) then
   rc = h.add();
to a single method call:
rc = h.ref();
For more information, see the REF Method in SAS Component Objects: Reference.
Note: You can also use the hash iterator object to retrieve the hash object data, one data item at a time, in forward and reverse order. For more information, see Using the Hash Iterator Object.

Example 1: Using the ADD and FIND Methods to Store and Retrieve Data

The following example uses the ADD method to store the data in the hash object and associate the data with the key. The FIND method is then used to retrieve the data that is associated with the key value Homer.
data _null_;
length d $20;
length k $20;

/* Declare the hash object and key and data variables */
if _N_ = 1 then do;
   declare hash h();
   rc = h.defineKey('k');
   rc = h.defineData('d');
   rc = h.defineDone();
end;

/* Define constant value for key and data */
k = 'Homer';
d = 'Odyssey';
/* Use the ADD method to add the key and data to the hash object */
rc = h.add();
if (rc ne 0) then
   put 'Add failed.';

/* Define constant value for key and data */
k = 'Joyce';
d = 'Ulysses';
/* Use the ADD method to add the key and data to the hash object */
rc = h.add();
if (rc ne 0) then
   put 'Add failed.';

k = 'Homer';
/* Use the FIND method to retrieve the data associated with 'Homer' key */
rc = h.find();
if (rc = 0) then
   put d=;
else
   put 'Key Homer not found.';
run;
The FIND method assigns the data value Odyssey, which is associated with the key value Homer, to the variable D.
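As a variation on this example, the key and data values can be passed directly to the methods with the KEY and DATA argument tags instead of being assigned to the DATA step variables first. This is only a sketch of that alternative. Because the key is passed as a literal, the variable K is never assigned, which is why CALL MISSING is used to avoid the uninitialized variable note.

```sas
data _null_;
   length k d $20;

   declare hash h();
   rc = h.defineKey('k');
   rc = h.defineData('d');
   rc = h.defineDone();
   /* Avoid uninitialized variable notes */
   call missing(k, d);

   /* Pass the key and data values directly to the ADD method */
   rc = h.add(key: 'Homer', data: 'Odyssey');
   rc = h.add(key: 'Joyce', data: 'Ulysses');

   /* Look up a literal key; on success FIND assigns the data to D */
   rc = h.find(key: 'Homer');
   if (rc = 0) then
      put d=;
   else
      put 'Key Homer not found.';
run;
```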

Example 2: Loading a Data Set and Using the FIND Method to Retrieve Data

Assume the data set SMALL contains two numeric variables K (key) and S (data) and another data set, LARGE, contains a corresponding key variable K. The following code loads the SMALL data set into the hash object, and then searches the hash object for key matches on the variable K from the LARGE data set.
data match;
   length k 8;
   length s 8;
   if _N_ = 1 then do;
      /* load SMALL data set into the hash object */
      declare hash h(dataset: "work.small");
      /* define SMALL data set variable K as key and S as value */
      h.defineKey('k');
      h.defineData('s');
      h.defineDone();
      /* avoid uninitialized variable notes */
      call missing(k, s);
   end;

/* use the SET statement to iterate over the LARGE data set using */
/* keys in the LARGE data set to match keys in the hash object */
set large;
rc = h.find();
if (rc = 0) then output;
run;
The dataset argument tag specifies the SMALL data set whose keys and data are read and loaded by the hash object during the DEFINEDONE method. The FIND method is then used to retrieve the data.

Maintaining Key Summaries

You can maintain a summary count for a hash object key by using the SUMINC argument tag when you declare the hash object. The tag value is a string expression that resolves to the name of a numeric DATA step variable (the SUMINC variable).
This SUMINC tag instructs the hash object to allocate internal storage for maintaining a summary value for each key.
The summary value of a hash key is initialized to the value of the SUMINC variable whenever the ADD or REPLACE method is used.
The summary value of a hash key is incremented by the value of the SUMINC variable whenever the FIND, CHECK, or REF method is used.
Note that the SUMINC variable can be negative, positive, or zero valued. The variable does not need to be an integer. The SUMINC value for a key is zero by default.
In the following example, the initial ADD method sets the key summary for K=99 to 1, the value of COUNT at the time of the ADD. Then, each time COUNT is given a new value, the subsequent FIND method adds that value to the key summary. In this example, one data value exists for each key. The SUM method retrieves the current value of the key summary and stores it in the DATA step variable TOTAL. If multiple items exist for each key, use the SUMDUP method to retrieve the current value of the key summary.
data _null_;
 length k count 8;
 length total 8;
 dcl hash myhash(suminc: 'count');
 myhash.defineKey('k');
 myhash.defineDone();

 k = 99;
 count = 1;
 myhash.add();


/* COUNT is given the value 2.5 and the */ 
/* FIND sets the summary to 3.5*/ 
 count = 2.5;
 myhash.find();

/* The COUNT of 3 is added by the FIND, which */
/* sets the summary to 6.5. */
 count = 3;
 myhash.find();


/* The COUNT of -1 sets the summary to 5.5. */
 count = -1;
 myhash.find();

/* The SUM method gives the current value of */
/* the key summary to the variable TOTAL. */
 myhash.sum(sum: total);

/* The PUT statement prints total=5.5 in the log. */
 put total=;
 run;
In this example, a summary is maintained for each key value K=99 and K=100:
 k = 99;
 count = 1;
 myhash.add();
 /* key=99 summary is now 1 */

 k = 100;
 myhash.add();
 /* key=100 summary is now 1 */

 k = 99;
 myhash.find();
 /* key=99 summary is now 2 */

 count = 2;
 myhash.find();
 /* key=99 summary is now 4 */

 k = 100;
 myhash.find();
 /* key=100 summary is now 3 */

 myhash.sum(sum: total);
 put 'total for key 100 = ' total;

 k = 99;
 myhash.sum(sum: total);
 put 'total for key 99 = ' total;
The first PUT statement prints the summary for K=100:
total for key 100 = 3
And the second PUT statement prints the summary for K=99:
total for key 99 = 4
You can use key summaries in conjunction with the dataset argument tag. As the DEFINEDONE method reads the data set into the hash object, all key summaries are set to the SUMINC value. Subsequent FIND, CHECK, or ADD method calls then change the corresponding key summaries.
declare hash myhash(suminc: "keycount", dataset: "work.mydata");
You can use key summaries to count the number of occurrences of given keys. In the following example, the data set MYDATA is loaded into a hash object, and key summaries keep count of the number of times each of its keys occurs in the data set KEYS. (The SUMINC variable is not set to a value when the data set is loaded, so the default initial value of zero is used.)
data mydata;
 input key;
datalines;
1
2
3
4
5
;
run;


data keys;
 input key;
datalines;
1
2
1
3
5
2
3
2
4
1
5
1
;
run;

data count;
 length total key 8;
 keep key total;

 declare hash myhash(suminc: "count", dataset:"mydata");
 myhash.defineKey('key');
 myhash.defineDone();
 count = 1;

 do while (not done);
   set keys end=done;
   rc = myhash.find();
 end;

 done = 0;
 do while (not done);
   set mydata end=done;
   rc = myhash.sum(sum: total);
   output;
 end;
 stop;
run;
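The resulting data set, COUNT, contains one observation for each key in MYDATA, with TOTAL holding the number of times that key occurs in KEYS.
[Output: Key Summary Output]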
For more information, see the SUM Method in SAS Component Objects: Reference and the SUMDUP Method in SAS Component Objects: Reference.

Replacing and Removing Data in the Hash Object

You can remove or replace data that is stored in the hash object using the following methods:
  • Use the REMOVE method to remove all data items that are associated with a key.
  • Use the REPLACE method to replace all data items that are associated with a key.
  • Use the REMOVEDUP method to remove only the current data item.
  • Use the REPLACEDUP method to replace only the current data item.
In the following example, the REPLACE method replaces the data Odyssey with Iliad, and the REMOVE method deletes the entire data entry associated with the Joyce key from the hash object.
data _null_;
length d $20;
length k $20;

/* Declare the hash object and key and data variables */
if _N_ = 1 then do;
   declare hash h();
   rc = h.defineKey('k');
   rc = h.defineData('d');
   rc = h.defineDone();
end;

/* Define constant value for key and data */
k = 'Joyce';
d = 'Ulysses';
/* Use the ADD method to add the key and data to the hash object */
rc = h.add();
if (rc ne 0) then
   put 'Add failed.';

/* Define constant value for key and data */
k = 'Homer';
d = 'Odyssey';
/* Use the ADD method to add the key and data to the hash object */
rc = h.add();
if (rc ne 0) then
   put 'Add failed.';

/* Use the REPLACE method to replace 'Odyssey' with 'Iliad' */
k = 'Homer';
d = 'Iliad';
rc = h.replace();
if (rc = 0) then
   put d=;
else
   put 'Replace not successful.';

/* Use the REMOVE method to remove the 'Joyce' key and data */
k = 'Joyce';
rc = h.remove();
if (rc = 0) then
   put k 'removed from hash object';
else
   put 'Deletion not successful.';

run;
The following lines are written to the SAS log.
d=Iliad
Joyce removed from hash object
Note: If an associated hash iterator is pointing to the key, the REMOVE method does not remove the key or data from the hash object. An error message is issued to the log.
For more information, see the REMOVE Method in SAS Component Objects: Reference, REMOVEDUP Method in SAS Component Objects: Reference, REPLACE Method in SAS Component Objects: Reference, and the REPLACEDUP Method in SAS Component Objects: Reference.
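When a key has multiple data items, REPLACEDUP operates on the current list item only. The following sketch walks the list for one key, corrects a single misspelled item, and then uses REMOVE to delete every item for that key. The sample values, including the deliberate misspelling, are hypothetical.

```sas
data _null_;
   length k d $20 n 8;

   /* MULTIDATA: 'y' allows several data items for each key */
   declare hash h(multidata: 'y');
   rc = h.defineKey('k');
   rc = h.defineData('d');
   rc = h.defineDone();

   /* Hypothetical sample data: the second title is misspelled on purpose */
   k = 'Homer'; d = 'Iliad';   rc = h.add();
   k = 'Homer'; d = 'Odyssee'; rc = h.add();

   /* FIND sets the current list item; FIND_NEXT advances through the list */
   rc = h.find();
   do while (rc = 0);
      if d = 'Odyssee' then do;
         d = 'Odyssey';
         /* Replace only the current data item */
         rc = h.replacedup();
      end;
      rc = h.find_next();
   end;

   /* REMOVE deletes all data items that are stored under the key */
   k = 'Homer';
   rc = h.remove();
   n = h.num_items;
   put 'Items left after REMOVE: ' n;
run;
```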

Saving Hash Object Data in a Data Set

You can create a data set that contains the data in a specified hash object by using the OUTPUT method. In the following example, two keys and data are added to the hash object and then output to the WORK.OUT data set.
options pageno=1 nodate;
data test;
length d1 8;
length d2 $20;
length k1 $20;
length k2 8;

/* Declare the hash object and two key and data variables */
if _N_ = 1 then do;
   declare hash h();
   rc = h.defineKey('k1', 'k2');
   rc = h.defineData('d1', 'd2');
   rc = h.defineDone();
end;

/* Define constant value for key and data */
k1 = 'Joyce';
k2 = 1001;
d1 = 3;
d2 = 'Ulysses';
rc = h.add();

/* Define constant value for key and data */
k1 = 'Homer';
k2 = 1002;
d1 = 5;
d2 = 'Odyssey';
rc = h.add();

/* Use the OUTPUT method to save the hash object data to the OUT data set */
rc = h.output(dataset: "work.out");
run;

proc print data=work.out;
run;
PROC PRINT generates a report of the WORK.OUT data set.
[Output: Data Set Created from the Hash Object]
Note that the hash object keys are not stored as part of the output data set. If you want to include the keys in the output data set, you must define the keys as data in the DEFINEDATA method. In the previous example, the DEFINEDATA method would be written this way:
rc = h.defineData('k1', 'k2', 'd1', 'd2');
For more information, see the OUTPUT Method in SAS Component Objects: Reference.

Comparing Hash Objects

You can compare one hash object to another by using the EQUALS method. In the following example, two hash objects are compared. The EQUALS method has two argument tags. The HASH argument tag is the name of the second hash object. The RESULT argument tag is a numeric variable name that holds the result of the comparison (1 if equal and 0 if not equal).
length eq k 8;

declare hash myhash1();
myhash1.defineKey('k');
myhash1.defineDone();

declare hash myhash2();
myhash2.defineKey('k');
myhash2.defineDone();

rc = myhash1.equals(hash: 'myhash2', result: eq);
 
For more information, see the EQUALS Method in SAS Component Objects: Reference.
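The fragment above can be placed in a complete DATA step. The following sketch compares two hash objects twice: once when they hold the same key, and again after an extra key is added to only one of them. The hash object names and key values are hypothetical.

```sas
data _null_;
   length eq k 8;

   declare hash myhash1();
   rc = myhash1.defineKey('k');
   rc = myhash1.defineDone();

   declare hash myhash2();
   rc = myhash2.defineKey('k');
   rc = myhash2.defineDone();

   /* Add the same key to both hash objects */
   k = 1;
   rc = myhash1.add();
   rc = myhash2.add();

   /* EQ receives 1 if the hash objects are equal and 0 if they are not */
   rc = myhash1.equals(hash: 'myhash2', result: eq);
   put 'Same contents: ' eq=;

   /* Add a key to the first hash object only, then compare again */
   k = 2;
   rc = myhash1.add();
   rc = myhash1.equals(hash: 'myhash2', result: eq);
   put 'After adding a key to MYHASH1 only: ' eq=;
run;
```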

Using Hash Object Attributes

You can use the DATA step Component Object Interface to retrieve information from a hash object by using an attribute. Use the following syntax for an attribute:
attribute_value=obj.attribute_name;
There are two attributes available to use with hash objects. NUM_ITEMS returns the number of items in a hash object and ITEM_SIZE returns the size (in bytes) of an item. The following example retrieves the number of items in a hash object:
n = myhash.num_items;
The following example retrieves the size of an item in a hash object:
s = myhash.item_size;
You can obtain an idea of how much memory the hash object is using with the ITEM_SIZE and NUM_ITEMS attributes. The ITEM_SIZE attribute does not reflect the initial overhead that the hash object requires, nor does it take into account any necessary internal alignments. Therefore, the use of ITEM_SIZE does not provide exact memory usage, but it gives a good approximation.
For more information, see the NUM_ITEMS Attribute in SAS Component Objects: Reference and the ITEM_SIZE Attribute in SAS Component Objects: Reference.
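As a rough illustration, the two attributes can be multiplied to estimate the memory that the stored items use. In the following sketch, the variable names, the number of items, and the data values are hypothetical, and the result is only an approximation for the reasons given above.

```sas
data _null_;
   length k 8 d $20;

   declare hash myhash();
   rc = myhash.defineKey('k');
   rc = myhash.defineData('d');
   rc = myhash.defineDone();

   /* Load 1000 hypothetical items */
   do k = 1 to 1000;
      d = cats('item', k);
      rc = myhash.add();
   end;

   n = myhash.num_items;   /* number of items in the hash object */
   s = myhash.item_size;   /* size of one item, in bytes */

   /* Approximate item storage; excludes initial overhead and alignment */
   approx_bytes = n * s;
   put n= s= approx_bytes=;
run;
```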