The SIMILARITY Procedure

Getting Started: SIMILARITY Procedure

This section outlines the use of the SIMILARITY procedure and gives a cursory description of some of the analysis techniques that can be performed on time-stamped transactional data, time series, or sequentially ordered numeric data.

Given an input data set that contains numerous transaction variables recorded over time at no specific frequency, the SIMILARITY procedure can form equally spaced input and target time series as follows:

   PROC SIMILARITY DATA=<input-data-set>
                   OUT=<output-data-set>
                   OUTSUM=<summary-data-set>;
      ID <time-ID-variable> INTERVAL=<frequency>
                            ACCUMULATE=<statistic>;
      INPUT <input-time-stamp-variables>;
      TARGET <target-time-stamp-variables>;
   RUN;

The SIMILARITY procedure forms time series from the input time-stamped transactional data. It can provide results in output data sets or in other output formats using the Output Delivery System (ODS). The examples in this section are more fully illustrated in the section Examples: SIMILARITY Procedure.

Time-stamped transactional data are often recorded at no fixed interval, whereas many time series analysis techniques require observations at fixed time intervals. Therefore, the transactional data must be accumulated to form a fixed-interval time series before such techniques can be applied.

Suppose that a bank wants to analyze the transactions that are associated with each of its customers over time. Further, suppose that the data set WORK.TRANSACTIONS contains three variables that are related to the customer transactions (CUSTOMER, DATE, and WITHDRAWALS) and one variable that contains an example of fraudulent behavior (FRAUD).
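To make the layout concrete, the following DATA step sketches what such a data set might look like. Every value is invented for illustration, CUSTOMER is assumed to be a character identifier, and FRAUD is assumed to be a numeric amount that follows a known fraudulent pattern; none of these details come from the bank scenario itself. The rows are entered in CUSTOMER order because the later steps use BY CUSTOMER processing.

   /* Hypothetical sample data: all values are made up for illustration */
   data transactions;
      input customer $ date :date9. withdrawals fraud;
      format date date9.;
      datalines;
   A 01JAN2024 100  20
   A 01JAN2024  50   .
   A 03JAN2024 200 500
   B 02JAN2024  75  20
   B 02JAN2024  25 500
   ;
   run;

Note that customer A has two transactions on the same day and none on 02JAN2024, which is the kind of irregular spacing that accumulation is meant to handle.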

The following statements illustrate how to use the SIMILARITY procedure to accumulate time-stamped transactional data to form a daily time series based on the accumulated daily totals of each type of transaction (WITHDRAWALS and FRAUD):

   proc similarity data=transactions out=timedata;
      by customer;
      id date interval=day accumulate=total;
      input withdrawals;
      target fraud;
   run;

The OUT=TIMEDATA option specifies that the resulting time series data for each customer are to be stored in the data set WORK.TIMEDATA. The INTERVAL=DAY option specifies that the transactions are to be accumulated on a daily basis. The ACCUMULATE=TOTAL option specifies that the transactions within each day are to be summed. After the transactional data are accumulated into a time series format, the time series data can be normalized so that the "shape" or "profile" of each series is analyzed rather than its level or scale.

For example, the following statements build on the previous statements and demonstrate normalization of the accumulated time series:

   proc similarity data=transactions out=timedata;
      by customer;
      id date interval=day accumulate=total;
      input withdrawals / NORMALIZE=STANDARD;
      target fraud      / NORMALIZE=STANDARD;
   run;

The NORMALIZE=STANDARD option specifies that each accumulated time series observation is normalized by subtracting the mean and then dividing by the standard deviation of the accumulated time series. The WORK.TIMEDATA data set now contains the accumulated and normalized time series data for each customer.
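The description above can be written as a formula. In the following sketch, the symbols are introduced here for illustration only: y_t is the accumulated value at time t, T is the number of observations in the series, \bar{y} is the series mean, and s is the series standard deviation. The text above does not say which divisor (T or T-1) is used for s, so that detail is left open.

   z_t = \frac{y_t - \bar{y}}{s},
   \qquad
   \bar{y} = \frac{1}{T} \sum_{t=1}^{T} y_t

By construction, the normalized values z_t have a mean of zero and a standard deviation of one, so two series with very different magnitudes can be compared on shape alone.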

After the transactional data are accumulated into a time series format and normalized to a mean of zero and standard deviation of one, similarity analysis can be performed on the accumulated and normalized time series.

For example, the following statements build on the previous statements and demonstrate similarity analysis of the accumulated and normalized time series:

   proc similarity data=transactions
                   out=timedata OUTSUM=SUMMARY;
      by customer;
      id date interval=day accumulate=total;
      input withdrawals / normalize=standard;
      target fraud      / normalize=standard MEASURE=MABSDEV;
   run;

The MEASURE=MABSDEV option specifies that the accumulated and normalized time series data that are associated with the variables WITHDRAWALS and FRAUD are to be compared by using the mean absolute deviation. The OUTSUM=SUMMARY option specifies that the similarity analysis summary for each customer is to be stored in the data set WORK.SUMMARY.
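As a rough guide to what a mean absolute deviation measures, consider the simplest case, assumed here purely for illustration, in which the normalized input series x_t and the normalized target series y_t have the same length T and are compared point by point:

   \mathrm{MABSDEV} = \frac{1}{T} \sum_{t=1}^{T} \left| x_t - y_t \right|

Smaller values indicate that the two profiles are more alike. This is a sketch under the stated assumptions, not the procedure's definition: the procedure may align the two series differently before computing the measure, in which case the average is taken over the aligned pairs rather than over matching time indices.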