The SIMILARITY Procedure
User-Defined Functions and Subroutines
A user-defined routine can be written in the SAS language by using the FCMP procedure, or in the C language by using both the FCMP procedure and the PROTO procedure. The SIMILARITY procedure cannot use C language routines directly; it can use only SAS language routines, which might or might not call C language routines. Creating user-defined routines is described more completely in the FCMP procedure and PROTO procedure documentation. The FCMP and PROTO procedures are part of Base SAS software.
The SAS language provides integrated memory management and exception handling such as operations on missing values. The C language provides flexibility and allows the integration of existing C language libraries. However, proper memory management and exception handling are solely the responsibility of the user. Additionally, the support for standard C libraries is restricted. If given a choice, it is highly recommended that user-defined functions and subroutines be written in the SAS language using the FCMP procedure.
For each task that supports user-defined routines (transformation, normalization, scaling, and similarity measurement), the following sections describe the required subroutine or function signature and provide examples of using a user-defined routine with the SIMILARITY procedure.
A user-defined transformation subroutine has the following subroutine signature:
SUBROUTINE <SUBROUTINE-NAME> ( <ARRAY-NAME>[*] );
where the array-name is the time series to be transformed.
For example, to duplicate the functionality of the built-in TRANSFORM=LOG option in the INPUT and TARGET statements, the following SAS statements create a user-defined version of this transformation called MYTRANSFORM and store this subroutine in the catalog SASUSER.MYSIMILAR.
   proc fcmp outlib=sasuser.mysimilar.package;
      subroutine mytransform( series[*] );
         outargs series;
         length = DIM(series);
         do i = 1 to length;
            value = series[i];
            /* the log is defined only for positive values */
            if value > 0 then do;
               series[i] = log( value );
            end;
            else do;
               series[i] = .;
            end;
         end;
      endsub;
   run;
This user-defined subroutine can be specified in the TRANSFORM= option of the INPUT or TARGET statement as follows:
   options cmplib = sasuser.mysimilar;

   proc similarity ...;
      ...
      input myinput / transform=mytransform;
      target mytarget / transform=mytransform;
      ...
   run;
A user-defined normalization subroutine has the following signature:
SUBROUTINE <SUBROUTINE-NAME> ( <ARRAY-NAME>[*] );
where the array-name is the sequence to be normalized.
For example, to duplicate the functionality of the built-in NORMALIZE=ABSOLUTE option in the INPUT and TARGET statements, the following SAS statements create a user-defined version of this normalization called MYNORMALIZE and store this subroutine in the catalog SASUSER.MYSIMILAR.
   proc fcmp outlib=sasuser.mysimilar.package;
      subroutine mynormalize( sequence[*] );
         outargs sequence;
         length = DIM(sequence);
         minimum = .;
         maximum = .;
         /* find the range of the nonmissing values */
         do i = 1 to length;
            value = sequence[i];
            if nmiss(minimum) | nmiss(maximum) then do;
               minimum = value;
               maximum = value;
            end;
            if nmiss(value) = 0 then do;
               if value < minimum then minimum = value;
               if value > maximum then maximum = value;
            end;
         end;
         /* map each nonmissing value onto the range */
         do i = 1 to length;
            value = sequence[i];
            if nmiss( value ) | minimum > maximum then do;
               sequence[i] = .;
            end;
            else do;
               sequence[i] = (value - minimum) / (maximum - minimum);
            end;
         end;
      endsub;
   run;
This user-defined subroutine can be specified in the NORMALIZE= option of the INPUT or TARGET statement as follows:
   options cmplib = sasuser.mysimilar;

   proc similarity ...;
      ...
      input myinput / normalize=mynormalize;
      target mytarget / normalize=mynormalize;
      ...
   run;
A user-defined scaling subroutine has the following signature:
SUBROUTINE <SUBROUTINE-NAME> ( <ARRAY-NAME>[*], <ARRAY-NAME>[*] );
where the first array-name is the target sequence and the second array-name is the input sequence to be scaled.
For example, to duplicate the functionality of the built-in SCALE=ABSOLUTE option in the INPUT statement, the following SAS statements create a user-defined version of this scaling called MYSCALE and store this subroutine in the catalog SASUSER.MYSIMILAR.
   proc fcmp outlib=sasuser.mysimilar.package;
      subroutine myscale( target[*], input[*] );
         outargs input;
         length = DIM(target);
         minimum = .;
         maximum = .;
         /* find the range of the nonmissing target values */
         do i = 1 to length;
            value = target[i];
            if nmiss(minimum) | nmiss(maximum) then do;
               minimum = value;
               maximum = value;
            end;
            if nmiss(value) = 0 then do;
               if value < minimum then minimum = value;
               if value > maximum then maximum = value;
            end;
         end;
         /* the input sequence can differ in length from the target */
         length = DIM(input);
         do i = 1 to length;
            value = input[i];
            if nmiss( value ) | minimum > maximum then do;
               input[i] = .;
            end;
            else do;
               input[i] = (value - minimum) / (maximum - minimum);
            end;
         end;
      endsub;
   run;
This user-defined subroutine can be specified in the SCALE= option of the INPUT statement as follows:
   options cmplib=sasuser.mysimilar;

   proc similarity ...;
      ...
      input myinput / scale=myscale;
      ...
   run;
A user-defined similarity measure function has the following signature:
FUNCTION <FUNCTION-NAME> ( <ARRAY-NAME>[*], <ARRAY-NAME>[*] );
where the first array-name is the target sequence and the second array-name is the input sequence. The return value of the function is the similarity measure associated with the target sequence and the input sequence.
For example, to duplicate the functionality of the built-in MEASURE=ABSDEV option in the TARGET statement with no warping, the following SAS statements create a user-defined version of this measure called MYMEASURE and store this function in the catalog SASUSER.MYSIMILAR.
   proc fcmp outlib=sasuser.mysimilar.package;
      function mymeasure( target[*], input[*] );
         length = min(DIM(target), DIM(input));
         sum = 0;
         num = 0;
         do i = 1 to length;
            x = input[i];
            w = target[i];
            if nmiss(x) = 0 & nmiss(w) = 0 then do;
               d = x - w;
               sum = sum + abs(d);
               /* count the pairs that contribute to the sum */
               num = num + 1;
            end;
         end;
         /* no usable pairs: the measure is missing */
         if num <= 0 then return(.);
         return(sum);
      endsub;
   run;
This user-defined function can be specified in the MEASURE= option of the TARGET statement as follows:
   options cmplib=sasuser.mysimilar;

   proc similarity ...;
      ...
      target mytarget / measure=mymeasure;
      ...
   run;
For another example, to duplicate the functionality of the built-in MEASURE=SQRDEV and MEASURE=ABSDEV options by using the C language, the following SAS statements create user-defined C language versions of these measures called DTW_SQRDEV_C and DTW_ABSDEV_C and store these functions in the catalog SASUSER.CSIMIL.CFUNCS. DTW refers to dynamic time warping. These C language functions can then be called by SAS language functions and subroutines.
   proc proto package=sasuser.csimil.cfuncs;

      mapmiss double = 999999999;

      double dtw_sqrdev_c( double * target / iotype=input, int targetLength,
                           double * input  / iotype=input, int inputLength );

      externc dtw_sqrdev_c;
         double dtw_sqrdev_c( double * target, int targetLength,
                              double * input,  int inputLength )
         {
            int i,j;
            double x,w,d;
            /* both work rows hold one cell per target element */
            double * prev = (double *)malloc( sizeof(double)*targetLength);
            double * curr = (double *)malloc( sizeof(double)*targetLength);
            if ( prev == 0 || curr == 0 ) {
               free( (char*) prev);
               free( (char*) curr);
               return 999999999;
            }

            /* first row: costs accumulate along the target */
            x = input[0];
            for ( j=0; j<targetLength; j++ ) {
               w = target[j];
               d = x - w;
               d = d*d;
               if ( j == 0 ) prev[j] = d;
               else prev[j] = d + prev[j-1];
            }

            for (i=1; i<inputLength; i++ ) {
               x = input[i];
               j = 0;
               w = target[j];
               d = x - w;
               d = d*d;
               curr[j] = d + prev[j];
               for (j=1; j<targetLength; j++ ) {
                  w = target[j];
                  d = x - w;
                  d = d*d;
                  /* cheapest of the three neighboring cells */
                  curr[j] = d + fmin( prev[j], fmin( prev[j-1], curr[j-1] ));
               }
               /* the finished row becomes the previous row */
               for ( j=0; j<targetLength; j++ ) prev[j] = curr[j];
            }

            /* prev holds the last completed row, even when inputLength is 1 */
            d = prev[targetLength-1];
            free( (char*) prev);
            free( (char*) curr);
            return( d );
         }
      externcend;

      double dtw_absdev_c( double * target / iotype=input, int targetLength,
                           double * input  / iotype=input, int inputLength );

      externc dtw_absdev_c;
         double dtw_absdev_c( double * target, int targetLength,
                              double * input,  int inputLength )
         {
            int i,j;
            double x,w,d;
            /* both work rows hold one cell per target element */
            double * prev = (double *)malloc( sizeof(double)*targetLength);
            double * curr = (double *)malloc( sizeof(double)*targetLength);
            if ( prev == 0 || curr == 0 ) {
               free( (char*) prev);
               free( (char*) curr);
               return 999999999;
            }

            /* first row: costs accumulate along the target */
            x = input[0];
            for ( j=0; j<targetLength; j++ ) {
               w = target[j];
               d = x - w;
               d = fabs(d);
               if ( j == 0 ) prev[j] = d;
               else prev[j] = d + prev[j-1];
            }

            for (i=1; i<inputLength; i++ ) {
               x = input[i];
               j = 0;
               w = target[j];
               d = x - w;
               d = fabs(d);
               curr[j] = d + prev[j];
               for (j=1; j<targetLength; j++ ) {
                  w = target[j];
                  d = x - w;
                  d = fabs(d);
                  /* cheapest of the three neighboring cells */
                  curr[j] = d + fmin( prev[j], fmin( prev[j-1], curr[j-1] ));
               }
               /* the finished row becomes the previous row */
               for ( j=0; j<targetLength; j++ ) prev[j] = curr[j];
            }

            /* prev holds the last completed row, even when inputLength is 1 */
            d = prev[targetLength-1];
            free( (char*) prev);
            free( (char*) curr);
            return( d );
         }
      externcend;
   run;
The preceding SAS statements create two C language functions that can then be used in SAS language functions and subroutines. However, the SIMILARITY procedure cannot use these functions directly. To use them in the SIMILARITY procedure, you must create two SAS language functions that call the two C language functions. The following SAS statements create two user-defined SAS language versions of these measures called DTW_SQRDEV and DTW_ABSDEV and store these functions in the catalog SASUSER.MYSIMILAR.FUNCS. These SAS language functions call the previously created C language functions and can then be used by the SIMILARITY procedure.
   proc fcmp outlib=sasuser.mysimilar.funcs
             inlib=sasuser.csimil;   /* location written by PROC PROTO above */

      function dtw_sqrdev( target[*], input[*] );
         dev = dtw_sqrdev_c(target,DIM(target),input,DIM(input));
         return( dev );
      endsub;

      function dtw_absdev( target[*], input[*] );
         dev = dtw_absdev_c(target,DIM(target),input,DIM(input));
         return( dev );
      endsub;
   run;
These user-defined functions can be specified in the MEASURE= option of the TARGET statement as follows:
   options cmplib=sasuser.mysimilar;

   proc similarity ...;
      ...
      target mytarget / measure=dtw_sqrdev;
      target yourtarget / measure=dtw_absdev;
      ...
   run;
A user-defined similarity measure and warping path information function has the following signature:
FUNCTION <FUNCTION-NAME> ( <ARRAY-NAME>[*], <ARRAY-NAME>[*], <ARRAY-NAME>[*], <ARRAY-NAME>[*], <ARRAY-NAME>[*] );
where the first array-name is the target sequence, the second array-name is the input sequence, the third array-name is the returned target sequence indices, the fourth array-name is the returned input sequence indices, and the fifth array-name is the returned path distances. The returned value of the function is the similarity measure. The last three returned arrays are used to compute the path and cost statistics.
The returned sequence indices must represent a valid warping path, that is, integers greater than zero and less than or equal to the sequence length and recorded in ascending order. The returned path distances must be nonnegative numbers.
Note: This procedure is experimental.
Copyright © 2008 by SAS Institute Inc., Cary, NC, USA. All rights reserved.