Best Practices for Clickstream Jobs

Overview

The following best practices apply to all tutorial and template clickstream jobs.

Order of Log Processing

Process all logs in chronological order. The Clickstream Sessionize functionality requires this ordering in order to avoid prematurely timing out visitor sessions.
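If your raw logs carry a date stamp in their file names (for example, access-YYYYMMDD.log), sorting the file names is usually enough to establish chronological order before processing. The following is a minimal sketch using SAS external file functions; the directory path and naming convention are assumptions, not requirements of the clickstream jobs:

  /* Build a data set of raw log file names, then sort it so that  */
  /* date-stamped names fall into chronological order.             */
  filename logdir '/data/weblogs';
  data lognames;
     length fname $256;
     did = dopen('logdir');
     do i = 1 to dnum(did);
        fname = dread(did, i);
        output;
     end;
     rc = dclose(did);
     keep fname;
  run;
  proc sort data=lognames;
     by fname;
  run;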

Using a Unicode SAS Application Server

SAS Unicode provides an environment for multiple language support, which includes support for the following tasks:
  • handling different character encodings
  • providing messages according to the locale and language of the user
  • enabling a multilingual user interface
A SAS Unicode server is a particularly good option for Web data because most Web sites have a global presence and can therefore receive request parameters (such as search terms) in multiple languages.
To use a SAS Unicode server successfully, you must ensure that the SAS servers and all sessions that they invoke are SAS Unicode server sessions. There are multiple ways to set up a SAS Unicode server environment; for the method that is best for your environment, see http://support.sas.com/resources/papers/92unicodesrvr.pdf.
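As a quick sanity check, you can confirm that a given SAS session is actually running under UTF-8 before you process multilingual data. A minimal sketch using standard SAS system options:

  /* Write the session encoding to the log. A SAS Unicode server   */
  /* session typically reports an encoding of utf-8.               */
  %put NOTE: Session encoding is %sysfunc(getoption(ENCODING));
  proc options option=encoding;
  run;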

Data Source Encoding Considerations

You should understand the character set encoding of the data source before you process it with the clickstream jobs. This is important because, by default, the clickstream jobs expect a UTF-8 character set encoding in the input raw log file.
If the actual encoding of the incoming data differs from the Encoding of the incoming data setting in the Clickstream Log transformation within a job, the data is decoded incorrectly.
Therefore, you must verify the following before you process a data source:
  • the value of the Encoding of the incoming data setting in the Clickstream Log transformation matches the encoding of the incoming data being processed
  • all of the data records in the incoming raw log file were encoded with one consistent encoding
If either of these conditions is not met, then the Decode input data? option in the Clickstream Log transformation must be set to No. For example, the SAS page tag encodes all of the data that it collects in UTF-8, so both conditions are met when the data source is a SAS page tag log, and the data is decoded correctly with the default settings.
On the other hand, standard Web logs contain data records that were encoded using the character encoding specified on each Web page by the Web site programmers. If the Web site follows a common standard encoding, the value of Encoding of the incoming data should be set to that encoding. If the Web site was not developed with a common encoding across Web pages, the Decode input data? setting in the Clickstream Log transformation should be set to No to maintain data integrity.
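Outside of the Clickstream Log transformation, you can also spot-check how a raw log decodes under a given encoding from any SAS session. The following is a minimal sketch; the file path and encoding value are examples that you should adjust for your site:

  /* Read a few records of the raw log under the expected encoding */
  /* and echo them to the log for visual inspection.               */
  data _null_;
     infile '/data/weblogs/access.log' encoding='utf-8' obs=5;
     input;
     put _infile_;
  run;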
For more detail on the Clickstream Log decoding options and their usage, see the Help for the Input pane in the Options tab of the Clickstream Log transformation.

Backing Up Output Tables

As in any production environment, it is good practice to back up your data.
By default, each execution of a SAS Data Integration Studio job overwrites the output tables that were created in the previous execution. If you need to keep that history, retain the output tables from each run by backing them up.
Note: A clickstream job can produce large output tables. Make sure you monitor the disk space that is occupied by backups of these tables.
The following list describes the main output tables in a clickstream job and explains how to preserve each of them after the job is executed. A sketch of the backup step itself follows the list.
  • Data output table from a Sessionize transformation: Back up the data output table for each Sessionize transformation. To identify the library and table to back up, display the properties window for the Sessionize output table, click the Physical Storage tab, and note the names of the library and table. These data tables are typically named WEBLOG_DETAIL.
  • Temporary work tables for parameters and rules that are output from the Parse transformation: Redirect these temporary work tables to a permanent library, and then back up that library. To redirect the tables, display the properties window for Clickstream Parse, click the Options tab, and specify an Additional output library in the Tables section.
  • Temporary work tables for spiders and sessions that are output from the Sessionize transformation: Redirect these temporary work tables to a permanent library, and then back up that library. To redirect the tables, display the properties window for Clickstream Sessionize, click the Options tab, and specify an Additional output library in the Tables section.
  • Permanent Library (Permlib) tables that allow open sessions to be continued into the next execution of the Clickstream Sessionize transformation: See Managing Permanent Library Tables for more information about the usage of these tables.
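The backups themselves can be taken with standard SAS procedures. The following is a minimal sketch, assuming hypothetical librefs and paths that you would replace with the locations noted above:

  /* Assign the library to preserve and a backup location.         */
  libname srclib '/data/clickstream/output';
  libname bkuplib '/data/clickstream/backup';

  /* Copy every data set from the source library to the backup.    */
  proc copy in=srclib out=bkuplib memtype=data;
  run;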

Resetting the CLICKRC Macro Variable

If a warning or error occurs during the execution of a Clickstream transformation, then the return code macro variable CLICKRC might be set to a non-zero value for that transformation. This prevents cascading failures in the rest of the job. To reset the CLICKRC value, do one of the following:
  • Close and reopen the job. This creates a new SAS session on the application server.
  • Open the properties window for the job, click the Precode and Postcode tab, and enter the following code in the Precode window: %LET CLICKRC=0; (a more defensive variant of this precode is sketched below).
Note: The CLICKRC macro variable is reset by default for all of the template jobs.
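A slightly more defensive variant of that precode is sketched below. It logs the prior value before clearing it and guards against referencing CLICKRC before the variable exists. The macro name reset_clickrc is only an example:

  %macro reset_clickrc;
     /* Log the prior value only if the variable already exists.   */
     %if %symexist(CLICKRC) %then %put NOTE: CLICKRC was &CLICKRC;
     %global CLICKRC;
     %let CLICKRC=0;
  %mend reset_clickrc;
  %reset_clickrc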

Managing Permanent Library Tables

The Clickstream Sessionize transformation can retain data from sessions that were still open when the last record of a log file was processed. These records are then included when the next log file is processed. Clickstream Sessionize stores these records in the location specified by its Permanent library path option.
This behavior is desirable for production use. However, when you initially modify a clickstream job, it is a best practice to process the same log file multiple times until the job produces the desired output. In that case, the retained records would be included every time the job runs, which would result in duplicate records. To avoid this duplication, the Clickstream Sessionize Delete PERMLIB tables option is set to Yes by default.
When your clickstream job produces the desired output from a single log file and you are ready to deploy the job for production use, ensure that this option is set to No. This setting retains unclosed sessions between the processing of each subsequent log file.
The tables in the Permanent library path location are the only tables generated by the Clickstream transformations that are used across ETL runs. Therefore, they require special consideration when a job must be rerun. Ensure that you have a backup of the tables in the Permanent library location before you run a job. If a failure occurs in the Clickstream Sessionize transformation, we recommend that you restore the tables in the permanent library location to their original state before you rerun the job.
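For example, the before-run backup and after-failure restore might look like the following sketch. The librefs and paths are hypothetical; point them at the location specified by the Permanent library path option:

  /* Before the run: back up the permanent library tables.         */
  libname permlib '/data/clickstream/permlib';
  libname plbkup '/data/clickstream/permlib_bak';
  proc copy in=permlib out=plbkup memtype=data;
  run;

  /* After a Clickstream Sessionize failure: restore the original  */
  /* tables before you rerun the job.                              */
  proc copy in=plbkup out=permlib memtype=data;
  run;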