Best Practices for the Clickstream Parse Transformation

Handling Non-Human Visitors in the Clickstream Parse Transformation

Spiders, robots, crawlers, pingers, and any other computer programs that visit a site are referred to as non-human visitors (NHVs). NHV activity is handled in two ways. The first approach uses the Filter Spiders by User Agent rule in the Clickstream Parse transformation. This rule matches commonly known strings found in the user agent of well-behaved NHVs that identify themselves as NHVs. By default, this rule deletes activity for these NHVs. The purpose of this detection is to eliminate NHV clicks as early in processing as possible.
In the second approach, NHV activity is handled in the Clickstream Sessionize transformation. The Clickstream Sessionize transformation uses a proprietary behavioral detection approach called Behavioral Identification of Non-Human Sessions (BINS). This approach examines the behavior of the visitor within a session. Then, it decides whether the behavior is more likely to be that of a human or a non-human visitor. For more information, see Managing Non-Human Visitor Detection.
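The first approach, matching known strings in the user agent, can be illustrated with a minimal Python sketch. It is not the Filter Spiders by User Agent rule itself: the pattern list, the record layout, and the function names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical substrings that self-identifying NHVs often include in the
# user agent. The real rule's list is not reproduced here.
NHV_PATTERN = re.compile(r"bot|crawler|spider|slurp|pingdom", re.IGNORECASE)


def is_self_identified_nhv(user_agent):
    """Return True when the user-agent string contains a known NHV marker."""
    return bool(NHV_PATTERN.search(user_agent or ""))


def filter_spiders(records):
    """Drop records whose user agent identifies the visitor as an NHV."""
    return [r for r in records if not is_self_identified_nhv(r.get("user_agent"))]


clicks = [
    {"page": "/home", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"page": "/home", "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"},
]
human_clicks = filter_spiders(clicks)
```

A filter of this kind catches only NHVs that announce themselves, which is why a behavioral approach such as BINS is still needed afterward.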

Maintaining the Hold Buffer Size Setting

The value entered in the Hold Buffer Size field in the Input pane on the Options tab in the Clickstream Parse transformation can have a significant effect on the performance of the transformation. When Web servers write raw data to the logs, the records are typically written in chronological order. The hold buffer size represents the amount of this data, measured in seconds of log time, that is held in memory before it is written to the output table.
For example, the default value of 120 causes all records that have a timestamp within 120 seconds of the latest timestamp to be held in memory. With this value, any records whose date-and-time stamp falls outside that 120-second range are written to the output table. This hold buffer usually enables incoming records that are slightly out of chronological order to be put back in order. Thus, a subsequent sort of the data can generally be avoided.
However, the default hold buffer size is not always sufficient. If you find that your incoming data is out of chronological order by more than this 120-second threshold, you can increase the hold buffer size. Keep in mind that a larger hold buffer increases the memory used by the Clickstream Parse transformation because more data is held in the buffer before it is sent to the output table.
If the hold buffer functionality is consistently unable to prevent a sort, it can be switched off with a value of 0. This setting can result in a subsequent sort being required. However, it removes some of the processing overhead that occurs in managing the buffer.
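The hold buffer behavior described above can be sketched in Python. This is a simplified model under stated assumptions, not the transformation's actual implementation: records are (timestamp in seconds, payload) pairs, the buffer is a min-heap, and the name hold_buffer is hypothetical.

```python
import heapq


def hold_buffer(records, hold_seconds=120):
    """Yield records, holding back those within hold_seconds of the latest timestamp.

    Held records are released in timestamp order once they fall outside the
    hold window, which corrects small out-of-order arrivals.
    """
    heap = []
    latest = float("-inf")
    for seq, (ts, payload) in enumerate(records):
        latest = max(latest, ts)
        # seq breaks ties so that payloads are never compared.
        heapq.heappush(heap, (ts, seq, payload))
        while heap and heap[0][0] < latest - hold_seconds:
            ts_out, _, payload_out = heapq.heappop(heap)
            yield (ts_out, payload_out)
    # End of input: flush whatever is still held, oldest first.
    while heap:
        ts_out, _, payload_out = heapq.heappop(heap)
        yield (ts_out, payload_out)


# The record at t=150 arrives 250 seconds late relative to t=400.
incoming = [(0, "a"), (200, "b"), (400, "c"), (150, "d")]

default_order = [ts for ts, _ in hold_buffer(incoming, hold_seconds=120)]
larger_order = [ts for ts, _ in hold_buffer(incoming, hold_seconds=300)]
```

In this model the 120-second window has already released the record at t=200 before the late record at t=150 arrives, so default_order is not chronological and a later sort would be needed. The 300-second window holds enough data to absorb the late record, at the cost of keeping more records in memory.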

Optimizing Sort Using SORTSIZE

The Clickstream Parse transformation attempts to reorder records with the hold buffer size option to avoid a sort. However, a sort might be required in cases where records are severely out of chronological order. In addition, by default, the Clickstream Sessionize transformation performs a final sort of the output with Session ID, Date and Time, and Record ID set as the sort criteria. For either sort, minimizing disk I/O operations improves performance.
In both cases, performance can be improved by setting the SORTSIZE option in SAS. This option can be set on the Precode and Postcode tab, which is found in all transformations. Select the Precode check box and enter an OPTIONS statement in the field, such as options SORTSIZE=2G;. This statement makes up to 2 GB of RAM available to the sort.
If you are running in an SMP or grid environment (such as in a multiple log job), keep in mind that this setting applies to each parallel process. For example, if you are running on a four-processor machine with 16 GB of total RAM and one parallel process per processor, setting SORTSIZE to 2.5 GB consumes up to 10 GB of RAM (2.5 GB x 4 processors). This setting leaves 6 GB for the operating system and any other processes running on the machine.
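The arithmetic in this example can be checked with a short Python sketch. The function names are hypothetical, and the calculation assumes that every parallel process uses its full SORTSIZE allowance at the same time, which is the worst case.

```python
def sort_memory_budget(total_ram_gb, sortsize_gb, parallel_processes):
    """Return (worst-case RAM used by all sorts, RAM left for everything else)."""
    sort_total = sortsize_gb * parallel_processes
    return sort_total, total_ram_gb - sort_total


def max_sortsize_gb(total_ram_gb, reserve_gb, parallel_processes):
    """Largest per-process SORTSIZE that still leaves reserve_gb free."""
    return (total_ram_gb - reserve_gb) / parallel_processes


# The example from the text: 16 GB machine, SORTSIZE of 2.5 GB, 4 processes.
used_gb, remaining_gb = sort_memory_budget(16, 2.5, 4)

# Working backward: reserving 6 GB on the same machine allows 2.5 GB per process.
per_process_gb = max_sortsize_gb(16, 6, 4)
```

Working backward from the memory that you want to reserve is often the easier way to choose a value, because the reserve depends on what else runs on the machine.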

Setting a Visitor ID Value

Whenever possible, select the columns that contain the visitor ID value on the Visitor ID tab of the Clickstream Parse transformation. A good visitor ID value uniquely identifies the activity of one and only one visitor. Thus, the quality and accuracy of the sessionized data is enhanced significantly when a known visitor ID value is provided.
Therefore, you should avoid using the default algorithm based on the client IP address and user-agent string, although this might be the only option available in some scenarios. For information about selecting a visitor ID, see the Help for the Visitor ID tab. Also, see Managing the Visitor ID.
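The difference between a known visitor ID and an ID derived from the client IP address and user-agent string can be illustrated with a Python sketch. The hashing scheme, the function name, and the cookie values are hypothetical. The transformation's actual default algorithm is not documented here beyond the two inputs that it uses.

```python
import hashlib


def visitor_id(client_ip, user_agent, known_id=None):
    """Prefer a known visitor ID; otherwise derive one from IP and user agent."""
    if known_id:
        return "known:" + known_id
    # Fallback (assumed scheme): hash the only two values available.
    digest = hashlib.sha256((client_ip + "|" + user_agent).encode("utf-8")).hexdigest()
    return "derived:" + digest[:16]


# Two different people behind the same proxy, using the same browser build.
ip = "203.0.113.7"
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Without a known ID, both visitors collapse into one derived ID.
fallback_a = visitor_id(ip, ua)
fallback_b = visitor_id(ip, ua)

# With a known ID (for example, a cookie value), they stay distinct.
cookie_a = visitor_id(ip, ua, known_id="cookie-1001")
cookie_b = visitor_id(ip, ua, known_id="cookie-1002")
```

Any ID that is derived only from the IP address and user agent merges visitors who share both values, which is why a known visitor ID improves the accuracy of the sessionized data.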