About the Basic (Multiple) Web Log Template Job

The Basic (Multiple) Web Log Template provides you with a parameterized version of a simple template job that enables you to process multiple clickstream logs from the same server or from multiple servers. It also enables you to reduce processing time through symmetric multi-processing, using either SAS MP Connect or grid computing. Finally, the template manages outputs and resources to avoid contention.

The Multiple Log Template job uses the same Clickstream Log, Clickstream Parse, and Clickstream Sessionize transformations as are used in the single log template job. The multiple clickstream log files are sent through a series of loops that are enclosed in the standard SAS Data Integration Studio Loop and Loop End transformations. In addition, several specialized transformations prepare the data and parameter values for the loops, group them to be sessionized, create detailed output, and generate an output table. The Directory Contents transformation generates a list of raw Web logs to be passed into the first loop. Each iteration of the loop processes one Web log.
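The first loop can be pictured with a minimal Python sketch. This is an illustration only, not the template's implementation: the function names (list_raw_logs, process_one_log, run_first_loop), the "*.log" file pattern, and the use of a thread pool in place of parallel SAS sessions are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def list_raw_logs(directory, pattern="*.log"):
    """Stand-in for Directory Contents: list the raw logs to loop over.

    The "*.log" pattern is an assumption made for this example.
    """
    return sorted(Path(directory).glob(pattern))


def process_one_log(path):
    """Stand-in for one loop iteration; here it only counts records."""
    with path.open(encoding="utf-8", errors="replace") as handle:
        return path.name, sum(1 for _ in handle)


def run_first_loop(directory, max_workers=4):
    """Process every listed log, one log per iteration, in parallel.

    Threads stand in for the parallel SAS sessions of the real job.
    """
    logs = list_raw_logs(directory)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(process_one_log, logs))
```

Calling run_first_loop on a directory returns a mapping from each log file name to the number of records read from it. In the real job, each iteration runs the Clickstream Log and Clickstream Parse transformations rather than a simple count.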

The data is accessed from the raw Web logs by each parallel SAS session running in the first loop. Within the first loop, the Clickstream Log transformation reads a small number of the raw Web log records in order to determine the Web log type. Once the Web log type is determined, the transformation creates a SAS DATA step view that is used to read the raw Web log data. Still within the first loop, the Clickstream Parse transformation accesses the view built by the Clickstream Log transformation, and begins to process each incoming click observation as follows:

All AFTER INPUT rules are applied immediately after an observation is read. Most filtering occurs here, so that unimportant data is deleted as early in the process as possible.
If the observation is not deleted, then the observation is parsed. This includes parsing of data such as the browser, browser version, platform, query parameters, referrer parameters, and cookies.
AFTER PARSE rules are then applied. Some filtering might occur here, if the decision to filter depends upon parsed data. Otherwise, the filtering should be implemented using an AFTER INPUT rule.
Each observation is placed into an appropriate output group. The output group is decided using a grouping algorithm based on the Visitor ID or Client IP. (The algorithm also uses the User Agent when no Visitor ID value is supplied.) This practice ensures that all of the observations for a specific visitor session are stored in the same group. A list of group files created within each session is represented by L in Multiple Log Job Process Flow.
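The order of operations described above (AFTER INPUT rules, then parsing, then AFTER PARSE rules) can be sketched in Python as follows. The rule functions, the record fields (uri, path, query_params), and the simplified parse step are hypothetical and exist only to show why cheap filters belong before parsing. They are not the rules shipped with the template.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Record = dict
# A rule returns the record to keep it, or None to delete the observation.
Rule = Callable[[Record], Optional[Record]]


def drop_static_assets(rec: Record) -> Optional[Record]:
    """Hypothetical AFTER INPUT rule: needs only the raw URI."""
    if rec["uri"].lower().endswith((".gif", ".png", ".css")):
        return None
    return rec


def parse(rec: Record) -> Record:
    """Simplified stand-in for parsing: split out the query parameters."""
    path, _, query = rec["uri"].partition("?")
    params = dict(p.split("=", 1) for p in query.split("&") if "=" in p)
    return {**rec, "path": path, "query_params": params}


def drop_internal_source(rec: Record) -> Optional[Record]:
    """Hypothetical AFTER PARSE rule: depends on a parsed parameter."""
    if rec["query_params"].get("source") == "internal":
        return None
    return rec


def apply_rules(rec: Record, rules: List[Rule]) -> Optional[Record]:
    """Apply rules in order, stopping as soon as one deletes the record."""
    for rule in rules:
        rec = rule(rec)
        if rec is None:
            return None
    return rec


def process(records: Iterable[Record], after_input: List[Rule],
            after_parse: List[Rule]) -> Iterator[Record]:
    for rec in records:
        rec = apply_rules(rec, after_input)
        if rec is None:
            continue  # deleted before the parsing cost is paid
        rec = apply_rules(parse(rec), after_parse)
        if rec is not None:
            yield rec
```

For example, given the three clicks "/logo.png", "/search?source=internal", and "/cart?item=42", only the last one survives: the first is removed before parsing, and the second is removed only after its query parameters are available.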

Note: You can configure the Number of Groups setting to optimize the job flow and support grid processing when identifying sessions. For example, entering the value 5 generates five groups. This setting enables you to execute up to five parallel sessionize loops.
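The grouping algorithm itself is not described here. The following Python sketch shows one plausible approach, assuming a simple hash of the visitor key taken modulo the Number of Groups value. The function name assign_group and the choice of CRC-32 are assumptions for illustration, not the product's actual algorithm.

```python
import zlib


def assign_group(visitor_id, client_ip, user_agent, number_of_groups):
    """Map a click to a group number from 1 to number_of_groups.

    Assumption: the key is the Visitor ID when one is supplied, and
    otherwise the Client IP combined with the User Agent.
    """
    key = visitor_id if visitor_id else f"{client_ip}|{user_agent}"
    # CRC-32 is stable across runs and processes, so the same visitor
    # always lands in the same group regardless of which log it came from.
    return zlib.crc32(key.encode("utf-8")) % number_of_groups + 1
```

Because the group depends only on the visitor key, clicks from the same visitor that appear in different Web logs are still routed to the same group, which is what allows each group to be sessionized independently and in parallel.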

The Clickstream Combine Groups generated transformation reads the group listing files and creates a SAS DATA step view that combines all the individual group files for a particular group. For example, the Group 1 data view accesses all of the group 1 data tables created during processing of the first loop. This transformation also creates a data table that is represented by G in Multiple Log Job Process Flow. This data table contains the list of data views that were created. This list is used to drive the second loop.
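As a rough illustration of this step, the Python sketch below builds one lazy "view" per group from a listing of group tables and returns the list of views that would drive the second loop. The names combine_groups and read_table, and the (group number, table name) shape of the listing, are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def combine_groups(group_listing, read_table):
    """Build one lazy view per group from (group_number, table_name) pairs.

    read_table is a caller-supplied function that returns the rows of one
    table. Like a DATA step view, each view reads nothing until it is used.
    """
    tables_by_group = defaultdict(list)
    for group, table in group_listing:
        tables_by_group[group].append(table)

    def make_view(tables):
        # Calling the view concatenates the rows of every table in the group.
        return lambda: chain.from_iterable(read_table(t) for t in tables)

    views = {g: make_view(ts) for g, ts in sorted(tables_by_group.items())}
    return views, list(views)  # the list plays the role of table G
```

With two logs that each produced a group 1 table, calling views[1]() yields the rows of both tables in listing order, and the returned list contains one entry per group.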

The second loop again takes advantage of symmetric multi-processing to identify visitor sessions and to complete the visitor ID value on every record from the start to the end of each session. This is accomplished using the Clickstream Sessionize transformation.

The completion of the visitor ID ensures that the visitor ID value that is assigned to users after they log on is present on every record of the session. This holds even when users browse the site for a period of time before they log on and after they log off. The visitor ID value is useful for connecting referring sites (purchased advertising, for example) to specific visitors and their final activity on the site (such as completing an online purchase).
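Visitor ID completion can be illustrated with a short Python sketch. It assumes that the clicks of one session have already been identified and ordered, and that each click is a dictionary with an optional visitor_id field. The function name complete_visitor_id and the record layout are hypothetical.

```python
def complete_visitor_id(session):
    """Copy the session's known visitor ID onto every click in the session.

    session is a list of click dictionaries for one visitor session.
    Clicks made before logon or after logoff have no visitor_id of their own.
    """
    known = next((c["visitor_id"] for c in session if c.get("visitor_id")), None)
    if known is None:
        # Anonymous session: nothing to complete, return copies unchanged.
        return [dict(c) for c in session]
    return [{**c, "visitor_id": known} for c in session]
```

In a session where the visitor arrives from an advertisement, logs on, and then makes a purchase, the completed output carries the same visitor ID on the referring click and on the purchase click, which is what makes the two attributable to the same visitor.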

Each parallel session reads observations from one of the group views created by the Clickstream Combine Groups transformation and creates a single output data table in which sessions have been identified and visitor IDs have been completed.

After the second loop finishes, the Clickstream Create Detail transformation combines each output from the second loop to create the final composite detail data table.

The following figure illustrates the process flow for multiple clickstream log jobs.

Multiple Log Job Process Flow

Sections of this figure are included in the descriptions of each stage of the template's processing.