The Basic (Multiple) Web Log Template provides you with a parameterized
version of a simple template job that enables you to process multiple
clickstream logs from the same or multiple servers. It also enables
you to optimize processing time through the use of symmetric multi-processing
using SAS MP Connect or grid computing. Finally, the template manages
outputs and resources to avoid contention.
The Multiple Log Template job uses the same Clickstream Log, Clickstream Parse,
and Clickstream Sessionize transformations as are used in the single
log template job. The multiple clickstream log files are sent through
a series of loops that are enclosed in the standard SAS Data Integration
Studio Loop and Loop End transformations. In addition, several specialized
transformations prepare the data and parameter values for the loops,
group them to be sessionized, create detailed output, and generate
an output table. The Directory Contents transformation generates a
list of raw Web logs to be passed into the first loop. Each iteration
of the loop processes one Web log.
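The listing-then-loop pattern described above can be illustrated outside SAS with a short Python sketch. It is not the template's implementation: the function names, the `*.log` file pattern, and the use of a thread pool in place of SAS MP Connect or grid sessions are all assumptions made for illustration, and the per-log work is reduced to counting records.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def list_web_logs(log_dir, pattern="*.log"):
    """Return the raw Web log files in log_dir (hypothetical stand-in
    for the list that the Directory Contents transformation generates)."""
    return sorted(p for p in Path(log_dir).glob(pattern) if p.is_file())


def process_log(path):
    """Placeholder for one loop iteration: here it only counts records."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return path.name, sum(1 for _ in fh)


def run_first_loop(log_dir, max_workers=4):
    """Process each listed log in its own worker, one log per iteration."""
    logs = list_web_logs(log_dir)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(process_log, logs))
```

Each worker receives exactly one file from the listing, which mirrors the rule that each iteration of the loop processes one Web log.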
Each parallel SAS session that runs in the first loop accesses the data
from the raw Web logs. Within the first loop, the Clickstream Log transformation
reads a small number of the raw Web log records in order to determine
the Web log type. Once the Web log type is determined, the transformation
creates a SAS DATA step view that is used to read the raw Web log
data. Still within the first loop, the Clickstream Parse transformation
accesses the view built by the Clickstream Log transformation, and
begins to process each incoming click observation as follows:
- All AFTER INPUT rules are applied after an observation is initially read. Most
  filtering occurs here, where unimportant data can be deleted very
  early in the process.
- If the observation is not deleted, then the observation is parsed. This includes
  parsing of data such as the browser, browser version, platform, query
  parameters, referrer parameters, and cookies.
- AFTER PARSE rules are then applied. Some filtering might occur here, if
  the decision to filter depends upon parsed data. Otherwise, the filtering
  should be implemented using an AFTER INPUT rule.
- Each observation is placed into an appropriate output group. The output group is decided
  using a grouping algorithm based on the Visitor ID or Client IP. (The
  algorithm also uses the User Agent when no Visitor ID value is supplied.)
  This practice ensures that all of the observations for a specific
  visitor session are stored in the same group. A list of group files
  created within each session is represented by L in
  Multiple Log Job Process Flow.
Note: You can configure the Number of Groups setting to optimize
the job flow and support grid processing when identifying sessions.
For example, entering the value 5 generates five groups. This setting
enables you to execute up to five parallel sessionize loops.
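The grouping step can be pictured with a minimal Python sketch. The documentation does not describe the actual algorithm, so everything here is an assumption: the function name `assign_group`, the CRC-32 hash, and the way the key is built (Visitor ID when present, otherwise Client IP combined with User Agent). The sketch also does not try to handle a visitor whose ID appears only partway through a session.

```python
import zlib


def assign_group(visitor_id, client_ip, user_agent, number_of_groups=5):
    """Map an observation to a group number from 1 to number_of_groups.

    Hypothetical illustration: the key and the hash are assumptions,
    not the algorithm used by the Clickstream Parse transformation.
    """
    if number_of_groups < 1:
        raise ValueError("number_of_groups must be at least 1")
    if visitor_id:
        key = f"id:{visitor_id}"
    else:
        # No Visitor ID supplied: fall back to Client IP plus User Agent.
        key = f"ip:{client_ip}|ua:{user_agent}"
    # CRC-32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % number_of_groups + 1
```

Because the group depends only on the key, observations that share a key land in the same group regardless of which parallel session parsed them. That property is what allows each group to be sessionized independently.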
The Clickstream Combine Groups generated transformation
reads the group listing files and creates a SAS DATA step view that
combines all the individual group files for a particular group. For
example, the Group 1 data view accesses all of the group 1 data tables
created during processing of the first loop. This transformation also
creates a data table that is represented by G in
Multiple Log Job Process Flow. This data
table contains the list of data views that were created. This list
is used to drive the second loop.
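As a rough illustration of the combine step, the following Python sketch merges the per-session group listings into one list of files per group and returns the list of groups that would drive the second loop. The data shapes and the name `combine_groups` are assumptions. The real transformation builds SAS DATA step views rather than Python lists.

```python
from collections import defaultdict


def combine_groups(group_listings):
    """Merge per-session listings of (group_number, file_name) pairs.

    Returns (combined, driver): combined maps each group number to all
    of its files across sessions; driver is the sorted list of group
    numbers, one entry per iteration of the second loop.
    """
    combined = defaultdict(list)
    for listing in group_listings:
        for group_number, file_name in listing:
            combined[group_number].append(file_name)
    return dict(combined), sorted(combined)
```

For example, if two first-loop sessions each wrote a group 1 file, the entry for group 1 holds both files, in the same way that the Group 1 data view reads every group 1 table.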
The second loop again takes advantage of symmetric multi-processing to identify
visitor sessions and to complete the visitor ID value from the start
to the end of those sessions. This is accomplished using the Clickstream
Sessionize transformation.
Completing the visitor ID ensures that the value assigned
to users after they log on is present on every record of the session.
This persistence holds even when the users browse the site for a period
of time before logging on and after logging off. The visitor ID value
is useful for connecting referring sites (purchased advertising, for
example) to specific visitors and their final activity on the site
(such as completing an online purchase).
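Visitor ID completion can be sketched in a few lines of Python. This assumes that sessions have already been identified and that each record is a dictionary with hypothetical `session_id` and `visitor_id` keys. It is an illustration of the idea only, not the logic of the Clickstream Sessionize transformation.

```python
def complete_visitor_ids(records):
    """Copy the first visitor ID seen in a session onto every record of
    that session, including records from before logon and after logoff."""
    first_id = {}
    for rec in records:
        vid = rec.get("visitor_id")
        if vid and rec["session_id"] not in first_id:
            first_id[rec["session_id"]] = vid
    # Sessions with no visitor ID at all are left unchanged.
    return [
        {**rec, "visitor_id": first_id.get(rec["session_id"], rec.get("visitor_id"))}
        for rec in records
    ]
```

With the ID present on every record, a referrer captured on the first click of a session can be tied to the same visitor as a purchase recorded on a later click.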
Each parallel session reads observations from one of the group views created by
the Clickstream Combine Groups transformation and creates a single
output data table in which sessions have been identified and visitor
IDs have been completed.
After the second loop finishes, the Clickstream Create Detail transformation
combines each output from the second loop to create the final composite
detail data table.
Sections of this figure are included in the descriptions of each stage of the
template's processing.