Specifying Options for the Sessionize Transformation

Input Options

Use the Options tab to identify the input columns (typically from Clickstream Parse) whose values are used by the Clickstream Sessionize transformation. In the Input window, you set the options that assign input columns to the roles that the Clickstream Sessionize transformation requires to operate properly. For example, the Visitor ID column uniquely identifies the visitor. If no Visitor ID is available, an algorithm based on values from the Client IP column and the User agent column is used to create a new Visitor ID. This combined value is a last resort, because analytics performed without a reliable visitor ID are far less valuable. Column roles for record ID, date, and timestamp are set here as well.
Additional input options are available to support propagation of values both forward and completely through the visitor session. For additional information about input options, see the Help for the Clickstream Sessionize Options tab.
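The Visitor ID fallback described above can be illustrated with a minimal Python sketch. The source does not document the actual algorithm, so the hashing approach, the function name resolve_visitor_id, and the "derived-" prefix are all assumptions made only for illustration.

```python
import hashlib


def resolve_visitor_id(visitor_id, client_ip, user_agent):
    """Return the supplied visitor ID, or a derived fallback when none exists.

    Hypothetical helper: the real transformation's algorithm is not
    documented here; hashing the two columns is an assumption.
    """
    if visitor_id:
        return visitor_id
    # Combine the Client IP and User agent values into one key and hash it.
    key = f"{client_ip}|{user_agent}".encode("utf-8")
    return "derived-" + hashlib.sha1(key).hexdigest()[:16]
```

Two visitors behind the same proxy with the same browser would receive the same derived ID in this sketch, which shows why a combined value of this kind is treated as a last resort.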

Tables Options

The Tables options window is used to set the characteristics of the output table, specify new or existing columns, and specify libraries to be used.
The most commonly used table options include the following:
  • The Sort output table, Sort sequence, and Sort options enable management of how the output data is ordered.
  • Additional output library stores output data, including records of non-human interactions such as those done by spiders.
  • Permanent library path is used when the session is not closed because that user's session carried over into the next day's run, which was captured in a separate Web log. These sessions are marked with a 2 in the Exit point column. (See the Exit point column description later in this list.)
  • Delete PERMLIB tables deletes PERMLIB tables when the job is initially set up. PERMLIB is used to store records from visitor sessions that have not yet closed. These records represent the activity of visitors that was in progress when the Web log file was closed and a new file was started. Normally, you should set the option to Yes while you develop and test a job, because you run the same data multiple times. You should set the option to No when your job functionality has stabilized, because batches of Web log data must be processed in chronological order.
  • Session ID column creates a new column that identifies a particular user's session.
  • Entry point column is a binary field that represents whether this click is where the user entered the Web site.
  • Exit point column is a field that represents whether this click is where the user exited the Web site. The values can be 0 (Not an exit point), 1 (Is an exit point), and 2 (Do not know yet, pending the next 30 minutes of data to determine).
  • Eyeball time column is the amount of time the user spent on a page before continuing to the next page.
  • Session Closed column indicates whether the session is completed.
For additional information about table options, see the Help for the Clickstream Sessionize Options tab.
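As a rough illustration of how the entry point, exit point, and eyeball time columns relate to one another, the following Python sketch annotates the clicks of a single session. The function name annotate_session, the dictionary keys, the log_end parameter, and the use of None for the last click's eyeball time are assumptions; the source does not state how the transformation computes these values internally.

```python
from datetime import datetime, timedelta


def annotate_session(click_times, log_end, timeout=timedelta(minutes=30)):
    """Derive per-click columns for one session (click_times sorted ascending).

    Hypothetical sketch: log_end is the time at which the Web log was closed.
    """
    rows = []
    last = len(click_times) - 1
    for i, t in enumerate(click_times):
        if i < last:
            exit_point = 0  # a later click exists, so this is not an exit
            eyeball = (click_times[i + 1] - t).total_seconds()
        else:
            # Last click seen so far: it is a confirmed exit (1) only if the
            # log covers a full timeout period after it; otherwise pending (2).
            exit_point = 1 if log_end - t >= timeout else 2
            eyeball = None  # assumption: no following click to measure against
        rows.append({
            "entry_point": 1 if i == 0 else 0,
            "exit_point": exit_point,
            "eyeball_seconds": eyeball,
        })
    return rows
```

In this sketch, a session whose last click falls less than 30 minutes before the end of the log receives an exit point value of 2, which corresponds to the sessions that are carried over through the permanent library.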

Tuning Options

The Tuning options window is used to determine session, group, and spider characteristics and how to handle them. The most commonly used tuning options include the following:
  • Session timeout determines the amount of time of inactivity until the session is closed. By default this is set to 30 minutes, which is an industry standard. However, you can change this value if you determine that there is a more appropriate value. If there is no activity for a particular visitor for 30 minutes or more, then the visitor's session is determined to have ended. If the time-out value has expired and activity restarts, a new session starts and is given a new Session ID.
  • Spider detection threshold, Spider force threshold, and Maximum average time between spider clicks are used together to identify non-human visitor (NHV) sessions, such as those generated by spiders.
  • Spider detection threshold controls the minimum number of clicks that must be in a session before NHV detection is performed on that session.
  • Spider force threshold controls the number of clicks in a session after which classification of the session as an NHV session is forced.
  • Maximum average time between spider clicks controls the maximum average spacing between click activity in a session under which the session is classified as an NHV session.
  • Spider Action is used to determine how to handle spider sessions once they are identified.
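The Session timeout rule can be sketched in a few lines of Python. This is a minimal illustration for one visitor's clicks, assuming the click times are already sorted; the function name assign_sessions and the 1-based session numbers are hypothetical and stand in for the Session ID values that the transformation generates.

```python
from datetime import datetime, timedelta


def assign_sessions(click_times, timeout=timedelta(minutes=30)):
    """Return a 1-based session number for each click of one visitor.

    click_times must be sorted ascending. A gap of `timeout` or more
    between consecutive clicks starts a new session.
    """
    session_numbers = []
    current = 0
    previous = None
    for t in click_times:
        if previous is None or t - previous >= timeout:
            current += 1  # first click, or inactivity reached the timeout
        session_numbers.append(current)
        previous = t
    return session_numbers
```

For clicks at 10:00, 10:10, 10:40, 10:45, and 11:30, the sketch returns [1, 1, 2, 2, 3], because the 30-minute gap before 10:40 and the 45-minute gap before 11:30 each start a new session.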
As with any tuning option, you should experiment with the settings to achieve the desired results for your data. The combination of these options determines the number of spiders detected. The basic effects of changing each setting are as follows:
  • Raising the Maximum average time between spider clicks detects more spiders.
  • Lowering the Spider detection threshold detects more spiders.
  • Raising the Spider detection threshold detects fewer spiders.
  • Lowering the Spider force threshold detects more spiders.
  • Decreasing the Session timeout value results in more sessions.
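One way to see why the spider settings behave this way is a small Python sketch of the three checks. The function name is_spider_session, the order of the checks, and the exact boundary comparisons are assumptions; the source describes the role of each option but not the precise implementation.

```python
def is_spider_session(click_seconds, detection_threshold, force_threshold,
                      max_avg_gap):
    """Classify one session as NHV (spider) or not.

    click_seconds: sorted click times in seconds. All names and boundary
    comparisons here are hypothetical illustrations of the tuning options.
    """
    n = len(click_seconds)
    if n > force_threshold:
        return True  # enough clicks to force NHV classification
    if n < detection_threshold or n < 2:
        return False  # too few clicks for NHV detection to run
    # Average spacing between consecutive clicks in the session.
    avg_gap = (click_seconds[-1] - click_seconds[0]) / (n - 1)
    return avg_gap < max_avg_gap
```

Under these assumptions, raising max_avg_gap or lowering either threshold lets more sessions reach a True result, which matches the effects listed above.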
For additional information about any of these or other Sessionize options, see the Help for the Clickstream Sessionize Options tab.