Managing Non-Human Visitor Detection

Overview of Non-Human Visitors

Spiders, robots, crawlers, pingers, and other computer programs that access a Web site are referred to as non-human visitors (NHVs). A spider (a search engine bot, for example) traverses the Web site, following links to determine the contents of its Web pages. NHVs exhibit behavioral characteristics that make it possible to identify them, such as clicking at a rate faster than humanly possible or pinging at an exact interval.
Activity from NHVs is handled in two locations. The first is the Clickstream Parse transformation, which uses the Filter Spiders by User Agent rule. This rule matches commonly known strings found in the user agent of well-behaved NHVs that identify themselves as NHVs. By default, this rule deletes activity for these NHVs. The purpose of this detection is to eliminate NHV clicks as early as possible.
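The idea behind user-agent filtering can be illustrated with a minimal Python sketch. This is not the product's rule implementation: the marker strings, the function names, and the click record layout (a dictionary with a `user_agent` key) are all assumptions made for illustration.

```python
# Hypothetical marker substrings; the actual strings matched by the
# Filter Spiders by User Agent rule are not listed in this document.
KNOWN_NHV_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_self_identified_nhv(user_agent: str) -> bool:
    """Return True when the user agent contains a known NHV marker."""
    lowered = user_agent.lower()
    return any(marker in lowered for marker in KNOWN_NHV_MARKERS)


def filter_spiders_by_user_agent(clicks):
    """Drop clicks from self-identified NHVs (mirrors the default delete behavior)."""
    return [c for c in clicks if not is_self_identified_nhv(c["user_agent"])]


clicks = [
    {"user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
]
remaining = filter_spiders_by_user_agent(clicks)
```

A filter of this kind only catches NHVs that announce themselves, which is why a second, behavior-based pass is still needed.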
The second location is the Clickstream Sessionize transformation, which uses a proprietary behavioral detection approach. This approach examines the behavior of the visitor within a session and decides whether the behavior is likely to be that of a human or a non-human visitor. This process is known as Behavioral Identification of Non-Human Sessions (BINS) and is configured using the spider-related options on the Clickstream Sessionize transformation. See the Clickstream Sessionize Options tab help for details on how to configure this functionality.
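BINS itself is proprietary and its internals are not described here. As a rough illustration only, the two behavioral signals mentioned in the overview (clicking faster than humanly possible, and pinging at an exact interval) might be checked as follows. The function names and both cutoff values are hypothetical and are not product settings.

```python
from statistics import mean, pstdev


def click_gaps(timestamps):
    """Return the time gaps, in seconds, between consecutive clicks."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def looks_non_human(timestamps, min_avg_gap=1.0, max_gap_jitter=0.05):
    """Flag a session whose clicks are implausibly fast or implausibly regular.

    min_avg_gap and max_gap_jitter are illustrative assumptions.
    """
    gaps = click_gaps(timestamps)
    if len(gaps) < 2:
        return False  # too few clicks to judge the session's behavior
    too_fast = mean(gaps) < min_avg_gap
    too_regular = pstdev(gaps) < max_gap_jitter
    return too_fast or too_regular


pinger = looks_non_human([0, 60, 120, 180])    # exact 60-second interval
rapid = looks_non_human([0, 0.2, 0.4, 0.7])    # several clicks per second
browser = looks_non_human([0, 12, 47, 90])     # irregular, human-paced gaps
```

A session with a single click yields no gaps, so the sketch leaves it unflagged rather than guessing.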

Problem

You have already filtered and removed the NHVs found by the Clickstream Parse transformation using the rule that examines the User Agent string, but you want to analyze the visitor behavior to ensure that none of the remaining sessions were created by NHVs.

Solution

Set the options in the Clickstream Sessionize properties window to detect any NHVs.

Task

Perform the following steps to set the options in the Clickstream Sessionize properties window:
  1. Open the Tuning category on the Options tab in the Clickstream Sessionize Properties window.
  2. Specify a value in the Spider detection threshold, Spider force threshold, and Maximum average time between spider clicks fields. For example, suppose the Web site's administrator determines that no human visitor to the site is likely to perform more than 50 clicks in a session. In that case, you might set the Spider force threshold to 50, which forces the detection of an NHV when the number of clicks in the session reaches 50 or higher.
  3. Select a value in the Spider Action field. This value determines whether the session is isolated, deleted, or left unchanged once the spider is identified.
    Although the Spider Action does not directly affect the detection of NHVs, it does determine what happens to the data for any NHV. The default of ISOLATE is useful because it separates the non-human data and allows you to validate that the detection heuristics are accurate. The DELETE action can be useful once the heuristics are considered accurate and you simply want the non-human data discarded. The final option of NONE means that the non-human sessions are not identified, so they are treated as any other session data.
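The interaction between the Spider force threshold and the Spider Action can be sketched in Python as follows. This is a simplified illustration, not the transformation's actual logic: the function name and the session layout (a mapping of session ID to click count) are assumptions, and the sketch models only the force threshold because the exact semantics of the other two fields are not described in this section.

```python
VALID_ACTIONS = ("ISOLATE", "DELETE", "NONE")


def apply_spider_action(sessions, force_threshold=50, action="ISOLATE"):
    """Split sessions into (kept, isolated) based on click count and action.

    sessions maps a session ID to its click count (hypothetical layout).
    """
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown Spider Action: {action!r}")
    kept, isolated = {}, {}
    for session_id, click_count in sessions.items():
        forced_nhv = click_count >= force_threshold
        if not forced_nhv or action == "NONE":
            kept[session_id] = click_count      # treated as ordinary session data
        elif action == "ISOLATE":
            isolated[session_id] = click_count  # set aside for validation
        # DELETE: the session is discarded and appears in neither result
    return kept, isolated


sessions = {"a": 12, "b": 50, "c": 300}
kept, isolated = apply_spider_action(sessions, force_threshold=50, action="ISOLATE")
```

With ISOLATE, sessions "b" and "c" end up in the isolated set, where their click patterns can be reviewed before you switch to DELETE.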