Before you use the clickstream jobs, determine the character set
encoding of the data source being processed. This is an important
consideration because, by default, the clickstream jobs expect the
input raw log file to use the UTF-8 character set encoding.
If the encoding used in the incoming data differs from the
Encoding of the incoming data setting in the Clickstream
Log transformation within a job, the data is decoded incorrectly.
Therefore, you must verify the following before you process a data source:
- The value of the Encoding of the incoming data setting in the
  Clickstream Log transformation matches the encoding of the incoming
  data being processed.
- All of the data records in the incoming raw log file have been
  encoded using the same encoding.
If either of these conditions is not met, the
Decode input data? option in the Clickstream Log transformation must be set to
No. When both conditions are met, the data can be decoded. For example,
the SAS page tag encodes all of the data that it collects in UTF-8,
which matches the default encoding that the clickstream jobs expect.
As a result, when the data source is a SAS page tag log, the data is
decoded correctly.
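
The effect of an encoding mismatch can be illustrated outside of the
Clickstream Log transformation. The following is a minimal Python sketch,
not the transformation's own decoding logic. It assumes that the logged
values are percent-encoded, and the sample value is hypothetical.

```python
from urllib.parse import quote, unquote

# Hypothetical value submitted from a Web page.
original = "café"

# The same value percent-encoded under two different page encodings.
as_utf8 = quote(original, encoding="utf-8")      # caf%C3%A9
as_latin1 = quote(original, encoding="latin-1")  # caf%E9

# Decoding with the matching encoding recovers the original text.
matched = unquote(as_utf8, encoding="utf-8")

# Decoding Latin-1 data as UTF-8 loses the character: the invalid byte
# is replaced with U+FFFD, the Unicode replacement character.
mismatched = unquote(as_latin1, encoding="utf-8", errors="replace")

# Decoding UTF-8 data as Latin-1 produces two wrong characters instead.
garbled = unquote(as_utf8, encoding="latin-1")

print(matched, mismatched, garbled)
```

In both mismatched cases the decoded value no longer equals the value
that the visitor submitted, which is why the setting must match the data.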
In contrast to SAS page tag logs, standard Web logs contain data
records that were encoded using the character encoding specified on
each Web page by the Web site programmers. If the Web site uses a
single encoding consistently across its pages, set
Encoding of the incoming data to that encoding. If the Web site was not developed
with a common encoding across Web pages, set
Decode input data? in the Clickstream Log transformation to
No to maintain data integrity.
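
One way to judge whether a standard Web log uses a common encoding is to
test a sample of its records before choosing the settings. The following
Python sketch is an illustration under stated assumptions, not a feature
of the clickstream jobs: the function name all_records_decode_as and the
sample records are hypothetical, and the records are assumed to be
percent-encoded.

```python
from urllib.parse import unquote_to_bytes

def all_records_decode_as(records, encoding="utf-8"):
    """Return True if every record decodes strictly under `encoding`."""
    for record in records:
        # Convert %XX escapes back to the raw bytes that were logged.
        raw = unquote_to_bytes(record)
        try:
            raw.decode(encoding, errors="strict")
        except UnicodeDecodeError:
            return False
    return True

# Hypothetical samples: the second list mixes UTF-8 and Latin-1 records.
consistent = ["/search?q=caf%C3%A9", "/search?q=na%C3%AFve"]
mixed = ["/search?q=caf%C3%A9", "/search?q=caf%E9"]

print(all_records_decode_as(consistent))
print(all_records_decode_as(mixed))
```

A check like this is meaningful only for encodings that reject invalid
byte sequences, such as UTF-8. Single-byte encodings such as Latin-1
accept every byte value, so a successful decode does not show that the
records were actually written in that encoding.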
For more information about the Clickstream Log decoding options and
their usage, see the Help for the Input pane on the
Options tab of the Clickstream Log transformation.