SAS Text Miner uses the "bag-of-words"
approach to represent documents. That is, each document is represented
by a vector that contains the frequency with which each term occurs
in that document, and word order is ignored. This approach
is very effective for short, paragraph-sized documents, but it can
cause a harmful loss of information with longer documents. You might
want to preprocess long documents in order to isolate the content
that is most useful in your model. For example, if you
are analyzing journal papers, you might find that analyzing only the
abstract gives the best results. Consider using the SAS DATA step
or an alternative programming language such as Perl to extract the
relevant content from long documents.
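
As a rough illustration of this kind of preprocessing, the following is a minimal sketch in Python, used here as one possible alternative programming language. It is not part of SAS Text Miner. The function names extract_abstract and bag_of_words are hypothetical, and the sketch assumes a simple plain-text layout in which the abstract sits under a line that reads "Abstract" and ends at a line that begins with "Introduction" or "Keywords". Real journal papers vary, so the pattern would likely need adjusting.

```python
import re
from collections import Counter


def extract_abstract(text):
    """Return the text under an 'Abstract' heading line.

    Assumption: the abstract ends at a line starting with 'Introduction'
    or 'Keywords', or at the end of the text. If no 'Abstract' heading
    is found, the whole text is returned unchanged (apart from stripping).
    """
    match = re.search(
        r"^\s*abstract\s*$(.*?)(?=^\s*(?:introduction|keywords)\b|\Z)",
        text,
        flags=re.IGNORECASE | re.MULTILINE | re.DOTALL,
    )
    return match.group(1).strip() if match else text.strip()


def bag_of_words(text):
    """Count how often each lowercased term occurs; word order is discarded."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


# Hypothetical sample document, for illustration only.
paper = """Title of Paper

Abstract
Text mining finds patterns in text. Text is noisy.

Introduction
A much longer body that would dilute the term frequencies.
"""

abstract = extract_abstract(paper)
counts = bag_of_words(abstract)
print(abstract)
print(counts["text"])
```

Because only term counts are kept, bag_of_words("dog bites man") and bag_of_words("man bites dog") produce the same result. This shows why word order is lost in the representation, and why trimming a long document to its most relevant part before counting can matter.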