Usage Note 61523: Model-fit statistics seem to be incorrect when an HP Data Partition node is used

In SAS® Enterprise Miner™, the model-fit statistics for HP modeling nodes might be computed twice: once on sampled data and once on distributed data. This dual computation occurs when all of these conditions are true:

  • Your flow uses an HP Data Partition node.
  • You are using a distributed environment (grid mode).
  • The data is distributed.

The HP Data Partition node partitions the sample and the distributed data independently of each other. As a result, the training and validation portions of the sample might not be representative of the corresponding portions of the distributed data. The model-fit statistics might therefore seem to be incorrect when you compare results from non-HP modeling nodes with results from HP modeling nodes. However, the differences arise from the different partitions.

Click the Hot Fix tab in this note to access the hot fix for this issue.

After you apply the hot fix, new functionality is available.  You can specify the following statement in the Enterprise Miner "project start code", or in an autoexec.sas file:

%let hpdm_partition_resample=Y;

After you specify that statement, when you add and run an HP Data Partition node, the node samples from the distributed-data partition (instead of partitioning the sample separately, as described in the Background section). Note: Because HP sampling is not deterministic, the sample is not guaranteed to be the same size as a non-HP sample.
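
The effect of sampling from an already-partitioned table can be sketched with a small simulation. This is plain Python for illustration, not Enterprise Miner code. The table size, the 10% sampling rate, the 70/30 split, the seeds, and the function names are all assumptions. The note does not say how the HP node draws its sample; the sketch assumes each row is selected independently, which is one common reason a sample has no fixed size.

```python
import random

N_ROWS = 100_000        # assumed size of the distributed table
SAMPLE_RATE = 0.10      # assumed sampling rate
TRAIN_FRACTION = 0.70   # assumed training share; the rest is validation

def partition_table(n_rows, train_fraction, seed):
    """Label every row of the full table 'train' or 'valid' at random."""
    rng = random.Random(seed)
    return ["train" if rng.random() < train_fraction else "valid"
            for _ in range(n_rows)]

def resample_from_partition(partition, rate, seed):
    """Select each row independently with probability `rate`.

    A selected row keeps the label it already has in the full table,
    so the sample can never disagree with the table's partition.
    """
    rng = random.Random(seed)
    return {row_id: label
            for row_id, label in enumerate(partition)
            if rng.random() < rate}

table_partition = partition_table(N_ROWS, TRAIN_FRACTION, seed=3)
sample_a = resample_from_partition(table_partition, SAMPLE_RATE, seed=10)
sample_b = resample_from_partition(table_partition, SAMPLE_RATE, seed=11)

# Every sampled row carries the same label as in the full table.
consistent = all(table_partition[row_id] == label
                 for row_id, label in sample_a.items())
print("labels consistent:", consistent)
print("sample sizes:", len(sample_a), len(sample_b))
```

Under these assumptions, the labels always agree with the full table, while the sample size only hovers around 10,000 and changes with the seed.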

Background

When the data is in a distributed table, a sample is created when the data source is created. If the target or targets are class variables, then the sample is stratified. This sample is used by all the non-HP nodes. If your flow uses a Data Partition node or an HP Data Partition node, then that sample is partitioned. The non-HP modeling nodes use that partitioned sample for training and assessment. The non-HP nodes use the sample so that Enterprise Miner does not need to download large volumes of data to the SAS client.
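
As a rough illustration of what a stratified sample preserves, the sketch below draws the same fraction from every level of a class target. It is plain Python, not Enterprise Miner code. The function name `stratified_sample`, the example table, and the proportional allocation are assumptions; the note does not describe the exact stratification scheme that Enterprise Miner uses.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, target_of, rate, seed):
    """Draw the same fraction of rows from every target level."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for row in rows:
        by_level[target_of(row)].append(row)
    sample = []
    for members in by_level.values():
        k = max(1, round(len(members) * rate))  # keep at least one row per level
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical table: 10,000 rows in which 5% have the target level "yes".
rows = [(i, "yes" if i % 20 == 0 else "no") for i in range(10_000)]

sample = stratified_sample(rows, target_of=lambda r: r[1], rate=0.10, seed=1)
counts = Counter(label for _, label in sample)
print(counts["yes"], counts["no"])
```

With this allocation, the rare level keeps its 5% share in the sample (50 of 1,000 rows) instead of drifting with random chance.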


The HP nodes use the entire distributed table, not the sample. As a result, HP modeling nodes train on the entire (partitioned) distributed data, and the assessment results are produced on that data.

Enterprise Miner enables you to compare non-HP modeling nodes with HP modeling nodes. To enable that comparison, Enterprise Miner scores the sample and computes a separate set of assessments that are based on the sample. However, the partitioned sample might not closely represent the partitioned distributed table. Some training observations in the partitioned sample might be in the validation portion of the distributed table, and vice versa. The training set is not necessarily a sample of the training observations in the distributed table (and the validation set is not necessarily a sample of the validation observations).
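
A small simulation can make this mismatch concrete. The sketch below is plain Python, not Enterprise Miner code. The table size, sampling rate, 70/30 split, seeds, and function name are assumptions chosen for illustration. It partitions a sample and the full table independently, then counts how many of the sample's training rows fall in the validation portion of the full table.

```python
import random

N_ROWS = 100_000        # assumed size of the distributed table
SAMPLE_RATE = 0.10      # assumed sampling rate for the client-side sample
TRAIN_FRACTION = 0.70   # assumed training share; the rest is validation

def assign_partitions(row_ids, train_fraction, seed):
    """Label each row 'train' or 'valid' using its own random stream."""
    rng = random.Random(seed)
    return {row_id: ("train" if rng.random() < train_fraction else "valid")
            for row_id in row_ids}

all_rows = range(N_ROWS)

# The sample is drawn once, before any partitioning takes place.
sample_rows = random.Random(1).sample(all_rows, int(N_ROWS * SAMPLE_RATE))

# Independent partitioning: unrelated random streams for sample and table.
sample_partition = assign_partitions(sample_rows, TRAIN_FRACTION, seed=2)
table_partition = assign_partitions(all_rows, TRAIN_FRACTION, seed=3)

sample_train = [r for r in sample_rows if sample_partition[r] == "train"]
misplaced = [r for r in sample_train if table_partition[r] == "valid"]
mismatch_rate = len(misplaced) / len(sample_train)

print(f"{mismatch_rate:.1%} of sample training rows are "
      "validation rows in the full table")
```

Because the two assignments are independent, the mismatch rate is expected to be close to the validation share, roughly 30% under these assumptions.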

The difference is usually not noticeable when you are comparing different models.  However, it might be noticeable when you are comparing (for the same model) the assessment results based on the sample to the assessment results based on the entire table. 



Operating System and Release Information

Product Family  Product               System                                       Product Release   SAS Release
                                                                                   Reported  Fixed*  Reported   Fixed*
SAS System      SAS Enterprise Miner  Microsoft® Windows® for x64                  12.3              9.4 TS1M0
                                      Microsoft Windows 8 Enterprise x64           12.3              9.4 TS1M0
                                      Microsoft Windows 8 Pro x64                  12.3              9.4 TS1M0
                                      Microsoft Windows 8.1 Enterprise 32-bit      12.3              9.4 TS1M0
                                      Microsoft Windows 8.1 Enterprise x64         12.3              9.4 TS1M0
                                      Microsoft Windows 8.1 Pro 32-bit             12.3              9.4 TS1M0
                                      Microsoft Windows 8.1 Pro x64                12.3              9.4 TS1M0
                                      Microsoft Windows 10                         12.3              9.4 TS1M0
                                      Microsoft Windows Server 2008 R2             12.3              9.4 TS1M0
                                      Microsoft Windows Server 2008 for x64        12.3              9.4 TS1M0
                                      Microsoft Windows Server 2012 Datacenter     12.3              9.4 TS1M0
                                      Microsoft Windows Server 2012 R2 Datacenter  12.3              9.4 TS1M0
                                      Microsoft Windows Server 2012 R2 Std         12.3              9.4 TS1M0
                                      Microsoft Windows Server 2012 Std            12.3              9.4 TS1M0
                                      Windows 7 Enterprise x64                     12.3              9.4 TS1M0
                                      Windows 7 Professional x64                   12.3              9.4 TS1M0
                                      64-bit Enabled AIX                           12.3              9.4 TS1M0
                                      64-bit Enabled Solaris                       12.3              9.4 TS1M0
                                      HP-UX IPF                                    12.3              9.4 TS1M0
                                      Linux for x64                                12.3              9.4 TS1M0
                                      Solaris for x64                              12.3              9.4 TS1M0
* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.