Managing Columns

Problem

Your process flows are running slowly, and you suspect that some of the columns in your source tables are either poorly managed or superfluous.

Solution

You can perform the following tasks on columns to improve the performance of process flows:

Tasks

Drop Unneeded Columns

As soon as the data comes in from a source, consider dropping any columns that are not required for subsequent transformations in the flow. Dropping columns and performing aggregations early in the process flow, rather than later, prevents extraneous detail data from being carried through every transformation in the flow. Work to create a structure that matches the ultimate target table structure as closely as possible early in the process flow, so that you avoid carrying extra data through the rest of the flow.
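In the SAS code that such a flow generates, dropping columns early typically amounts to a KEEP= (or DROP=) data set option on the very first read of the source. The following sketch uses hypothetical library references and column names; it only illustrates the pattern:

   /* Read only the columns that later transformations need;            */
   /* everything else is dropped at the first step instead of being     */
   /* carried through the rest of the flow.                             */
   data work.orders_slim;
      set srclib.orders (keep=order_id customer_id order_date amount);
   run;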
To drop columns in the output table for a SAS Data Integration Studio transformation, click the Mappings tab and remove the extra columns from the Target table area on the tab. Use derived mappings to create expressions that map several columns together. You can then build your own transformation output table columns to match your ultimate target table and map them accordingly.
Finally, you can control column mapping and propagation at a job level, at a transformation level, or even at a column level. Column propagation is the ability to automatically propagate columns through the intermediate tables in a process flow to the target table. If you do not need to map or propagate some of the columns in a flow, use one of the following options:
  • Automatically map columns and Automatically propagate columns options at Tools > Options > Job Editor (for new jobs)
  • Map Columns and Propagate Columns in the pop-up menu for a job or transformation (for selected jobs and transformations)
  • Map all columns, Map selected columns, Propagate from sources to targets, Propagate from targets to sources, and Propagate columns on the Mappings tab for a job or transformation (for selected jobs and transformations)
For information about mapping columns, see Maintaining Column Mappings. For information about column propagation, see Managing the Scope of Column Changes in Jobs.

Avoid Adding Unneeded Columns

As data is passed from step to step in a process flow, columns can be added or modified. For example, column names, lengths, or formats might be added or changed. In SAS Data Integration Studio, these modifications, which are made on the Mappings tab in the details pane of the Job Editor window or on the Mappings tab of the transformation, often result in the generation of an intermediate SQL view step. In many situations, that intermediate step adds processing time. In turn, these changes to columns can be propagated throughout the job. Try to avoid generating more of these steps than is necessary.
You should rework your flow so that column modifications and additions that are scattered across many transformations are consolidated within fewer transformations. Avoid using unnecessary aliases; if the mapping between columns is one-to-one, keep the same column names. Avoid multiple mappings on the same column, such as converting a column from a numeric to a character value in one transformation and then converting it back from a character to a numeric value in another transformation. For aggregation steps, rename any columns within those transformations, rather than in subsequent transformations.
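The following sketch, which uses hypothetical table and column names, shows such a conversion performed once in a single step, rather than mapping the same column back and forth between character and numeric values in separate transformations:

   /* Convert once and keep one consistent column name downstream,      */
   /* instead of converting back and forth in later transformations.    */
   data work.orders_typed;
      set work.orders_stage;
      ship_days = input(ship_days_char, 8.);
      drop ship_days_char;
   run;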

Aggregate Columns for Efficiency

When you add column mappings, also consider the level of detail that is being retained. Ask these questions:
  • Is the data being processed at the right level of detail?
  • Can the data be aggregated in some way?
Aggregations and summarizations eliminate redundant information and reduce the number of records that have to be retained, processed, and loaded into a data collection.
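As a sketch of this idea, the step below (with hypothetical table and column names) summarizes detail rows to the level of detail that the target actually needs before any further processing:

   /* Aggregate to one row per customer per day, so that only the       */
   /* summarized rows are retained, processed, and loaded.              */
   proc sql;
      create table work.daily_sales as
      select order_date,
             customer_id,
             sum(amount) as total_amount,
             count(*)    as order_count
      from work.orders_stage
      group by order_date, customer_id;
   quit;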

Match the Size of Column Variables to Data Length

Verify that the size of the column variables in the data collection is appropriate to the data length. Consider both the current and future uses of the data:
  • Are the keys the right length for the current data?
  • Will the keys accommodate future growth?
  • Are the data sizes on other variables correct?
  • Do the data sizes need to be increased or decreased?
Data volumes multiply quickly, so ensure that the variables that are being stored in the data warehouse are the right size for the data.
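As a sketch of right-sizing, the step below uses a LENGTH statement (with hypothetical column names and sizes) to declare lengths that fit the data, with some allowance for growth, rather than accepting oversized defaults:

   /* Declare lengths that match the data before reading the            */
   /* staging table, so the stored variables are no larger than needed. */
   data dw.customer_dim;
      length customer_id   8     /* numeric surrogate key               */
             country_code  $ 3   /* ISO country codes fit in 3 bytes    */
             customer_name $ 60; /* sized to the longest expected name  */
      set work.customer_stage;
   run;

If an incoming character variable is longer than the declared length, SAS notes the multiple lengths in the log and truncates the values, so verify that no data is lost before shortening a column.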