Transform Data

Introduction

Use the Transform Data directive to filter data, manage columns, and summarize data in one or more Hadoop source tables.

Enable the Spark Runtime Target

Support for the Apache Spark runtime target is enabled in the Hadoop Configuration panel of the Configuration window. When Spark is enabled, new instances of the following directives use Spark by default:
  • Transform Data
  • Cleanse Data
  • Cluster-Survive Data
The default runtime target can be overridden using the Settings menu. To learn more about runtime targets, see Enable Support for Impala and Spark.
Note: Changing the default runtime target does not change the runtime target for saved directives. Saved directives continue to run with their existing runtime target unless they are opened, reconfigured, and saved.
Note: Enabling Spark changes the truncation of character columns, as described in the Usage Notes for Spark.

Example

The following example depicts the process of creating and running a directive that contains several transformations. The example opens a source table of customer information, filters rows using two rules, selects and manages columns for the target, and summarizes rows.
  1. On the SAS Data Loader directives page, click Transform Data. The Source Table task is displayed. For more information about data sources and tables, see Viewing Data Sources and Tables.
  2. In the Source Table task, click the schema that contains the source table that you will transform. When the tables appear, select the source table and click Next.
  3. In the Transformation task, click a transformation:
    • Click Filter Data to exclude unwanted rows from the target table using rules or a user-written expression.
    • Click Manage Columns to reorder, rename, or remove columns from the target table. You can also apply user-written expressions to modify column data or add new data to new columns. The Advanced Editor displays the available functions, provides syntax help, and loads function syntax into your expression.
    • Click Summarize Rows to group rows based on the values in one or more columns. For each group, you can generate summary aggregations from selected numeric columns.
    Your job can consist of one or more transformations. Multiple transformations are executed in the order in which you define them. A logical order for all three transformations is filter data, manage columns, and summarize rows, as illustrated in the sketch that follows this example.
  4. Click Filter Data.
  5. In the Filter Data transformation, click Specify rules or Specify expression.
  6. To specify rules, follow these steps:
    1. Select the first column, operator, and value. Note that the available operators change based on the type of the column. To learn about the available operators, see About the Operators in the Filter Data Transformation.
    2. Choose a logical operator (AND or OR).
    3. Select the second column, operator, and value.
    4. To add another rule, click Add Rule.
  7. To filter rows with a user-written expression, follow these steps:
    1. To paste in an existing expression, click the expression text box and press Ctrl+V (or the equivalent paste command for your system).
      Your expression can use either SAS DS2 functions (with the MapReduce runtime target) or DataFlux Expression Engine Language (EEL) functions (with the Spark runtime target). Click Settings to display the selected runtime target. To learn more about runtime targets, see Enable Support for Impala and Spark.
      To learn about the requirements for expressions, see Develop Expressions for Directives.
    2. To write an expression, expand Functions and the function categories in the Resources box.
    3. Select a function for your expression and click Add function syntax to expression. Refer to the syntax help at the bottom of the Resources box as needed.
    4. In the expression text box, replace the placeholders in the default function syntax with column names and values. To insert column names, expand Columns in the Resources box.
  8. After you define rules or an expression, you can click Next to complete the directive, select a target table, and run the directive. To instead add a Manage Columns or Summarize Rows transformation, click Add Another Transformation.
  9. In the Transformation task, click Manage Columns.
  10. In the Manage Columns transformation, you can reorder, rename, and remove source columns in the target, and you can change their types and lengths. You can also apply user-written expressions to modify column data or to generate new data for new columns.
  11. To reorder columns, use the vertical arrow icons. The first row in Selected columns becomes the first (leftmost) column in the target table.
    Selected Target Columns in Manage Columns Transformation
  12. To change the target column name, type, or length, click Target Name, Type, or Length.
  13. To remove a source column from the target, click the row in Selected columns and then click the Remove selected row icon.
  14. To replace existing column data with data that is generated by a user-written expression, click a row in Selected columns and then click Expression. You can then enter a new expression or paste an existing expression into the corresponding text field.
    Your expression can use either SAS DS2 functions (with the MapReduce runtime target) or DataFlux EEL functions (with the Spark runtime target). Click Settings to display the selected runtime target. To learn more about runtime targets, see Enable Support for Impala and Spark.
    To learn about the requirements for expressions, see Develop Expressions for Directives.
  15. To add a new target column and use the Advanced Editor to write an expression for that column, click the Add Column with Advanced Editor icon.
  16. To use the Advanced Editor, select functions and column names from the Resources box. When your expression is complete, select Save or Save New to return to the Manage Columns transformation. The new column appears at the bottom of Selected Columns.
    Advanced Editor for Expressions
  17. Click Add Another Transformation, and then, in the Transformation task, click Summarize Rows.
  18. In the Summarize Rows task, click Group rows by to specify a column whose values will be used to group rows. You can specify additional columns that will form subgroups. Each group and subgroup will receive a value in each aggregation column.
    Note: If your source data is in Hive 13 (0.13.0 or lower), then the Summarize Rows task does not handle special characters in column names. To resolve the issue, either rename the columns or ask your Hadoop administrator to upgrade to Hive 14 (0.14.0 or higher).
  19. Click Select a column to specify a summarization, and then select an aggregation. To learn about the available aggregations, see About the Aggregations in the Summarize Rows Transformation.
  20. Click New column name and enter or paste replacement names for the aggregation columns.
    Summary columns and aggregations
  21. When your summaries are complete, click Next to conclude your job.
  22. In the Target Table task, select the schema that contains or will contain your target table.
  23. Click the New table icon to create a new table, or click an existing table that will be overwritten by your job.
    Tip
    If you select a table and the View Profile icon is enabled, you can click that icon to display a profile report for that table.
  24. Click Next to display the Result task. In the Result task, click Save or Save As to save your directive. If you want to run your job now, click Start transforming data. Otherwise, you can run your job later from Saved Directives.
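
The following sketch is not SAS Data Loader syntax; it is a minimal, hypothetical illustration in plain Python of the logic that this example defines: filter rows with two rules joined by AND, manage (select and rename) columns, and summarize rows by grouping on one column and aggregating a numeric column. The column names and data are invented for illustration.

from statistics import mean

# Hypothetical rows that stand in for a Hadoop source table of customer data.
customers = [
    {"Region": "Europe", "PrefCustomer": 1, "AnnualIncome": 52000},
    {"Region": "Asia",   "PrefCustomer": 0, "AnnualIncome": 61000},
    {"Region": "Europe", "PrefCustomer": 1, "AnnualIncome": 48000},
    {"Region": "Asia",   "PrefCustomer": 1, "AnnualIncome": 75000},
]

# 1. Filter Data: keep rows that satisfy two rules joined with the AND logical operator.
filtered = [row for row in customers
            if row["PrefCustomer"] == 1 and row["AnnualIncome"] > 45000]

# 2. Manage Columns: keep only the columns selected for the target, in order,
#    renaming AnnualIncome to Income.
managed = [{"Region": row["Region"], "Income": row["AnnualIncome"]} for row in filtered]

# 3. Summarize Rows: group rows by Region and aggregate the Income column.
groups = {}
for row in managed:
    groups.setdefault(row["Region"], []).append(row["Income"])

summary = [{"Region": region, "Income_Count": len(values), "Income_Mean": mean(values)}
           for region, values in groups.items()]

print(summary)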

About the Operators in the Filter Data Transformation

The following entries describe the operators that are available in the Filter Data transformation. Each entry lists the source column data types that the operator supports (character, numeric, or datetime), followed by a description and examples.
Equal To
The Equal To operator is available for all source column data types: character, numeric, and datetime.
The source value is accepted and its row is written to the target table only when the source value exactly matches the comparator.
Character values can be case-sensitive. Blank spaces are included in the comparison.
Datetime values in the comparator use the SAS format DATETIME(w.p).
Gender Equal To Male
PrefCustomer Equal To 1
SaleDate Equal To 5/1/2014
Not Equal To
Character, numeric, and datetime
Accepts the source row when the column value is anything other than the comparator.
Region Not Equal To Europe
NumChildren Not Equal To 0
SaleDate Not Equal To 11/25/2013
Null
Character, numeric, and datetime
When the runtime target is MapReduce, the Null operator accepts the source row when the column value is NULL or if no source value is present. When the runtime target is Spark, the Null operator accepts only column values of NULL.
CreditScore Null
AnnualIncome Null
Not Null
Character, numeric, and datetime
Accepts the source row when the column value is present and when the value is not NULL.
PostalCode Not Null
PhoneNumber Not Null
In
Character and numeric
Accepts the source row when the column value matches, in its entirety, one of the constant values in the comparator. Enter the comparator as a vertical list of values, one entry per line, without commas. Blank spaces are interpreted literally. Case sensitivity can be enabled.
CarManuf In BMW                        
VW
Benz
WaistSize In
32
34
36
38
Not In
Character and numeric
Accepts the source row when the column value does not match any of the constant values in the comparator's list.
City Not In New York                      
 Chicago
 Los Angeles
WaistSize Not In 32
 34
36
38
Like
Character
Accepts the source row when the column value matches the pattern in the comparator. The source value and the comparator are compared on a character-by-character basis. Case sensitivity can be enabled.
Use the pattern-matching character % to indicate any string of characters. Use the underscore character _ to indicate any single character in that position. The sketch that follows this table illustrates this pattern matching.
Note that trailing blank characters are written to the target table when you use % at the end of the comparator.
Use the word escape to include literal instances of % and _ in the comparator.
SalesRegion Like NorthAmer%
AnnualSales Like 199_
CustSatisfaction Like 100 escape %
Not Like
Character
Accepts the source row when the column value does not match the pattern in the comparator. The source value and the comparator are compared on a character-by-character basis. Case sensitivity can be enabled. The pattern-matching characters % and _ and the escape keyword are valid as described for the Like operator.
Sports Not Like %ball
FootballFieldLength Not Like 100%
Contains
Character
Accepts the source row when the column value is found within the character string of the comparator. Case sensitivity can be enabled.
Address Contains IL
LicenseNumber Contains 7227
Not Contains
Character
Accepts the source row when the column value is not found within the character string of the comparator, or is null. Case sensitivity can be enabled.
Month Not Contains OctNovDec
SalesMonthly Not Contains 0
Between
Numeric and datetime
Accepts the source row when the column value or date is between the two values or dates in the comparator, but is not equal to either.
GradeAverage Between 87.5 93
DailySales Between December 20, 2014 December 27, 2014
Greater Than
Numeric
Accepts the source row when the column value is greater than the value of the comparator.
AnnualSales Greater Than 100000
Greater Than Or Equal To
Numeric
Accepts the source row when the column value is equal to the comparator or greater than the comparator.
CarsInFamily Greater Than or Equal To 3
Less Than
Numeric
Accepts the source row when the column value is less than the value of the comparator.
GamerAge Less Than 30
Less Than Or Equal To
Numeric
Accepts the source row when the column value is equal to the value of the comparator, or less than the value of the comparator.
SalesYear Less Than Or Equal To 2010
After
Datetime
Accepts the source row when the column date is later than the date in the comparator.
HomePurchaseDate After January 1, 2013
Before
Datetime
Accepts the source row when the column date is earlier than the date in the comparator.
BirthDate Before March 17, 1980
On Or After
Datetime
Accepts the source row when the column date is later than, or the same date as, the date in the comparator.
DailySales On Or After January 1, 2014
On Or Before
Datetime
Accepts the source row when the column date is earlier than, or the same date as, the date in the comparator.
DailySales On Or Before December 31, 2013
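
The Like and Not Like entries above describe pattern matching with the % and _ characters. The following sketch, written in plain Python rather than SAS Data Loader syntax, shows one way to read those wildcard rules: % stands for any string of characters and _ stands for any single character, with optional case sensitivity. The escape keyword is not modeled here, and the function name is invented for illustration.

import re

def like_match(value, comparator, case_sensitive=False):
    # Translate the comparator into a regular expression:
    # % becomes .* (any string of characters), _ becomes . (any single
    # character), and all other characters are matched literally.
    parts = []
    for ch in comparator:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.fullmatch("".join(parts), value, flags) is not None

print(like_match("NorthAmericaEast", "NorthAmer%"))  # True: SalesRegion Like NorthAmer%
print(like_match("1998", "199_"))                    # True: AnnualSales Like 199_
print(like_match("Basketball", "%ball"))             # True, so Sports Not Like %ball rejects the row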

About the Aggregations in the Summarize Rows Transformation

The aggregations that are available in the Summarize Rows transformation are defined as follows. A sketch after the definitions shows how several of them are calculated for a sample group.
Count
the number of rows in the group that contain valid values.
Count Distinct
the number of unique values in the column for each group.
Corrected Sum of Squares
measures variability or dispersion around the mean. To learn more about this (and other) statistical summaries, see the Introduction to Statistical Modeling with SAS/STAT Software.
Covariance
measures the strength of the correlation of the values in the group. A positive value indicates that values move in the same direction within the group. A negative value indicates that values move in opposite or random directions.
Max
the maximum value in the column for each group.
Mean
the arithmetic average of the values in the column for each group (the sum of the values divided by the count).
Min
the minimum value in the group.
Number of Missing Values
the number of rows in the group that contain a blank or NULL value.
Range
the difference between the lowest and highest values in the group.
Standard Deviation
measures the degree of variance, or the degree to which the values in the group deviate from the mean. A small value indicates little deviation. The standard deviation is the square root of the variance.
Standard Error
measures how accurately the calculated mean reflects the values in the group. The standard error of the mean is the standard deviation divided by the square root of the count. A smaller value indicates that the mean is a more accurate reflection of the values in the group.
Sum
the sum of the values in the column for each group.
Variance
the average of the squared differences from the mean, which measures the spread of the values in the group.
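
The following sketch, written in plain Python rather than SAS Data Loader syntax, shows how several of these aggregations are calculated for a single group of numeric values. It assumes the common sample (n - 1) definitions of variance and standard deviation and computes the standard error as the standard deviation divided by the square root of the count; Covariance and Number of Missing Values are omitted. The sample values are invented for illustration.

from math import sqrt
from statistics import mean, stdev, variance

values = [87.5, 90.0, 93.0, 88.5]            # one group of values from a numeric column

count = len(values)                          # Count
distinct = len(set(values))                  # Count Distinct
minimum, maximum = min(values), max(values)  # Min, Max
value_range = maximum - minimum              # Range
total = sum(values)                          # Sum
m = mean(values)                             # Mean (arithmetic average)
css = sum((x - m) ** 2 for x in values)      # Corrected Sum of Squares
var = variance(values)                       # Variance (here: css / (count - 1))
std = stdev(values)                          # Standard Deviation (square root of the variance)
stderr = std / sqrt(count)                   # Standard Error of the mean

print(count, distinct, minimum, maximum, value_range, total)
print(m, css, var, std, stderr)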