SAS/ACCESS Interface to Netezza |
Using SAS Formats |
SAS formats are basically mapping functions that change an element of data from one format to another. For example, some SAS formats change numeric values to various currency formats or date-and-time formats.
SAS supplies many formats. You can also use the SAS FORMAT procedure to define custom formats that replace raw data values with formatted character values. For example, this PROC FORMAT code creates a custom format called $REGION that maps ZIP codes to geographic regions.
proc format; value $region '02129', '03755', '10005' = 'Northeast' '27513', '27511', '27705' = 'Southeast' '92173', '97214', '94105' = 'Pacific'; run;
SAS programs frequently use both user-defined formats and formats that SAS supplies. Although they are referenced in numerous ways, using the PUT function in the SQL procedure is of particular interest for SAS In-Database processing.
The PUT function takes a format reference and a data item as input and returns a formatted value. This SQL procedure query uses the PUT function to summarize sales by region from a table of all customers:
select put(zipcode,$region.) as region, sum(sales) as sum_sales from sales.customers group by region;
The SAS SQL processor knows how to process the PUT function. Currently, SAS/ACCESS Interface to Netezza returns all rows of unformatted data in the SALES.CUSTOMERS table in the Netezza database to the SAS System for processing.
The SAS In-Database technology deploys, or publishes, the PUT function implementation to Netezza as a new function named SAS_PUT(). Similar to any other programming language function, the SAS_PUT() function can take one or more input parameters and return an output value.
The SAS_PUT() function supports use of SAS formats. You can specify the SAS_PUT() function in SQL queries that SAS submits to Netezza in one of two ways:
implicitly by enabling SAS to automatically map PUT function calls to SAS_PUT() function calls
explicitly by using the SAS_PUT() function directly in your SAS program
If you used the SAS_PUT() function in the previous example, Netezza formats the ZIP code values with the $REGION format and processes the GROUP BY clause using the formatted values.
By publishing the PUT function implementation to Netezza as the SAS_PUT() function, you can realize these advantages:
You can process the entire SQL query inside the database, which minimizes data transfer (I/O).
The SAS format processing leverages the scalable architecture of the DBMS.
The results are grouped by the formatted data and are extracted from the Netezza data warehouse.
Deploying SAS formats to execute inside a Netezza database can enhance performance and exploit Netezza parallel processing.
How It Works |
By using the SAS formats publishing macro, you can generate a SAS_PUT() function that enables you to execute PUT function calls inside the Netezza data warehouse. You can reference the formats that SAS supplies and most custom formats that you create by using PROC FORMAT.
The SAS formats publishing macro takes a SAS format catalog and publishes it to the Netezza data warehouse. Inside the Netezza data warehouse, a SAS_PUT() function, which emulates the PUT function, is created and registered for use in SQL queries.
Process Flow Diagram
Here is the basic process flow.
Install the components that are necessary for in-database processing in the Netezza data warehouse. For more information, see Deployed Components for In-Database Processing. Note: This is a one-time installation process. | |
If you need to, create your custom formats by using PROC FORMAT and use the LIBRARY= option to create a permanent catalog. For more information, see User-Defined Formats in the Netezza Data Warehouse and the FORMAT procedure in the Base SAS Procedures Guide. | |
Start SAS 9.2 and run the SAS publishing macros. For more information, see Publishing SAS Formats. | |
After the SAS_PUT() function is created, it is available to use in any SQL expression in the same way that Netezza built-in functions are used. For more information, see Using the SAS_PUT() Function in the Netezza Data Warehouse. |
Deployed Components for In-Database Processing |
Components that are deployed to Netezza for in-database processing are contained in a self-extracting TAR file on the SAS Software Depot.
The following components are deployed:
the SAS 9.2 Formats Library for Netezza. The library contains many formats that are available in Base SAS. After you install the SAS 9.2 Formats Library and run the %INDNZ_PUBLISH_FORMATS macro, the SAS_PUT() function can call these formats.
Note: The SAS Scoring Accelerator for Netezza also uses these libraries. For more information about this product, see the SAS Scoring Accelerator for Netezza: User's Guide.
the SAS Accelerator Publishing Agent. The SAS Accelerator Publishing Agent contains all macros that are needed for publishing the SAS 9.2 Formats Library (TwinFin systems only), the SAS_PUT() function, and user-defined formats for Netezza:
%INDNZ_PUBLISH_JAZLIB (TwinFin systems only). The %INDNZ_PUBLISH_JAZLIB macro publishes the SAS 9.2 Formats Library for Netezza.
%INDNZ_PUBLISH_COMPILEUDF. The %INDNZ_PUBLISH_COMPILEUDF macro creates the SAS_COMPILEUDF, SAS_DIRECTORYUDF, and SAS_HEXTOTEXTUDF functions that are needed to facilitate the publishing of the SAS_PUT() function and formats.
%INDNZ_PUBLISH_FORMATS. The %INDNZ_PUBLISH_FORMATS macro publishes the SAS_PUT() function and formats.
The %INDNZ_PUBLISH_JAZLIB and %INDNZ_PUBLISH_COMPILEUDF macros are typically run by your system or database administrator.
For more information about creating the SAS Software Depot, see your Software Order e-mail. For more information about installing and configuring these components, see the SAS In-Database Products: Administrator's Guide.
User-Defined Formats in the Netezza Data Warehouse |
You can use PROC FORMAT to create user-defined formats and store them in a format catalog. You can then use the %INDNZ_PUBLISH_FORMATS macro to export the user-defined format definitions to the Netezza data warehouse where the SAS_PUT() function can reference them.
If you use the FMTCAT= option to specify a format catalog in the %INDNZ_PUBLISH_FORMATS macro, these restrictions and limitations apply:
Trailing blanks in PROC FORMAT labels are lost when publishing a picture format.
Avoid using PICTURE formats with the MULTILABEL option. You cannot successfully create a CNTLOUT= data set when PICTURE formats are present. This a known issue with PROC FORMAT.
The following limitations apply if you are using a character set encoding other than Latin1:
Picture formats are not supported. The picture format supports only Latin1 characters.
If the format value's encoded string is longer than 256 bytes, the string is truncated and a warning is printed to the SAS log.
If you use the MULTILABEL option, only the first label that is found is returned. For more information, see the PROC FORMAT MULTILABEL option in the Base SAS Procedures Guide.
The %INDNZ_PUBLISH_FORMATS macro rejects a format unless the LANGUAGE= option is set to English or is not specified.
Although the format catalog can contain informats, the %INDNZ_PUBLISH_FORMATS macro ignores the informats.
User-defined formats that include a format that SAS supplies are not supported.
Publishing SAS Formats |
The SAS publishing macros are used to publish formats and the SAS_PUT() function in Netezza.
The %INDNZ_PUBLISH_FORMATS macro creates the files that are needed to build the SAS_PUT() function and publishes those files to the Netezza data warehouse.
This macro also registers the formats that are included in the SAS 9.2 Formats Library for Netezza. In addition to registering the formats that SAS supplies, you can publish the PROC FORMAT definitions that are contained in a single SAS format catalog. The process of publishing a PROC FORMAT catalog entry converts the value-range-sets, for example, 1='yes' 2='no', into embedded data in Netezza. For more information on value-range-sets, see PROC FORMAT in the Base SAS Procedures Guide.
The %INDNZ_PUBLISH_FORMATS macro performs the following tasks:
produces the set of .c, .cpp, and .h files that are necessary to build the SAS_PUT() function
produces a script of the Netezza commands that are necessary to register the SAS_PUT() function on the Netezza data warehouse
transfers the .c, .cpp, and .h files to Netezza using the Netezza External Table interface
calls the SAS_COMPILEUDF function to compile the source files into object files and to access the SAS 9.2 Formats Library for Netezza
uses SAS/ACCESS Interface to Netezza to run the script to create the SAS_PUT() function with the object files
To run the %INDNZ_PUBLISH_FORMATS macro, complete the following steps:
Start SAS 9.2 and submit these commands in the Program Editor or Enhanced Editor:
%indnzpf;
%let indconn = server=myserver user=myuserid password=XXXX
database=mydb;
The %INDNZPF macro is an autocall library that initializes the format publishing software.
The INDCONN macro variable is used as credentials to connect to Netezza. You must specify the server, user, password, and database to access the machine on which you have installed the Netezza data warehouse. You must assign the INDCONN macro variable before the %INDNZ_PUBLISH_FORMATS macro is invoked.
Here is the syntax for the value of the INDCONN macro variable:
SERVER=server USER=userid PASSWORD=password DATABASE=database |
Note: You can use only PASSWORD= or PW= for the password argument. Other aliases such as PASS= or PWD= are not supported and cause errors.
Note: The INDCONN macro variable is not passed as an argument to the %INDNZ_PUBLISH_FORMATS macro. This information can be concealed in your SAS job. You might want to place it in an autoexec file and set the permissions on the file so that others cannot access the user ID and password.
Run the %INDNZ_PUBLISH_FORMATS macro. For more information, see %INDNZ_PUBLISH_FORMATS Macro Syntax.
Messages are written to the SAS log that indicate whether the SAS_PUT() function was successfully created.
%INDNZ_PUBLISH_FORMATS
(
<, DATABASE=database-name> <. DBCOMPILE=database-name> <, DBJAZLIB=database-name> <, FMTCAT=format-catalog-filename> <, FMTTABLE=format-table-name> <, ACTION=CREATE | REPLACE | DROP> <, OUTDIR=diagnostic-output-directory> ); |
specifies the name of a Netezza database to which the SAS_PUT() function and the formats are published. This argument lets you publish the SAS_PUT() function and the formats to a shared database where other users can access them.
Interaction: | The database that is specified by the DATABASE= argument takes precedence over the database that you specify in the INDCONN macro variable. For more information, see Running the %INDNZ_PUBLISH_FORMATS Macro. |
Tip: | It is not necessary that the format definitions and the SAS_PUT() function reside in the same database as the one that contains the data that you want to format. You can use the SQLMAPPUTO= system option to specify the database where the format definitions and the SAS_PUT() function have been published. |
specifies the name of the database where the SAS_COMPILEUDF function was published.
Default: | SASLIB |
See: | For more information about the publishing the SAS_COMPILEUDF function, see the SAS In-Database Products: Administrator's Guide. |
specifies the name of the database where the SAS 9.2 Formats Library for Netezza was published.
Default: | SASLIB |
Restriction: | This argument is supported only on TwinFin systems. |
See: | For more information about publishing the SAS 9.2 Formats Library on TwinFin systems, see the SAS In-Database Products: Administrator's Guide. |
specifies the name of the format catalog file that contains all user-defined formats that were created with the FORMAT procedure and will be made available in Netezza.
Default: | If you do not specify a value for FMTCAT= and you have created user-defined formats in your SAS session, the default is WORK.FORMATS. If you do not specify a value for FMTCAT= and you have not created any user-defined formats in your SAS session, only the formats that SAS supplies are available in Netezza. |
Interaction: | If the format definitions that you want to publish exist in multiple catalogs, you must copy them into a single catalog for publishing. |
Interaction: | If you specify more than one format catalog using the FMTCAT argument, only the last catalog specified is published. |
Interaction: | If you do not use the default catalog name (FORMATS) or the default library (WORK or LIBRARY) when you create user-defined formats, you must use the FMTSEARCH system option to specify the location of the format catalog. For more information, see PROC FORMAT in the Base SAS Procedures Guide. |
See Also: | User-Defined Formats in the Netezza Data Warehouse |
specifies the name of the Netezza table that contains all formats that the %INDNZ_PUBLISH_FORMATS macro creates and that the SAS_PUT() function supports. The table contains the columns shown in Table 2.1.
Column Name | Description | ||||
---|---|---|---|---|---|
FMTNAME | specifies the name of the format. | ||||
SOURCE |
specifies the origin of the format. SOURCE can contain one of these
values:
|
Default: | If FMTTABLE is not specified, no table is created. You can see only the SAS_PUT() function. You cannot see the formats that are published by the macro. |
Interaction: | If ACTION=CREATE or ACTION=DROP is specified, messages are written to the SAS log that indicate the success or failure of the table creation or drop. |
specifies that the macro performs one of these actions:
creates a new SAS_PUT() function.
overwrites the current SAS_PUT() function, if a SAS_PUT() function is already registered or creates a new SAS_PUT() function if one is not registered.
causes the SAS_PUT() function to be dropped from the Netezza database.
Interaction: | If FMTTABLE= is specified, both the SAS_PUT() function and the format table are dropped. If the table name cannot be found or is incorrect, only the SAS_PUT() function is dropped. |
Default: | CREATE |
Tip: | If the SAS_PUT() function was published previously and you specify ACTION=CREATE or REPLACE, no warning is issued. Also, if you specify ACTION=DROP and the SAS_PUT() function does not exist, no warning is issued. |
specifies a directory that contains diagnostic files.
Files that are produced include an event log that contains detailed information about the success or failure of the publishing process.
See: | Special Characters in Directory Names |
Use the ACTION=CREATE option only the first time that you run the %INDNZ_PUBLISH_FORMATS macro. After that, use ACTION=REPLACE or ACTION=DROP.
The %INDNZ_PUBLISH_FORMATS macro does not require a format catalog. To publish only the formats that SAS supplies, you need to have either no format catalog or an empty format catalog. You can use this code to create an empty format catalog in your WORK directory before you publish the PUT function and the formats that SAS supplies:
proc format; run;
If you modify any PROC FORMAT entries in the source catalog, you must republish the entire catalog.
When SAS parses the PUT function, SAS checks to make sure that the format is a known format name. SAS looks for the format in the set of formats that are defined in the scope of the current SAS session. If the format name is not defined in the context of the current SAS session, the SAS_PUT() function is returned to the local SAS session for processing.
Using both the SQLREDUCEPUT= system option (or the PROC SQL REDUCEPUT= option) and SQLMAPPUTTO= can result in a significant performance boost. First, SQLREDUCEPUT= works to reduce as many PUT functions as possible. Then you can map the remaining PUT functions to SAS_PUT() functions, by setting SQLMAPPUTTO= SAS_PUT.
If the %INDNZ_PUBLISH_FORMATS macro is executed between two procedure calls, the page number of the last query output is increased by two.
If the directory names that are used in the macros contain any of the following special characters, you must mask the characters by using the %STR macro quoting function. For more information, see the %STR function and macro string quoting topic in SAS Macro Language: Reference.
Character | How to Represent |
---|---|
blank1 | %str() |
*2 | %str(*) |
; | %str(;) |
, (comma) | %str(,) |
= | %str(=) |
+ | %str(+) |
- | %str(-) |
> | %str(>) |
< | %str(<) |
^ | %str(^) |
| | %str(|) |
& | %str(&) |
# | %str(#) |
/ | %str(/) |
~ | %str(~) |
% | %str(%%) |
' | %str(%') |
" | %str(%") |
( | %str(%() |
) | %str(%)) |
¬ | %str(¬) |
1
Only leading blanks require the %STR function, but you
should avoid
using leading blanks in directory names.
2 Asterisks (*) are allowed in UNIX directory names. Asterisks are not allowed in Windows directory names. In general, avoid using asterisks in directory names. |
Here are some examples of directory names with special characters:
Directory | Code Representation |
---|---|
c:\temp\Sales(part1) |
c:\temp\Sales%str(%()part1%str(%)) |
c:\temp\Drug "trial" X |
c:\temp\Drug %str(%")trial(%str(%") X |
c:\temp\Disc's 50% Y |
c:\temp\Disc%str(%')s 50%str(%%) Y |
c:\temp\Pay,Emp=Z |
c:\temp\Pay%str(,)Emp%str(=)Z |
You must have permission to create the SAS_PUT() function and formats, and tables in the Netezza database. You must also have permission to execute the SAS_COMPILEUDF, SAS_DIRECTORYUDF, and SAS_HEXTOTEXTUDF functions in either the SASLIB database or the database specified in lieu of SASLIB where these functions are published.
Without these permissions, the publishing of the SAS_PUT() function and formats fail. To obtain these permissions, contact your database administrator.
For more information on specific permissions, see the SAS In-Database Products: Administrator's Guide.
%indnzpf;
%let indconn = server=netezbase user=user1 password=open1 database=mydb;
%indnz_publish_formats(fmtcat= fmtlib.fmtcat);
This sequence of macros generates .c, .cpp, and .h files for each data type. The format data types that are supported are numeric (FLOAT, INT), character, date, time, and timestamp (DATETIME). The %INDNZ_PUBLISH_FORMATS macro also produces a text file of Netezza CREATE FUNCTION commands that are similar to these:
CREATE FUNCTION sas_put(float , varchar(256)) RETURNS VARCHAR(256) LANGUAGE CPP PARAMETER STYLE npsgeneric CALLED ON NULL INPUT EXTERNAL CLASS NAME 'Csas_putn' EXTERNAL HOST OBJECT '/tmp/tempdir_20090528T135753_616784/formal5.o_x86' EXTERNAL NSPU OBJECT '/tmp/tempdir_20090528T135753_616784/formal5.o_diab_ppc'
After it is installed, you can call the SAS_PUT() function in Netezza by using SQL. For more information, see Using the SAS_PUT() Function in the Netezza Data Warehouse.
Using the SAS_PUT() Function in the Netezza Data Warehouse |
After you install the formats that SAS supplies in libraries inside the Netezza data warehouse and publish any custom format definitions that you created in SAS, you can access the SAS_PUT() function with your SQL queries.
If the SQLMAPPUTTO= system option is set to SAS_PUT and you submit your program from a SAS session, the SAS SQL processor maps PUT function calls to SAS_PUT() function references that Netezza understands.
This example illustrates how the PUT function is mapped to the SAS_PUT() function using implicit pass-through.
options sqlmapputto=sas_put; %put &mapconn; libname dblib netezza &mapconn; /*-- Set SQL debug global options --*/ /*----------------------------------*/ options sastrace=',,,d' sastraceloc=saslog; /*-- Execute SQL using Implicit Passthru --*/ /*-----------------------------------------*/ proc sql noerrorstop; title1 'Test SAS_PUT using Implicit Passthru '; select distinct PUT(PRICE,Dollar8.2) AS PRICE_C from dblib.mailorderdemo; quit;
These lines are written to the SAS log.
options sqlmapputto=sas_put; %put &mapconn; user=dbitext password=dbigrp1 server=spubox database=TESTDB sql_functions="EXTERNAL_APPEND=WORK.dbfuncext" sql_functions_copy=saslog; libname dblib netezza &mapconn; NOTE: Libref DBLIB was successfully assigned, as follows: Engine: NETEZZA Physical Name: spubox /*-- Set SQL debug global options --*/ /*----------------------------------*/ options sastrace=',,,d' sastraceloc=saslog; /*-- Execute SQL using Implicit Passthru --*/ /*-----------------------------------------*/ proc sql noerrorstop; title1 'Test SAS_PUT using Implicit Passthru '; select distinct PUT(PRICE,Dollar8.2) AS PRICE_C from dblib.mailorderdemo ; NETEZZA: AUTOCOMMIT is NO for connection 1 NETEZZA: AUTOCOMMIT turned ON for connection id 1 NETEZZA_1: Prepared: on connection 1 SELECT * FROM mailorderdemo NETEZZA: AUTOCOMMIT is NO for connection 2 NETEZZA: AUTOCOMMIT turned ON for connection id 2 NETEZZA_2: Prepared: on connection 2 select distinct cast(sas_put(mailorderdemo."PRICE", 'DOLLAR8.2') as char(8)) as PRICE_C from mailorderdemo NETEZZA_3: Executed: on connection 2 Prepared statement NETEZZA_2 ACCESS ENGINE: SQL statement was passed to the DBMS for fetching data. Test SAS_PUT using Implicit Passthru 9 13:42 Thursday, May 7, 2009 PRICE_C _______ $10.00 $12.00 $13.59 $48.99 $54.00 $8.00 $14.00 $27.98 $13.99 quit;
Be aware of these items:
The SQLMAPPUTTO= system option must be set to SAS_PUT to ensure that the SQL processor maps your PUT functions to the SAS_PUT() function and the SAS_PUT() reference is passed through to Netezza.
The SAS SQL processor translates the PUT function in the SQL SELECT statement into a reference to the SAS_PUT() function.
select distinct cast(sas_put("sas"."mailorderdemo"."PRICE", 'DOLLAR8.2') as char(8)) as "PRICE_C" from "sas"."mailorderdemo"
A large value, VARCHAR(n), is always returned because one function prototype accesses all formats. Use the CAST expression to reduce the width of the returned column to be a character width that is reasonable for the format that is being used.
The return text cannot contain a binary zero value (hexadecimal 00) because the SAS_PUT() function always returns a VARCHAR(n) data type and a Netezza VARCHAR(n) is defined to be a null-terminated string.
The format of the SAS_PUT() function parallels that of the PUT function:
SAS_PUT(source, 'format.')
The SELECT DISTINCT clause executes inside Netezza, and the processing is distributed across all available data nodes. Netezza formats the price values with the $DOLLAR8.2 format and processes the SELECT DISTINCT clause using the formatted values.
If you use explicit pass-through (direct connection to Netezza), you can use the SAS_PUT() function call in your SQL program.
This example shows the same query from Implicit Use of the SAS_PUT() Function and explicitly uses the SAS_PUT() function call.
options sqlmapputto=sas_put sastrace=',,,d' sastraceloc=saslog; proc sql noerrorstop; title1 'Test SAS_PUT using Explicit Passthru'; connect to netezza (user=dbitest password=XXXXXXX database=testdb server=spubox); select * from connection to netezza (select distinct cast(sas_put("PRICE",'DOLLAR8.2') as char(8)) as "PRICE_C" from mailorderdemo); disconnect from netezza; quit;
The following lines are written to the SAS log.
options sqlmapputto=sas_put sastrace=',,,d' sastraceloc=saslog; proc sql noerrorstop; title1 'Test SAS_PUT using Explicit Passthru'; connect to netezza (user=dbitest password=XXXXXXX database=testdb server=spubox); select * from connection to netezza (select distinct cast(sas_put("PRICE",'DOLLAR8.2') as char(8)) as "PRICE_C" from mailorderdemo); Test SAS_PUT using Explicit Passthru 2 17:13 Thursday, May 7, 2009 PRICE_C _______ $27.98 $10.00 $12.00 $13.59 $48.99 $54.00 $13.98 $8.00 $14.00 disconnect from netezza; quit;
Note: If you explicitly use the SAS_PUT() function in your code, it is recommended that you use double quotation marks around a column name to avoid any ambiguity with the keywords. For example, if you did not use double quotation marks around the column name, DATE, in this example, all date values would be returned as today's date.
select distinct cast(sas_put("price", 'dollar8.2') as char(8)) as "price_c", cast(sas_put("date", 'date9.1') as char(9)) as "date_d", cast(sas_put("inv", 'best8.') as char(8)) as "inv_n", cast(sas_put("name", '$32.') as char(32)) as "name_n" from mailorderdemo;
Determining Format Publish Dates |
You might need to know when user-defined formats or formats that SAS supplies were published. SAS supplies two special formats that return a datetime value that indicates when this occurred.
The INTRINSIC-CRDATE format returns a datetime value that indicates when the SAS 9.2 Formats Library was published.
The UFMT-CRDATE format returns a datetime value that indicates when the user-defined formats were published.
Note: You must use the SQL pass-through facility to return the datetime value associated with the INTRINSIC-CRDATE and UFMT-CRDATE formats, as illustrated in this example.
proc sql noerrorstop; connect to &netezza (&connopt); title 'Publish date of SAS Format Library'; select * from connection to &netezza ( select sas_put(1, 'intrinsic-crdate.') as sas_fmts_datetime; ); title 'Publish date of user-defined formats'; select * from connection to &netezza ( select sas_put(1, 'ufmt-crdate.') as my_formats_datetime; ); disconnect from netezza; quit;
Copyright © 2010 by SAS Institute Inc., Cary, NC, USA. All rights reserved.