Bulk Loading for Greenplum

Overview

Bulk loading provides high-performance access to external data sources. Multiple Greenplum instances read data in parallel, which enhances performance.
Bulk loading enables you to insert large data sets into Greenplum tables in the shortest span of time. You can also use bulk loading to execute high-performance SQL queries against external data sources, without first loading those data sources into a Greenplum database. These fast SQL queries enable you to optimize the extraction, transformation, and loading tasks that are common in data warehousing.
Two types of external data sources, external tables and Web tables, have different access methods. External tables contain static data that can be scanned multiple times. The data does not change during queries. Web tables provide access to dynamic data sources as if those sources were regular database tables. Web tables cannot be scanned multiple times. The data can change during the course of a query.
The following sections show you how to access external tables and Web tables using the bulk-loading facility.

Using Protocols to Access External Tables

Use these protocols to access (static) external tables.
gpfdist://
To use the gpfdist:// protocol, install and configure the gpfdist (Greenplum file distribution) program on the host that stores the external tables see Configuring the File Server. The gpfdist utility serves external tables in parallel to the primary Greenplum database segments. The gpfdist:// protocol is advantageous because it ensures that all Greenplum database segments are used during the loading of external tables.
To specify files to gpfdist, use the BL_DATAFILE= data set option. Specify file paths that are relative to the directory from which gpfdist is serving files (the directory where you executed gpfdist).
The gpfdist utility is part of the loader package for the platform where SAS is running. You can also download it from the Greenplum Web site: www.greenplum.com.
file://
To use the file:// protocol, external tables must reside on a segment host in a location that Greenplum superusers (gpadmin) can access. The segment host name must match the host name, as specified in the gp_configuration system catalog table. In other words, the external tables that you want to load must reside on a host that is part of the set of servers that comprise the database configuration. The file:// protocol is advantageous because it does not require configuration.

Configuring the File Server

Follow these steps to configure the gpfdist file server.
  1. Download and install gpfdist from www.greenplum.com.
  2. Define and load a new environment variable called GPLOAD_HOME.
  3. Set the value of the variable to the directory that contains the external tables that you want to load.
    The directory path must be relative to the directory in which you execute gpfdist, and it must exist before gpfdist tries to access it.
    • For Windows, open My Computer, select the Advanced tab, and click the Environment Variables button.
    • For UNIX, enter this command or add it to your profile:
      export GPLOAD_HOME=directory
  4. Start gpfdist as shown in these examples.
    • For Windows:
      C:> gpfdist -d %GPLOAD_HOME% -p 8081 -l %GPLOAD_HOME%\gpfdist.log
    • For UNIX:
      $ gpfdist -d $GPLOAD_HOME -p 8081 -l $GPLOAD_HOME/gpfdist.log &
You can run multiple instances of gpfdist on the same host as long each instance has a unique port and directory.
If you do not set GPLOAD_HOME, the value of the BL_DATAFILE= data set option specifies the directory that contains the external tables to be loaded. If BL_DATAFILE is not specified, the current directory is assumed to contain the external tables.

Stopping gpfdist

In Windows, to stop an instance of gpfdist, use the Task Manager or close the Command Window that you used to start that instance of gpfdist.
Follow these steps In UNIX to stop an instance of gpfdist.
  1. Find the process ID:
    $ ps ax | grep gpfdist (Linux)
    $ ps -ef | grep gpfdist (Solaris)
  2. Kill the process. Here is an example:
    $ kill 3456

Troubleshooting gpfdist

Run this command to test connectivity between an instance of gpfdist and a Greenplum database segment.
$ wget http://gpfdist_hostname:port/filename

Using the file:// Protocol

You can use the file:// protocol to identify external files for bulk loading with no additional configuration required. However, using the GPLOAD_HOME environment variable is highly recommended. If you do not specify GPLOAD_HOME, the BL_DATAFILE data set option specifies the source directory. The default source directory is the current directory if you do not set BL_DATAFILE=. The Greenplum server must have access to the source directory.

Accessing Dynamic Data in Web Tables

Use these data set options to access Web tables:

Examples

This first example shows how you can use a SAS data set, SASFLT.FLT98, to create and load a Greenplum table, FLIGHTS98.
libname sasflt 'SAS-data-library';
libname mydblib greenplm host=iqsvr1 server=iqsrv1_users
    db=users user=iqusr1 password=iqpwd1;

proc sql;
create table net_air.flights98
    (BULKLOAD=YES
     BL_DATAFILE='c:\temp\greenplum\data.dat'
     BL_USE_PIPE=NO
     BL_DELETE_DATAFILE=yes
     BL_HOST='192.168.x.x'
     BL_PORT=8081)
    as select * from sasflt.flt98;
quit;
This next example shows how you can append the SAS data set, SASFLT.FLT98, to the existing Greenplum table ALLFLIGHTS. The BL_USE_PIPE=NO option forces SAS/ACCESS Interface to Greenplum to write data to a flat file, as specified in the BL_DATAFILE= option. Rather than deleting the data file, BL_DELETE_DATAFILE=NO causes the engine to leave it after the load has completed.
proc append base=new_air.flights98
    (BULKLOAD=YES
     BL_DATAFILE='c:\temp\greenplum\data.dat'
     BL_USE_PIPE=NO
     BL_DELETE_DATAFILE=yes
     BL_HOST='192.168.x.x'
     BL_PORT=8081)
    data=sasflt.flt98;
run;