The team at Groceryrama
and GreenVillage can begin to act by designing a system for dealing
with their data needs and executing the processes defined in that
system. Start the design phase by taking stock of all the different
structures, formats, data sources, and data feeds identified in the define and
discover phases of the data management methodology. Then you can create
an environment that accommodates the needs of your business. In the
design phase, you consolidate and coordinate your data management
activities by concentrating on the following imperatives:
- Consistency of rules: Ultimately, an organization needs one set of
  business rules that can be stored centrally but deployed across all
  data sources, applications, and lines of business.
- Consistency of the data model: The data model is the single, definitive
  source for how your data maps to your business. Through the process
  of creating a well-structured data model, you identify the appropriate
  source systems and begin to reconcile multiple views, if required.
- Consistency of business processes: During the planning stage, you
  identify processes that are potentially impacted. Now, the task is to
  provide consistency across these processes.
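The first imperative, one rule set stored centrally and applied everywhere, can be illustrated with a minimal Python sketch. This is not SAS or DataFlux code. The rule names, the checks, and the `validate` helper are hypothetical and exist only to show the idea of sharing one rule definition across sources.

```python
import re

# One central rule set. The fields and patterns are hypothetical examples,
# not rules taken from any SAS product.
RULES = {
    "postal_code": lambda v: bool(re.fullmatch(r"\d{5}(-\d{4})?", v or "")),
    "state": lambda v: bool(re.fullmatch(r"[A-Z]{2}", v or "")),
}


def validate(record, rules=RULES):
    """Return the names of the fields in the record that break a rule."""
    return [
        field
        for field, check in rules.items()
        if field in record and not check(record[field])
    ]
```

Because every source calls the same `validate` function, a change to a rule takes effect for all of them at once.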
This is the time to
gather teams of business analysts, data architects, and IT specialists
and begin to make practical decisions about how the data will be organized
and regulated.
For example, you need
to make sure that the content of the data files works together. Do
your customer names have the same format? Or do some sources put the surname
first and others put it last? Are all of the dates in the same format?
Do your product lists contain duplicated records? These questions
and others like them can be addressed with the data quality and master
data management functions in applications such as DataFlux Data Management
Studio and SAS Master Data Management.
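To make these questions concrete, here is a minimal Python sketch of the three checks: name order, date format, and duplicated product records. It is an illustration only and does not reproduce how DataFlux Data Management Studio or SAS Master Data Management work. The list of date formats and all function names are assumptions.

```python
from __future__ import annotations

from datetime import datetime

# Hypothetical date formats that different source systems might use.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def normalize_name(raw: str) -> str:
    """Return a name as 'Given Surname', accepting 'Surname, Given' input."""
    raw = " ".join(raw.split())  # collapse repeated whitespace
    if "," in raw:
        surname, given = (part.strip() for part in raw.split(",", 1))
        return f"{given} {surname}"
    return raw


def normalize_date(raw: str) -> str:
    """Return the date in ISO 8601 form, trying each known format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def find_duplicates(products: list[str]) -> set[str]:
    """Return product names that occur more than once, ignoring case and spacing."""
    seen: set[str] = set()
    duplicates: set[str] = set()
    for product in products:
        key = " ".join(product.lower().split())
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates
```

The order of `DATE_FORMATS` matters when a value is ambiguous, so a real job would record which format each source uses rather than guess.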
Another common data
problem is choosing among similar, but not identical, rows of data.
For example, you might have records in the supplier information tables
for both of the merging entities that appear to refer to the same
supplier. The supplier records for Acme Pork Products for one company
spell out the name of the state where the supplier is located and
include a postal code. However, the supplier records for the other
company use a two-letter state code and omit the postal code. Which
supplier records should the merged Groceryrama and GreenVillage enterprise
use?
You can use the entity
resolution tools in SAS Master Data Management to diagnose the extent
of the problem. Then you can designate one survivor record for suppliers
that you can use throughout the enterprise. Business users have established
how the data and rules should be defined. The IT staff can now ensure
that databases and applications comply with the definitions.
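The idea of a survivor record can be sketched in plain Python. This is not the entity resolution logic in SAS Master Data Management. The state mapping, the sample field names, and the survivorship rule (prefer the most complete record, fill gaps from the others, standardize on the two-letter state code) are assumptions chosen for illustration.

```python
from __future__ import annotations

# Hypothetical mapping; a real job would cover every state that appears.
STATE_CODES = {"north carolina": "NC", "iowa": "IA"}


def match_key(record: dict) -> tuple:
    """Build a comparison key so spelled-out and two-letter states match."""
    state = record["state"].strip()
    code = STATE_CODES.get(state.lower(), state.upper())
    name = " ".join(record["name"].lower().split())
    return (name, code)


def choose_survivor(cluster: list[dict]) -> dict:
    """Merge matching records, taking each field from the most complete record first."""
    ranked = sorted(
        cluster,
        key=lambda r: sum(1 for v in r.values() if v),
        reverse=True,
    )
    survivor: dict = {}
    for record in ranked:
        for field, value in record.items():
            if value and not survivor.get(field):
                survivor[field] = value
    survivor["state"] = match_key(survivor)[1]  # standardize on the two-letter code
    return survivor


def resolve(records: list[dict]) -> list[dict]:
    """Group records by match key and return one survivor per group."""
    clusters: dict = {}
    for record in records:
        clusters.setdefault(match_key(record), []).append(record)
    return [choose_survivor(cluster) for cluster in clusters.values()]
```

Given one record with the state spelled out plus a postal code and another with only the two-letter code, `resolve` returns a single record that carries the two-letter code and the postal code.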
SAS Data Integration
Studio is an enterprise extract, transform, load (ETL) application
that is designed to improve the productivity of SAS programmers who do data
preparation and data management. It contains several features that help you get
your data working together. First, you can register your data as metadata
and group it into libraries of source and target tables. Then, you
add the items in the libraries to job flows that enable you to perform
the extract, transform, and load tasks at the core of data integration.
Finally, you can deploy these jobs and schedule them for execution
in batches. The application also supports related processes such as
SQL queries, table loading, analytics, and reporting.
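The extract, transform, and load steps that such a job flow chains together can be sketched in a few lines of Python. This is a generic illustration that uses the standard `csv` and `sqlite3` modules, not SAS Data Integration Studio. The `products` table, its columns, and the function names are hypothetical.

```python
import csv
import io
import sqlite3


def extract(source_text):
    """Extract: read delimited rows from a source (here, CSV text in memory)."""
    return list(csv.DictReader(io.StringIO(source_text)))


def transform(rows):
    """Transform: trim identifiers, tidy names, and convert prices to numbers."""
    return [
        (row["sku"].strip(), row["name"].strip().title(), float(row["price"]))
        for row in rows
    ]


def load(conn, rows):
    """Load: write the transformed rows to the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(sku TEXT PRIMARY KEY, name TEXT, price REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO products VALUES (?, ?, ?)", rows)
    conn.commit()


def run_job(conn, source_text):
    """Run the three steps in order and return the number of rows loaded."""
    rows = transform(extract(source_text))
    load(conn, rows)
    return len(rows)
```

Keeping each step in its own function mirrors the way a job flow separates source tables, transformations, and target tables, so that a step can be changed without touching the others.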
DataFlux Data Management
Studio supports data job flows and process job flows to improve data
quality. DataFlux Data Management Studio is designed to work with
DataFlux Data Management Server. Any authorized user can review and
work with the jobs on the server. SAS Visual Process Orchestration
adds the ability to integrate executable files from various systems
into a single process flow. It enables you to build orchestration
jobs, which are process jobs that run other jobs.
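An orchestration job, in the sense of a job that runs other jobs, can be approximated with a short Python sketch. It does not represent SAS Visual Process Orchestration. The `run_orchestration` and `executable_job` helpers and the stop-on-first-failure policy are assumptions made for illustration.

```python
import subprocess
import sys


def executable_job(args):
    """Wrap an external command so that it can be used as one step in a flow."""
    def job():
        # check=True raises CalledProcessError on a nonzero exit code.
        subprocess.run(args, check=True)
    return job


def run_orchestration(jobs):
    """Run (name, callable) pairs in order and skip the rest after the first failure."""
    status = {}
    failed = False
    for name, job in jobs:
        if failed:
            status[name] = "skipped"
            continue
        try:
            job()
            status[name] = "ok"
        except Exception as exc:
            status[name] = f"failed: {exc}"
            failed = True
    return status
```

For example, `run_orchestration([("profile", executable_job([sys.executable, "-c", "pass"]))])` runs a trivial external Python process as a single step and reports its status.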
SAS Data Integration
Studio and DataFlux Data Management Studio are often used by data
management specialists. SAS Data Loader brings data management within
reach of the general business user. SAS Data Loader simplifies the
process of working with large distributed Hadoop data sources. It
provides a series of wizard-based directives that help you perform
tasks like transforming, profiling, and querying data in Hadoop.
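As a rough idea of what profiling a column involves, the following Python sketch computes a few basic statistics. It is not a SAS Data Loader directive and does not touch Hadoop. The `profile_column` name and the choice of statistics are assumptions.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: row count, missing count, distinct count, top value."""
    non_missing = [v for v in values if v not in (None, "")]
    counts = Counter(non_missing)
    return {
        "rows": len(values),
        "missing": len(values) - len(non_missing),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }
```

A profile like this is usually the first step before transforming data, because it shows how many values are missing and whether a column holds more variants than expected.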
You can use SAS software
to quickly and efficiently acquire data from a wide variety of data
sources. For example, you can use
SAS/ACCESS interfaces for critical
sources such as the following:
- Oracle, Sybase, DB2, Microsoft SQL Server, and Teradata databases
- data from enterprise resource planning applications such as SAS
Then, you can use the
external file wizards in SAS Data Integration Studio to acquire data
from fixed-width, delimited, and user-written external files. You
can also use SAS Data Integration Studio to register metadata from
all of your data sources into libraries that are used in jobs. You
can work with all of this data in your SAS applications.
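The difference between fixed-width and delimited external files can be shown with a small Python sketch. This is a generic illustration, not the external file wizards in SAS Data Integration Studio. The `LAYOUT` column positions and the field names are hypothetical.

```python
import csv
import io

# Hypothetical layout: (field name, start column, end column),
# zero-based and end-exclusive.
LAYOUT = [("sku", 0, 6), ("name", 6, 26), ("qty", 26, 30)]


def read_fixed_width(text, layout):
    """Slice each non-blank line at the column positions given in the layout."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        rows.append({name: line[start:end].strip() for name, start, end in layout})
    return rows


def read_delimited(text, delimiter=","):
    """Split each line on the delimiter, using the first line as the header."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
```

A fixed-width file needs the layout supplied from outside because the file itself carries no field boundaries, whereas a delimited file with a header row describes its own columns.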