BI Case Study: Grid Computing Accelerates BI Analytics

National Institute of Environmental Health Sciences
(NIEHS)
At the National Institute of Environmental Health Sciences (NIEHS),
life-saving research is a core competency. So when NIEHS scientists felt
that a lack of computing and analytic horsepower for their data warehouse
was slowing down their groundbreaking research into the environmental
causes of cancer, they developed a unique solution that marries
conventional approaches to data mining with a technology called grid
computing.
According to NIEHS officials, this strategy paid off beyond even the
most optimistic expectations of their researchers.
Over the last decade, NIEHS researchers advanced our understanding of
human biology by identifying the first breast cancer gene, along with a
gene that suppresses prostate cancer. It was at NIEHS that researchers
first demonstrated the deadly effects of asbestos exposure, the
developmental impairment of children exposed to lead, and the health
effects associated with urban pollution.
The groundbreaking work done at NIEHS is enabled first and foremost by
the innovations of its researchers, one of whom was a recipient of the
1994 Nobel Prize in Medicine. But NIEHS researchers also rely on
sophisticated data modeling, data mining, and analytic software programs.
It’s doubtful, after all, that even the most innovative of Nobel
Prize-winning research scientists could make sense of the more than three
billion chemical base pairs in human DNA without the help of a data
warehouse.
Not surprisingly, then, NIEHS scientists rely upon a variety of
homegrown and commercial data mining and analytic software applications to
support their research efforts. Among other packaged software vendors,
NIEHS has tapped the expertise of SAS Institute Inc., perhaps because
they’re next door neighbors—SAS is in Cary, North Carolina, NIEHS in
Research Triangle Park—or, more likely, because SAS is one of the most
respected names in data mining and analytics. Regardless, NIEHS has
deployed SAS Enterprise Miner data mining software, along with other SAS
applications.
According to IT security officer and systems administrator Roy Reter,
NIEHS research scientists are using SAS to mine extremely large datasets
that can tax even the beefiest of server hardware platforms. These
datasets aggregate not just human genetic data, but also air quality data
and other environmental variables. As a result, Reter admits, he wasn’t
all that surprised when a team of scientists performing research into the
environmental causes of cancer expressed frustration with the limitations
of their server horsepower. They were dealing, after all, with
multi-terabyte datasets.
What did surprise him, Reter acknowledges, was a suggestion from one
NIEHS researcher that they use a distributed processing technology called
grid computing to scale SAS’ data mining software across several dozen
different servers. “We had built several grid computers for our scientists
to use, and one of them in particular came to us with the idea that we
could do [environmental cancer research] on a grid,” he explains. (There’s
more about grid computing in our BI Backgrounder, which follows this case
study.)
Intrigued, NIEHS researchers contacted a colleague at SAS, who helped
them to load instances of SAS on 32 individual Linux servers. NIEHS and
SAS researchers then used a SAS tool called SAS/Connect to intelligently
distribute application data to each of the servers. SAS/Connect is based
on a SAS feature called MP Connect, which allows multiple SAS sessions to
run in parallel, each comprising an instance of a larger application.
Theoretically, MP Connect can distribute a workload to an unlimited number
of systems across a network.
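To make the idea concrete, here is a minimal Python sketch of the scatter-and-gather pattern that this kind of parallel processing implies. It is not SAS code and does not use SAS/Connect or MP Connect: worker threads stand in for the remote SAS sessions, and the names split_into_chunks, analyze_chunk, and grid_mean are hypothetical, as is the choice of a simple mean as the analysis.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, n_workers):
    """Split rows into at most n_workers contiguous, roughly equal chunks."""
    size, extra = divmod(len(rows), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when rows < n_workers
            chunks.append(rows[start:end])
        start = end
    return chunks


def analyze_chunk(chunk):
    """Hypothetical per-session work: return a partial sum and a count."""
    return sum(chunk), len(chunk)


def grid_mean(rows, n_workers=32):
    """Scatter chunks to workers, then gather the partial results."""
    if not rows:
        raise ValueError("rows must not be empty")
    chunks = split_into_chunks(rows, n_workers)
    # Threads stand in for remote sessions (an assumption for illustration).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(analyze_chunk, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


if __name__ == "__main__":
    print(grid_mean(list(range(100))))
```

The caller only ever sees grid_mean; the splitting and recombining happen behind it. That is the property Reter describes below, where the application behaves as if it were talking to a single instance.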
The result was a grouping of 32 different servers—what’s called a
grid—that are connected as a single system. The beauty of a technology
like SAS/Connect, Reter says, is that it allows an application to run
unmodified across a grid. As a result, NIEHS researchers didn’t have to
re-write their applications to communicate with each of the distributed
instances of SAS. As far as the applications are concerned, then, they’re
communicating with one (admittedly supercharged) instance of SAS.
The upshot, says Reter, is that deploying applications to exploit
multiple instances of SAS would have been almost impossible outside of the
context of a grid: “Basically, you’d be looking at manually trying to
break up a process against 32 computers, so you’re looking at the cost of
32 computers, and then having to run their own instance, then you manually
having to break apart that process.”
When NIEHS researchers finally tested their applications on the SAS
grid, they found that the scalability benefits were enormous, says Reter.
“The SAS grid has helped us to reduce by up to 95 percent in some cases
the execution time required for these key projects,” he reports. “I think
the particular test that we did with this ran just short of a day, and
probably just running one piece of this on one computer would have taken
over a week.”
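Reter’s figures can be sanity-checked with a little arithmetic. The Python sketch below converts a fractional cut in execution time into a speedup factor, then divides by the number of servers to get a rough parallel efficiency. The function names are illustrative, and the calculation assumes that the 95 percent best case applies to a job spread across all 32 servers, which the article does not state.

```python
def speedup_from_reduction(reduction):
    """Speedup factor implied by a fractional cut in execution time."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def parallel_efficiency(speedup, n_servers):
    """Fraction of ideal linear scaling achieved across n_servers."""
    if n_servers <= 0:
        raise ValueError("n_servers must be positive")
    return speedup / n_servers


if __name__ == "__main__":
    best_case = speedup_from_reduction(0.95)
    print(round(best_case, 1))                           # 20.0
    print(round(parallel_efficiency(best_case, 32), 3))  # 0.625
```

A 95 percent reduction works out to roughly a 20-fold speedup, or a little over 60 percent of ideal scaling on 32 servers under the assumption above. Reter’s anecdote of “over a week” versus “just short of a day” implies a speedup of at least seven-fold, which fits within that best case.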
With 32 Intel servers and instances of SAS running on each of them, the
NIEHS’ grid isn’t exactly inexpensive. Still, Reter refuses to speculate
about the cost of achieving similar performance using only one very large
system. In the first place, NIEHS probably wouldn’t be able to purchase an
Intel-based system large enough to match the performance of its 32-system
grid. Instead, the research institute would have to invest in larger and
more expensive systems from vendors such as Hewlett-Packard Co., IBM
Corp., or Sun Microsystems Inc.
As a result of this experience, Reter says he’s now a grid computing
enthusiast. He acknowledges that the technology isn’t a good choice for
all applications (see accompanying article), but says that for
organizations that support data mining or analytic applications that
exploit large datasets, grid computing is the way to go.
“I think you’re looking at a revolutionary way of analyzing large
amounts of data in a way that’s just not practical otherwise,” he
comments. “I know for us, with the micro array of air quality data
and genetic data that our scientists are looking at, you’re definitely
looking at very large amounts of data. How else are you going to
economically analyze it all?”
As for ROI, says Reter, it’s a no-brainer. To the extent that the SAS
grid furthers cancer research even one iota, he observes, it will more
than have paid for itself. Because the grid has so drastically ratcheted
up the performance of some of the NIEHS’ key cancer research applications,
it has more than delivered the goods. “It’s amazing the power that grid
computing has given us at such a reduced cost,” he concludes.
BI Backgrounder: Are Grids and BI Set to
Converge?
Chances are that you’ve heard of grid computing, although there’s
also an equally good chance that you haven’t given much thought to
its potential usefulness for business intelligence (BI) or data
mining. The irony, of course, is that since their inception, grid
computing technologies have been used extensively to support
applications such as data mining and analytics.
After all, grid computing has had its proving ground in a variety
of highly successful public computing projects, including the
University of California Berkeley’s SETI@Home distributed computing
project (see http://www.setiathome.com for details). When you think
about it, SETI@Home is nothing less than a massive distributed data
mining effort, parceling out data collected by radio telescopes to
hundreds of thousands of users who download a portion and analyze it
on a client module that runs on their computer when it’s not in use.
Once complete, the results of their analysis are uploaded to a
centralized server.
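The work-unit model described above can be sketched in a few lines of Python. This is an illustration of the general pattern only, not SETI@Home’s actual client or protocol: the Coordinator class, its checkout and upload methods, and the analyze function (which simply picks the strongest sample in a unit) are all hypothetical.

```python
from collections import deque


class Coordinator:
    """Central server: parcels data into work units and collects results."""

    def __init__(self, samples, unit_size):
        self.pending = deque(
            (unit_id, samples[start:start + unit_size])
            for unit_id, start in enumerate(range(0, len(samples), unit_size))
        )
        self.results = {}

    def checkout(self):
        """Hand out the next work unit, or None when none remain."""
        return self.pending.popleft() if self.pending else None

    def upload(self, unit_id, result):
        """Record a client's result for one work unit."""
        self.results[unit_id] = result


def analyze(unit):
    """Hypothetical analysis: report the strongest sample in the unit."""
    return max(unit)


def run_idle_client(coordinator):
    """Download, analyze, and upload units until the queue is empty."""
    completed = 0
    while (work := coordinator.checkout()) is not None:
        unit_id, unit = work
        coordinator.upload(unit_id, analyze(unit))
        completed += 1
    return completed


if __name__ == "__main__":
    server = Coordinator([3, 1, 4, 1, 5, 9, 2, 6], unit_size=3)
    run_idle_client(server)
    print(server.results)
```

Because each unit is analyzed independently, any number of clients could call run_idle_client against the same coordinator. A real deployment would also need locking, retries for units that never come back, and validation of uploaded results, none of which is shown here.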
At least one vendor (SAS) argues that grid computing is a natural
fit for BI applications other than data mining, many of which work
with large data sets, as well. More to the point, says Tho Nguyen,
program director of data integration with SAS, many customers are
already exploiting technologies similar to grid computing and may
not even realize it.
As a result, when customers ask about grid computing, Nguyen
says, “We try to explain to them that you’ve already been doing it,
either by parallel processing or distributing workloads across a
network. These things have been utilized already, but grid computing
gives [customers] a more efficient way to utilize them. We’re
finding that some customers are coming to us because they understand
the potential value here.”
How does grid computing enable greater efficiencies than parallel
or distributed processing, both of which have been mainstays of data
mining for quite some time? For starters, grid computing isn’t a
strictly server-centric proposition. Instead, it proposes to exploit
the un-utilized or under-utilized power of all computing resources
in a network environment—including desktop PCs.
The typical desktop PC has changed a lot over the last 20 years:
The term “PC” may once have described a low-end machine powered by
an 8-MHz 8088 or 80286 microprocessor and outfitted with scanty
memory resources, but today’s “PC” is more properly an entry-level
server. That’s because it often sports a 1-, 2- or even 3-GHz
processor, hundreds of gigabytes of hard disk storage
and, frequently, a gigabyte or more of memory under its hood.
SAS’ Nguyen says that the success of initiatives such as
SETI@Home has demonstrated that idle processing power in client
workstations can be exploited in a grid. In an enterprise grid,
where workstations are connected on dedicated internal networks and
aren’t subject to the vicissitudes of the Internet, he suggests,
this value proposition is even stronger. “There’s an opportunity
there to take advantage of that computing horsepower, which is
underutilized during the day and which is typically unutilized
during off hours,” he argues.
SAS is pushing this argument with SAS/Connect, in spite of the
fact that it has only publicly trumpeted one customer win, the
National Institute of Environmental Health Sciences (NIEHS), which
exploits a dedicated grid of 32 connected servers, instead of idle
client workstations. (See our Case Study for details.)
In fact, says Roy Reter, IT security officer and systems
administrator with the NIEHS, the idea of tapping under-utilized
client processing power—while certainly intriguing—probably isn’t a
good fit for his organization. “A couple of us in the systems
administration group have thought about that, but right now, it’s
kind of hard to do, due to the fact that the scientists do their
work around here, science goes on around here 24 hours a day, five
days a week,” he concedes.
Nevertheless, says Nguyen, there’s an opportunity for many
customers to use a technology like SAS/Connect to find idle computer
resources and put them to work. “[SAS/Connect] enables the grid
computing technology by identifying the computers in the network and
going out there and using them,” he explains. “We’re offering this
to customers who have a need today, but we plan to evolve it and add
more intelligence to it within probably the next six to twelve
months, working with existing customers as well as potential
customers to really identify what features they most want to see.”
Where’s the Market? Market research firm
Insight Research recently projected that worldwide grid spending
will grow by almost 2000 percent over the next five years, from $250
million this year to almost $5 billion by 2008. Although no
projections are available, it’s likely that demand for BI or data
mining grid solutions will account for a very small percentage of
that total.
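For readers who want to check the projection, the short Python sketch below works through the growth arithmetic. It treats the endpoints as exactly $250 million and $5 billion (the article says “almost” $5 billion) and assumes five annual compounding periods; both are simplifying assumptions, and the function names are illustrative.

```python
def total_growth_pct(start, end):
    """Percentage increase from start to end."""
    return (end / start - 1.0) * 100.0


def compound_annual_growth_rate(start, end, years):
    """Constant yearly growth rate that turns start into end over years."""
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    start_musd, end_musd = 250.0, 5000.0  # spending in millions of dollars
    print(total_growth_pct(start_musd, end_musd))
    print(round(compound_annual_growth_rate(start_musd, end_musd, 5), 3))
```

Under those assumptions the increase is 1,900 percent, in line with “almost 2000 percent,” and it corresponds to spending growing by a bit more than 80 percent a year.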
Still, SAS isn’t alone in talking up a potential convergence of
BI and grid computing. A couple of grid computing pure players—Avaki
Corp. and Platform Computing Inc.—have successfully marketed
grid-based BI solutions to Fortune 1000 stalwarts Pfizer Inc. and
Advanced Micro Devices Inc.
Avaki, for example, ships Data Grid 4.0, which it positions as a
data aggregation platform for distributed environments. More
precisely, says Craig Muzilla, vice president of marketing and
strategy with Avaki, Data Grid 4.0 is a mature solution for
enterprise information integration (EII). “We first came out with a
J2EE-based product in the fall of last year, and that focused on data
problems, [such as] how do you provision unstructured data or flat
file data across an organization,” he explains. “Now, with [Data
Grid] 4.0, we’ve added relational capabilities, so that you can set
up an SQL statement or a stored procedure and bring that data into
the grid, cache that, and do manipulation or aggregation of the
data.”
Why on earth would an organization choose to exploit grid
computing to further its EII efforts? For the simple reason, says
Muzilla, that grid vendors have already solved many of the security
and provisioning issues that the EII point players are only now
starting to tackle. “Using a grid, you can give local data owners
the chance to manage their resources without going to a central
administrator to manage security and provisioning.”
Excepting SAS, traditional BI players have been slow to warm to
grid computing. The opposite has been the case in the grid
community, where pure play Platform Computing last year established
an original equipment manufacturer (OEM) relationship with BI
powerhouse Cognos Inc., under the terms of which it agreed to OEM
Cognos’ PowerPlay OLAP tool along with Cognos’ Upfront portal.
Platform markets a grid solution for corporate performance
management (CPM) called Platform Intelligence.
Analysts are intrigued by a possible convergence of BI and grid
computing but suggest that grids aren’t ideal for most or even many
BI applications. Says Doug Laney, vice president and director of
technology research service with consultancy META Group: “The need
to distribute data and then hit the data hard with a lot of CPUs is
decidedly analytic in nature, but doesn’t really follow into an
operational scenario as much. If you look at the highly publicized
scenarios, it’s strictly for analytic purposes, but only certain
kinds of analytics lend themselves to being chunked like that.”
That’s the rub, says Mike Schiff, a principal with data
warehousing consultancy MAS Strategies. Computational grids grew out
of the academic and theoretical computing spaces, and haven’t caught
on as quickly for conventional business applications, which
typically deal with transactional or operational data rather than
static or very large datasets. As a result, Schiff says that BI
grids are a “future technology” that most shops aren’t seriously
evaluating right now.
SAS, for its part, claims that it has had some success selling BI
grids to non-traditional customers. For example, says Nguyen, a
major financial institution has deployed a SAS grid to manage
millions of credit card customers and to mine petabytes—yes,
petabytes—of historical data. Like the NIEHS, this financial
institution is searching for patterns, trends, or other anomalies
across literally years of historical information.
Nguyen believes that deep analysis on historical data is one
application that could broaden the appeal of BI grids. “What [a SAS
customer that is a] financial institution as well as the NIEHS is
doing is looking at years and years of data. They’re collecting back
to five years ago, trying to see if there are some trends, some
anomalies, things like that,” he explains. “Most of these customers
have terabytes of data, but I am anticipating that it will
eventually escalate to petabytes. It’s just not practical to keep
all of this [data] in a data warehouse.”
Even some grid advocates have their doubts, however. “I’m not
convinced that there are enormous unsolved warehousing problems, [I
think] that there’s less really relevant unsolved data problems than
people think,” says Brian MacDonald, a product marketing manager
with grid pure play Platform Computing. “I think that some people
believe that what they just need is an enormous data warehouse, and
if only they had better ETL tools that could help. It’s not clear
that there’s as much of a demand for that, although if you’re going
to do it, it would make sense to use a grid.”
META Group’s Laney believes that there’s a potential
market—albeit a small one—for BI grids of this kind. “It’s not
unreasonable to think that there’s a lot of untapped computing power
during the dark of the night, so if there’s a way to tap that, then
somebody is going to do it,” he agrees. “To the extent that some
analytic processes require background processing, non-interactive
processes that look for patterns, look for trends, those are the
kinds of solutions that lend themselves to grid computing.”
Stephen Swoyer is a technology writer based in Athens, GA.
swoyers@percipient-analytics.com