Apache Hadoop is an open-source software framework for the distributed storage and processing of large data sets. The Hadoop framework provides the tools needed to develop and run software applications. Hadoop runs on the Linux operating system. Hadoop is available from the Apache Software Foundation or from vendors that offer their own commercial Hadoop distributions, such as Cloudera and Hortonworks.
You use Hadoop to store and process big data in a distributed fashion on large clusters of commodity hardware. Hadoop is an attractive technology for several reasons:
- Hadoop provides a low-cost alternative for data storage.
- HDFS is well suited to distributed storage and processing on commodity hardware. It is fault-tolerant, scalable, and simple to expand. HDFS manages files as blocks of equal size, which are replicated across the machines in a Hadoop cluster to provide fault tolerance, as the sketch after this list illustrates.
- Data replication provides resiliency against node failures.
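To make the block and replication behavior concrete, here is a minimal sketch that writes a file to HDFS through Hadoop's Java `FileSystem` API with an explicit replication factor and block size. The NameNode URI, the file path, and the chosen values (replication of 3, a 128 MB block size) are illustrative assumptions, not requirements:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
    public static void main(String[] args) throws Exception {
        // Assumes a reachable HDFS NameNode; this URI is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/replication-demo.txt");

            // Create the file with an explicit replication factor and block
            // size. HDFS splits the file into blocks of this size and stores
            // each block on three DataNodes in the cluster.
            short replication = 3;
            long blockSize = 128L * 1024 * 1024; // 128 MB
            try (FSDataOutputStream out =
                     fs.create(file, true, 4096, replication, blockSize)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Report the replication factor HDFS recorded for the file.
            System.out.println("Replication: "
                + fs.getFileStatus(file).getReplication());
        }
    }
}
```

If a DataNode holding one of the blocks fails, the NameNode schedules new copies from the surviving replicas, which is the failure resiliency described above.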
Hadoop consists of a family of related components that are referred to as the Hadoop ecosystem. The ecosystem includes the core components HDFS (storage) and MapReduce (processing), along with many others. In addition, Hadoop software and services providers (such as Cloudera and Hortonworks) offer their own proprietary software.
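To illustrate the MapReduce component named above, here is the canonical word-count job written against Hadoop's Java MapReduce API. The mapper emits a count of 1 for each word, and the reducer sums the counts per word; the class names are the usual ones from this example, not part of Hadoop itself:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in each input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each distinct word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this class into a JAR and submit it to the cluster with the `hadoop jar` command, passing HDFS input and output paths as the two arguments.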