Understanding Apache Hadoop

Apache Hadoop is an open-source software framework that provides massive data storage and distributed processing of large data sets. The framework also provides the tools needed to develop and run software applications.
Hadoop runs on the Linux operating system. It is available from the Apache Software Foundation or from vendors, such as Cloudera and Hortonworks, that offer their own commercial Hadoop distributions.
You use Hadoop to store and process big data in a distributed fashion on large clusters of commodity hardware. Hadoop is an attractive technology for a number of reasons:
  • Hadoop provides a low-cost alternative for data storage.
  • HDFS (the Hadoop Distributed File System) is well suited to distributed storage and processing on commodity hardware. It is fault-tolerant, scalable, and simple to expand. HDFS manages each file as a set of equally sized blocks that are replicated across the machines in the Hadoop cluster to provide fault tolerance (see the sketch that follows this list).
  • Data replication provides failure resiliency.
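To illustrate block-based storage, the following minimal Java sketch uses the HDFS FileSystem API to report a file's block size and replication factor. The path /user/example/data.csv is a hypothetical placeholder, and the sketch assumes that the cluster's core-site.xml and hdfs-site.xml are available on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockInfo {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml and hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path; replace with a file that exists in your cluster.
            Path file = new Path("/user/example/data.csv");
            FileStatus status = fs.getFileStatus(file);

            // HDFS stores the file as fixed-size blocks, each replicated
            // across several machines for fault tolerance.
            System.out.println("Block size (bytes): " + status.getBlockSize());
            System.out.println("Replication factor: " + status.getReplication());
        }
    }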
Hadoop consists of a family of related components that is referred to as the Hadoop ecosystem. The core components are HDFS for storage and MapReduce for distributed processing, and many other components build on them. In addition, Hadoop software and services vendors such as Cloudera and Hortonworks offer their own proprietary software.
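As a concrete view of the MapReduce component, the following sketch is the canonical word-count job written against the org.apache.hadoop.mapreduce API. It is a minimal illustration rather than a tuned production job; the input and output paths are taken from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts emitted for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a JAR and submitted with the hadoop jar command, the map and reduce tasks of this job run in parallel across the machines of the cluster, reading their input splits from HDFS.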