Apache Hadoop is an open-source software framework, written in Java, for distributed storage and distributed processing of very large data sets (Big Data) on computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or whole racks of machines) are commonplace and should therefore be handled automatically in software by the framework.
In this tutorial I will describe the required steps for setting up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System.
Benefits of using Hadoop
The architecture of Hadoop allows you to scale your hardware as and when you need to. New nodes can be added incrementally without having to worry about the change in data formats or the handling of applications that sit on the file system.
One of the most important features of Hadoop is that it allows you to save enormous amounts of money by substituting cheap commodity servers for expensive ones. This is possible because Hadoop transfers the responsibility of fault tolerance from the hardware layer to the application layer.
This setup uses Ubuntu 16.04 LTS (Xenial Xerus) 64-bit Server Edition and Hadoop 2.7.2.
This guide describes the required steps for setting up a single-node Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu Linux. Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop's HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
Before we start, let's understand the meaning of the following terms:
A DataNode stores data in the Hadoop File System. A functional file system has more than one DataNode, with the data replicated across them.
The NameNode is the centrepiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
The NodeManager (NM) is YARN's per-node agent and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up to date with the ResourceManager (RM), overseeing containers' life-cycle management, monitoring the resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and running auxiliary services which may be exploited by different YARN applications.
ResourceManager (RM) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NodeManagers (NMs) and the per-application ApplicationMasters (AMs).
The Secondary NameNode's whole purpose is to perform checkpoints of the HDFS namespace. It is just a helper node for the NameNode, not a standby NameNode.
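Once the cluster is running (later in this tutorial), these daemons can be inspected from the command line. A small sketch, assuming the hdfs and yarn commands are on your PATH:
# Summarise HDFS: live DataNodes, capacity and replication status
$ hdfs dfsadmin -report
# List the NodeManagers registered with the ResourceManager
$ yarn node -list
# Show which Hadoop daemons are running as Java processes on this machine
$ jps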
Hadoop requires Java to be installed, so let's begin by installing Java. First, update the package source list:
$ sudo apt-get update
OpenJDK is the default Java version provided from a supported Ubuntu repository.
$ sudo apt-get install openjdk-8-jdk
These commands update the package information on your server and then install Java. Afterwards, run the following command to verify that Java has been installed:
$ java -version
If Java has been installed, this should display the version details as illustrated in the following output:
openjdk version "1.8.0_01-internal"
OpenJDK Runtime Environment (build 1.8.0_01-internal-b15)
OpenJDK 64-Bit Server VM (build 25.01-b15, mixed mode)
If you already have a Java JDK installed on your system, then you need not run the install command above.
Adding a dedicated Hadoop system user
$ sudo addgroup hadoop_group
$ sudo adduser --ingroup hadoop_group hduser1
This will add the user hduser1 and the group hadoop_group to the local machine. Add hduser1 to the sudo group:
$ sudo adduser hduser1 sudo
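To confirm the user and group were created as intended, you can check the group memberships (a quick sanity check, not part of the original steps):
# Should list hadoop_group as the primary group and sudo among the supplementary groups
$ id hduser1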
The Hadoop control scripts rely on SSH to perform cluster-wide operations. For example, there is a script for stopping and starting all the daemons in the cluster. To work seamlessly, SSH needs to be set up to allow password-less login for the Hadoop user from machines in the cluster. The simplest way to achieve this is to generate a public/private key pair, which will be shared across the cluster.
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser1 user we created earlier.
We have to generate an SSH key for the hduser1 user:
$ su - hduser1
$ ssh-keygen -t rsa -P ''
Here, -P '' indicates an empty passphrase.
You have to enable SSH access to your local machine with this newly created key, which is done by the following command:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
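If password-less login later refuses to work, sshd is often strict about permissions on the key files; the following optional tightening (not part of the original instructions) is a common fix:
# sshd ignores keys whose directory or authorized_keys file is group/world writable
$ chmod 700 $HOME/.ssh
$ chmod 600 $HOME/.ssh/authorized_keys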
The final step is to test the SSH setup by connecting to the local machine with the hduser1 user. This step is also needed to save your local machine's host key fingerprint to the hduser1 user's known_hosts file.
$ ssh localhost
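For a non-interactive check of the password-less setup (a small sketch; the echoed message is arbitrary), BatchMode makes ssh fail instead of prompting for a password:
# Exits with an error rather than asking for a password if key-based login is broken
$ ssh -o BatchMode=yes localhost 'echo SSH to localhost works'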
If the SSH connection fails, we can try the following (optional):
- Enable debugging with ssh -vvv localhost and investigate the error in detail.
- Check the SSH server configuration in /etc/ssh/sshd_config. If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.
Start by switching to the hduser1 user:
$ su - hduser1
Download and extract the latest Hadoop binary release from hadoop.apache.org. I use the current stable version, Hadoop 2.7.2.
$ wget http://ftp.tc.edu.tw/pub/Apache/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
$ sudo tar -xvzf hadoop-*.tar.gz -C /usr/local/ && sudo mv /usr/local/hadoop-* /usr/local/hadoop
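The archive was extracted as root, so the files under /usr/local/hadoop are owned by root. Handing them over to the dedicated Hadoop user avoids permission problems later; this extra step is an assumption on my part, not spelled out in the original instructions:
# Give the Hadoop installation to the dedicated user and group created earlier
$ sudo chown -R hduser1:hadoop_group /usr/local/hadoop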
Setup Environment Variables for Hadoop
Add the following entries to ~/.bashrc to set the Hadoop-related environment variables:
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Add Hadoop bin/ and sbin/ directories to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Reload .bashrc to make the changes take effect:
$ source ~/.bashrc
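To confirm that HADOOP_HOME and PATH took effect, the hadoop command itself can be used (a quick check):
# Should print /usr/local/hadoop
$ echo $HADOOP_HOME
# Should report Hadoop 2.7.2 if the PATH above is correct
$ hadoop version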
Hadoop also needs to know where Java is installed. Open $HADOOP_HOME/etc/hadoop/hadoop-env.sh and change the JAVA_HOME line to:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
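To see what the command substitution resolves to, it can be run on its own; on Ubuntu 16.04 with openjdk-8 it typically points somewhere under /usr/lib/jvm:
# Resolve the real JDK path from the java binary and strip the trailing bin/java
$ readlink -f /usr/bin/java | sed "s:bin/java::"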
Now we create the directory Hadoop will use for temporary data and set the required ownerships and permissions:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser1:hadoop_group /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
The last line gives read and write permissions on the /app/hadoop/tmp directory to the hduser1 user (mode 750 grants full access to the owner and read/execute access to the hadoop_group group). Error: if you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the NameNode.
Paste the following between the <configuration> ... </configuration> tags in $HADOOP_HOME/etc/hadoop/core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority
  determine the FileSystem implementation. The uri's scheme determines the config
  property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's
  authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
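Once the file is saved, the effective value can be read back with the hdfs getconf tool, a quick check that the XML is being picked up:
# fs.defaultFS is the current name for fs.default.name; should print hdfs://localhost:54310
$ hdfs getconf -confKey fs.defaultFS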
Next, create the mapred-site.xml file from the provided template and open it for editing:
$ sudo cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml && sudo vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following between the <configuration> ... </configuration> tags:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local",
  then jobs are run in-process as a single map and reduce task.</description>
</property>
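The steps above stop here, but on Hadoop 2.x it is common to also tell MapReduce to run on YARN. If you want jobs scheduled by the ResourceManager rather than the local runner, the following property (an addition of mine, not part of the original configuration) can be placed in the same file:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>The runtime framework for executing MapReduce jobs: local, classic or yarn.</description>
</property>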
Finally, add the following between the <configuration> ... </configuration> tags of $HADOOP_HOME/etc/hadoop/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be
  specified when the file is created. The default is used if replication is not
  specified in create time.</description>
</property>
Formatting the HDFS filesystem via the NameNode
Format the filesystem (this simply initializes the directory specified by the dfs.name.dir property) by running the command:
$ hdfs namenode -format
Starting the single-node cluster
Run the command:
$ start-dfs.sh && start-yarn.sh
This will start up a NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager on the machine. Verify this by typing in the following command:
$ jps
4704 NameNode
5573 Jps
5015 SecondaryNameNode
5160 ResourceManager
5273 NodeManager
4842 DataNode
If you see output similar to the above, you now have a functional instance of Hadoop running on your server.
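As a final smoke test, you can run one of the example MapReduce jobs that ship with Hadoop against a small file in HDFS. A minimal sketch (the HDFS paths and the choice of /etc/hosts as input are arbitrary, not part of the original steps):
# Create a home directory for hduser1 in HDFS and upload a small text file
$ hdfs dfs -mkdir -p /user/hduser1/input
$ hdfs dfs -put /etc/hosts /user/hduser1/input/
# Run the bundled wordcount example; the output directory must not exist yet
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /user/hduser1/input /user/hduser1/output
# Inspect the word counts
$ hdfs dfs -cat /user/hduser1/output/part-r-00000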
Run stop-dfs.sh && stop-yarn.sh (or stop-all.sh) to stop all the daemons running on the machine:
$ stop-dfs.sh && stop-yarn.sh
Hadoop Web Interfaces
http://localhost:50070 - web UI of the NameNode daemon
http://localhost:50090 - web UI of the SecondaryNameNode
http://localhost:50075 - web UI of the DataNode daemon
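On a headless server the same interfaces can be probed with curl, a quick check that the daemons are listening (the ports are the Hadoop 2.x defaults):
# An HTTP 200 response code indicates the NameNode web UI is up
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/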
A few points to check for this single-node setup:
- The masters and slaves files should contain localhost.
- In /etc/hosts, the IP of the system should be given with its alias (for this setup, localhost).
- Set the Java home path in hadoop-env.sh, as described above.
Related article: Setup Hadoop on Ubuntu (Multi-Node Cluster)