How To Set Up a Hadoop Cluster on AWS EC2

Rohit Raut
3 min read · Oct 25, 2020

What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

What is AWS?

Amazon Web Services (AWS) is a subsidiary of Amazon providing on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered pay-as-you-go basis. These cloud computing web services provide a variety of basic abstract technical infrastructure and distributed computing building blocks and tools. One of these services is Amazon Elastic Compute Cloud (EC2), which allows users to have at their disposal a virtual cluster of computers, available all the time, through the Internet.

Let’s create the cluster:

✅ Launch one instance for the NameNode and as many as you want for DataNodes.

✅ Transfer the software to the instances.

✅ Install the software: Java and Hadoop.

✅ Configure one instance as the NameNode and the others as DataNodes.

Step 1: Launch instances for the cluster

AMI

Here I am launching instances with a RHEL AMI and allowing all traffic in the security group.

Security Group- Rules
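The same security-group rule can also be added from the AWS CLI. This is just a sketch: the group ID below is a placeholder, and opening all traffic to the world is fine for a demo but should be tightened for anything real.

```shell
# Allow all inbound traffic from anywhere (demo only!)
# sg-0123456789abcdef0 is a placeholder group ID -- use your own.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol -1 \
    --cidr 0.0.0.0/0
```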

Step 2: Transfer the software to the instances

WinSCP

Hadoop is built on Java, so Java must be installed before Hadoop. Here I am transferring both Java 1.8 and Hadoop to the instances using WinSCP.
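If you prefer the command line over WinSCP, scp works just as well. The key path, package file names, and IP below are assumptions; substitute your own.

```shell
# Copy the Java and Hadoop packages to the instance.
# mykey.pem, the .rpm file names, and the IP are placeholders.
scp -i mykey.pem jdk-8u171-linux-x64.rpm hadoop-1.2.1-1.x86_64.rpm \
    ec2-user@<instance-public-ip>:/home/ec2-user/
```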

Step 3: Install the software: Java and Hadoop.
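On a RHEL instance, the transferred RPM packages can be installed roughly like this. The exact file names are assumptions based on the Java 1.8 and Hadoop versions mentioned above.

```shell
# Install Java first, since Hadoop depends on it
rpm -i jdk-8u171-linux-x64.rpm

# Then install Hadoop (--force skips the dependency check,
# which can fail when Java was installed from Oracle's RPM)
rpm -i hadoop-1.2.1-1.x86_64.rpm --force
```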

Step 4: Configure one instance as the NameNode and the others as DataNodes.

Create new directories on the NameNode and the DataNodes where they will store their data.

Created on the NameNode:

mkdir /nn

Created on the DataNode:

mkdir /dn

Configure the hdfs-site.xml and core-site.xml files on both the NameNode and the DataNodes. In these files, we have to write the required properties.

For NameNode

hdfs-site.xml and core-site.xml

In core-site.xml, we have to add hdfs://<NameNode-IP>:<port> as the value of the fs.default.name property.
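As a sketch, the two files on the NameNode can look like this (Hadoop 1.x property names, matching the commands used below). Port 9001 is an assumption (9000 is equally common), and 0.0.0.0 makes the NameNode listen on all interfaces:

```xml
<!-- hdfs-site.xml on the NameNode -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>

<!-- core-site.xml on the NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>
```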

#Run these commands to format the NameNode and start the service
hadoop namenode -format
hadoop-daemon.sh start namenode

For DataNode

hdfs-site.xml and core-site.xml
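On each DataNode the files are analogous; here core-site.xml must point at the NameNode's IP rather than 0.0.0.0 (the IP below is a placeholder, and the port must match the one configured on the NameNode):

```xml
<!-- hdfs-site.xml on the DataNode -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>

<!-- core-site.xml on the DataNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<NameNode-IP>:9001</value>
  </property>
</configuration>
```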
hadoop-daemon.sh start datanode
#This command shows you the count of DataNodes connected to the NameNode and more information
hadoop dfsadmin -report

And with that, we have successfully set up the Hadoop cluster.

Thank you for reading!!😇😇
