Hadoop Cluster setup using Ansible 🎯

Ansh@24
5 min read · Mar 22, 2021

👉Our Aim: Configure Hadoop and start the cluster services using an Ansible playbook.

🔑What is Ansible?

Ansible is an open-source software provisioning, configuration management, and application-deployment tool enabling infrastructure as code. It runs on many Unix-like systems and can configure both Unix-like systems and Microsoft Windows.

🔑What is an Ansible Playbook?

An Ansible playbook is a blueprint of automation tasks: complex IT actions executed with limited or no human involvement. Ansible playbooks are executed on a set, group, or classification of hosts, which together make up an Ansible inventory.

🔑What is Hadoop?

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

Let’s get hands-on with the task now.

We will be performing this task on RHEL8 running in VirtualBox.

🟥Let’s first install Ansible.

Ansible is built on top of Python, so we will use pip to install it:

#pip install ansible
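
Before writing the playbook, we also need an Ansible inventory, because the plays below target two host groups: namenode and datanode. A minimal sketch in YAML (the file name inventory.yml and the IP addresses are placeholders, not from this setup; use your own VMs' details):

# inventory.yml : hypothetical file name and IPs, replace with your own VMs
all:
  children:
    namenode:
      hosts:
        192.168.56.101:
    datanode:
      hosts:
        192.168.56.102:
        192.168.56.103:

You can pass it explicitly when running, e.g. ansible-playbook -i inventory.yml <playbook>.yml, or point the inventory setting in ansible.cfg at it.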

🟥Now we will write the Ansible playbook. Excited?!

👉Before commencing, let’s jot down what is to be achieved (a rough playbook skeleton follows this list):

1️⃣ Mounting the DVD on RHEL8

2️⃣ Configuring Yum Repo

3️⃣ JDK Installation for Hadoop

4️⃣ Hadoop Installation

5️⃣ NameNode Configuration

6️⃣ DataNode Configuration

7️⃣ Starting the Hadoop Daemon service
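
The task snippets in the next sections are shown without their play headers. Here is a rough skeleton of how they fit together in a single playbook; the file name hadoop.yml and the placeholder debug tasks are only my own illustration, the real tasks are shown step by step below:

# hadoop.yml : rough skeleton only, the actual tasks follow in the article
- hosts: namenode, datanode          # steps 1-4 + firewall, common to both groups
  gather_facts: no
  tasks:
    - name: placeholder for the common tasks
      debug:
        msg: "mount DVD, configure yum repos, install JDK and Hadoop, stop firewalld"

- hosts: namenode                    # step 5 + starting the NameNode daemon
  gather_facts: no
  tasks:
    - name: placeholder for the NameNode tasks
      debug:
        msg: "create dir, edit hdfs-site.xml and core-site.xml, format, start daemon"

- hosts: datanode                    # step 6 + step 7 for the DataNode
  gather_facts: no
  tasks:
    - name: placeholder for the DataNode tasks
      debug:
        msg: "create dir, edit hdfs-site.xml and core-site.xml, start daemon"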

📀Let’s see how we write the YAML for mounting the DVD (NameNode as well as DataNode):

- name: Mounting DVD
  mount:
    src: "/dev/cdrom"
    path: "/dvd"
    state: mounted
    fstype: "iso9660"

🎨Configuring the Yum repo (NameNode as well as DataNode):

- name: Local Repo
  yum_repository:
    name: Repo1
    description: "Local repo 1"
    baseurl: "file:///dvd/AppStream"   # points at the DVD mounted on /dvd
    gpgcheck: 0

- name: Local Repo setup
  yum_repository:
    name: Repo2
    description: "Local repo 2"
    baseurl: "file:///dvd/BaseOS"
    gpgcheck: 0

👉Installing the JDK (Java) (NameNode as well as DataNode):

- name: "Checking whether jdk exists or not"
  tags: jdk
  # shell (not command) because the check uses pipes
  shell: "rpm -q java | grep jdk1.8-2000.1.8.0_171-fcs.x86_64 | grep -v grep"
  changed_when: false
  ignore_errors: yes
  register: java_install_status

- name: "JDK installation"
  tags: jdk
  command: "rpm -ivh jdk-8u171-linux-x64.rpm"
  ignore_errors: yes
  when: java_install_status.rc == 1

👉Installing Hadoop (NameNode as well as DataNode):

- name: "Checking whether hadoop exists or not"
  tags: hadoop
  # shell (not command) because the check uses pipes
  shell: "rpm -q hadoop | grep hadoop-1.2.1-1.x86_64 | grep -v grep"
  changed_when: false
  ignore_errors: yes
  register: hadoop_status

- name: "Hadoop installation"
  tags: hadoop
  command: "rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force"
  ignore_errors: yes
  when: hadoop_status.rc == 1

👉Stopping the firewall so the nodes can connect (NameNode as well as DataNode):

- name: stop firewalld
  shell: "systemctl stop firewalld"
  ignore_errors: yes
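
The same thing can also be done with Ansible’s built-in service module instead of shelling out to systemctl; a minimal alternative sketch (assuming firewalld is present on the box):

- name: stop firewalld
  service:
    name: firewalld
    state: stopped
  ignore_errors: yes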

🛠Hadoop NameNode Configuration:

👉Creating the NameNode directory:

- hosts: namenode
  gather_facts: no
  vars_prompt:
    - name: namedir
      private: no
      prompt: "Enter namenode dir name e.g. /{dir}"
  tasks:
    - name: Creating namenode dir
      file:
        state: directory
        path: "{{ namedir }}"
      ignore_errors: True
      register: directory

👉 Configuring hdfs-site.xml in /etc/hadoop/

- name: configuring hdfs-site.xml
  lineinfile:
    path: "/etc/hadoop/hdfs-site.xml"
    insertafter: "<configuration>"
    line: "<property>\n\t <name>dfs.name.dir</name>\n\t <value>{{ namedir }}</value>\n</property>"

👉Configuring core-site.xml in /etc/hadoop/

- name: configuring core-site.xml
  lineinfile:
    path: "/etc/hadoop/core-site.xml"
    insertafter: "<configuration>"
    line: "<property>\n\t <name>fs.default.name</name>\n\t <value>hdfs://0.0.0.0:9001</value>\n</property>"
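
To visualize what these two lineinfile tasks produce, here is roughly how the files end up on the NameNode (assuming /nn was entered at the prompt; the surrounding <configuration> tags already exist in the stock files):

<!-- /etc/hadoop/hdfs-site.xml (roughly) -->
<configuration>
<property>
   <name>dfs.name.dir</name>
   <value>/nn</value>
</property>
</configuration>

<!-- /etc/hadoop/core-site.xml (roughly) -->
<configuration>
<property>
   <name>fs.default.name</name>
   <value>hdfs://0.0.0.0:9001</value>
</property>
</configuration>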

👉Formatting the NameNode directory:

- name: formatting namenode dir
  # shell (not command) because of the pipe
  shell: "echo Y | hadoop namenode -format -force"
  ignore_errors: yes
  when: directory.changed == true

👉Starting the hadoop-daemon NameNode service:

- name: "Hadoop daemon service status check!"
  tags: namenode
  shell: "jps | grep NameNode | grep -v grep"
  ignore_errors: yes
  register: namenode_status

- name: starting hadoop-daemon namenode service
  tags: namenode
  ignore_errors: yes
  shell: "hadoop-daemon.sh start namenode"
  when: namenode_status.rc == 1

🛠DataNode Configuration:

👉Creating the DataNode directory:

- hosts: datanode
  gather_facts: no
  vars_prompt:
    - name: datadir
      private: no
      prompt: "Enter datanode dir name e.g. /{dir}"
  tasks:
    - name: "Creating datanode dir"
      file:
        state: directory
        path: "{{ datadir }}"
      ignore_errors: True

👉Configuring hdfs-site.xml:

- name: configuring hdfs-site.xml
  lineinfile:
    path: "/etc/hadoop/hdfs-site.xml"
    insertafter: "<configuration>"
    line: "<property>\n\t <name>dfs.data.dir</name>\n\t <value>{{ datadir }}</value>\n</property>"

👉Configuring core-site.xml

- name: configuring core-site.xml
  lineinfile:
    path: "/etc/hadoop/core-site.xml"
    insertafter: "<configuration>"
    line: "<property>\n\t <name>fs.default.name</name>\n\t <value>hdfs://{{ groups['namenode'][0] }}:9001</value>\n</property>"

👉Starting the DataNode:

- name: checking datanode daemon status
  tags: datanode
  shell: "jps | grep DataNode | grep -v grep"
  ignore_errors: yes
  register: datanode_status

- name: starting data node daemon
  tags: datanode
  command: "hadoop-daemon.sh start datanode"
  ignore_errors: yes
  when: datanode_status.rc == 1

Here’s the GitHub link for the playbook: Hadoop_Ansible

📚Let’s see how it works now:

To check whether the syntax is fine:
#ansible-playbook --syntax-check <playbook>.yml

To run the playbook:
#ansible-playbook <playbook>.yml

(Screenshots: JDK installation; Hadoop installation and NameNode directory creation; formatting the NameNode directory and starting services; DataNode configuration.)

Bravo! The NameNode is started and one DataNode is connected to it.

To check whether the NameNode has launched:
#jps

To list the connected DataNodes:
#hadoop dfsadmin -report

(Screenshot: NameNode report)

The DataNode is also configured and launched:

(Screenshot: DataNode)

Web UI view:

In your browser, open:
<namenodeip>:50070

That’s how we succeeded in configuring a Hadoop cluster using Ansible.

Ansible has made our task easy. Rather than configuring the cluster manually, automating it lets us configure as many clusters as we need in much less time.

Keep Sharing ……🤗

Happy Learning …🧮
