
Create a Hadoop cluster

You can create a Hadoop cluster from several instances of the Bitnami Hadoop stack, as long as the Hadoop daemons are properly configured.

Typical Hadoop clusters are divided into the following node roles:

  • Master nodes: NameNode and ResourceManager servers, usually running one of these services per node.
  • Worker nodes: Acting as both DataNode and NodeManager on the same node.
  • Service nodes: Services such as the Application Timeline server, the Web App Proxy server and the MapReduce Job History server, running on the same node.
  • Client nodes: The machines from which Hadoop jobs are submitted; these typically have Hadoop Hive installed.

Once you have decided on an architecture for your cluster, the Hadoop services running on each node must be able to communicate with each other. Each service operates on different ports, so when creating the cluster, ensure that the service ports are open on each node.

IMPORTANT: Hadoop requires hostnames/IP addresses that are configured on your server's network interfaces. This typically means that you won't be able to use a public IP address; use a private IP address instead.
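Before proceeding, it can help to confirm that the addresses you plan to use are in fact private. A minimal Python sketch (the example addresses are the ones used throughout this guide):

```python
import ipaddress
import socket

def is_private_ip(addr: str) -> bool:
    """Return True if addr (an IP address or resolvable hostname) is private."""
    ip = socket.gethostbyname(addr)  # resolves hostnames; IP literals pass through
    return ipaddress.ip_address(ip).is_private

# Example addresses from this guide
for node in ["192.168.1.2", "192.168.1.3", "192.168.1.4", "192.168.1.5"]:
    print(node, "private" if is_private_ip(node) else "PUBLIC - will not work")
```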

  • Stop all the services in the nodes by running the following command in each node:

    $ sudo /opt/bitnami/ctlscript.sh stop
    
  • NameNode: Save the IP address of the node that will act as the NameNode. In this example, we will suppose that the IP address of the chosen NameNode is 192.168.1.2.

    • Change the fs.defaultFS property in the /opt/bitnami/hadoop/etc/hadoop/core-site.xml file, and set its value to the full HDFS URI to the node which will act as the NameNode:

      <property>
          <name>fs.defaultFS</name>
          <value>hdfs://192.168.1.2:8020</value>
      </property>
      
    • Change the value of the dfs.namenode.http-address property in the /opt/bitnami/hadoop/etc/hadoop/hdfs-site.xml file to include the proper IP address:

      <property>
          <name>dfs.namenode.http-address</name>
          <value>192.168.1.2:9870</value>
      </property>
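All of the following steps add entries of this same <property>/<name>/<value> shape. If you generate configuration programmatically, a small hypothetical helper can reduce copy-paste errors:

```python
from xml.sax.saxutils import escape

def hadoop_property(name: str, value: str) -> str:
    """Render a single Hadoop configuration property as an XML snippet."""
    return (
        "<property>\n"
        f"    <name>{escape(name)}</name>\n"
        f"    <value>{escape(value)}</value>\n"
        "</property>"
    )

print(hadoop_property("dfs.namenode.http-address", "192.168.1.2:9870"))
```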
    
  • Secondary NameNode: Change the dfs.namenode.secondary.http-address property in the /opt/bitnami/hadoop/etc/hadoop/hdfs-site.xml file. Set the value to the appropriate IP address for the Secondary NameNode. For example, if the IP address of the chosen Secondary NameNode is the same as the NameNode's, the configuration file will contain the following:

    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>192.168.1.2:9868</value>
    </property>
    
  • ResourceManager: Add the property yarn.resourcemanager.hostname to the /opt/bitnami/hadoop/etc/hadoop/yarn-site.xml file. Set the value to the IP address of the node which will act as the ResourceManager. For example, if the IP address of the chosen ResourceManager is 192.168.1.3, the configuration file will contain the following:

    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>192.168.1.3</value>
    </property>
    
  • DataNode: Edit the dfs.datanode.address, dfs.datanode.http.address and dfs.datanode.ipc.address properties in the /opt/bitnami/hadoop/etc/hadoop/hdfs-site.xml file. Set the values to the IP address and port of the node which will act as the DataNode. For example, if the IP address of the chosen DataNode server is 192.168.1.4 and it listens on the default ports, the configuration file will contain the following:

    <property>
        <name>dfs.datanode.address</name>
        <value>192.168.1.4:9866</value>
    </property>
    <property>
        <name>dfs.datanode.http.address</name>
        <value>192.168.1.4:9864</value>
    </property>
    <property>
        <name>dfs.datanode.ipc.address</name>
        <value>192.168.1.4:9867</value>
    </property>
    
  • NodeManager: Add the property yarn.nodemanager.hostname to the /opt/bitnami/hadoop/etc/hadoop/yarn-site.xml file. Set the value to the IP address of the node which will act as the NodeManager. For example, if the IP address of the chosen NodeManager is 192.168.1.4, the same as the DataNode, the configuration file will contain the following:

    <property>
        <name>yarn.nodemanager.hostname</name>
        <value>192.168.1.4</value>
    </property>
    
  • Timeline server: Add the property yarn.timeline-service.hostname to the /opt/bitnami/hadoop/etc/hadoop/yarn-site.xml file. Set the value to the IP address of the node which will act as the Timeline server. For example, if the IP address of the chosen Timeline server is 192.168.1.5 and it listens to the default port, the configuration file will contain the following:

    <property>
        <name>yarn.timeline-service.hostname</name>
        <value>192.168.1.5</value>
    </property>
    
  • JobHistory server: Edit the mapreduce.jobhistory.address, mapreduce.jobhistory.admin.address and mapreduce.jobhistory.webapp.address properties in the /opt/bitnami/hadoop/etc/hadoop/mapred-site.xml file. Set the values to the IP address of the node which will act as the JobHistory server, and the corresponding ports for each service. For example, if the IP address of the chosen JobHistory server is 192.168.1.5 and the services listen on the default ports, the configuration file will contain the following:

    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>192.168.1.5:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.admin.address</name>
        <value>192.168.1.5:10033</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>192.168.1.5:19888</value>
    </property>
    
  • Copy these configuration files to every node in the cluster. The configuration must be the same in all of them.

  • Configure start/stop scripts: Once you have decided the architecture and applied configuration files, you must disable unnecessary services in each of the nodes. To do so:

    • Navigate to /opt/bitnami/hadoop/scripts on each server and determine if any of the startup scripts are not needed. If that is the case, rename them to something different. For instance, in order to disable all Yarn services, run the following command:

      $ sudo mv ctl-yarn.sh ctl-yarn.sh.disabled
      
    • If you want to selectively disable some daemons for a specific service, edit the appropriate start/stop script, look for the line listing the service's daemon classes (HADOOP_YARN_DAEMONS in the Yarn script) and remove the ones you want to disable from the list. For instance, to selectively disable Yarn's NodeManager, you would change it from this:

      HADOOP_YARN_DAEMONS="$HADOOP_YARN_RESOURCEMANAGER_CLASS $HADOOP_YARN_NODEMANAGER_CLASS"
      

      To this:

      HADOOP_YARN_DAEMONS="$HADOOP_YARN_RESOURCEMANAGER_CLASS"
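If you manage several nodes, this edit can be automated. A hypothetical Python sketch that removes one daemon class from a HADOOP_*_DAEMONS assignment (the variable names below mirror the example above and may differ in your scripts):

```python
import re

def disable_daemon(script_text: str, daemon_var: str) -> str:
    """Remove $daemon_var from any HADOOP_*_DAEMONS assignment in the script text."""
    def strip_token(match):
        var, daemons = match.group(1), match.group(2)
        kept = [d for d in daemons.split() if d != "$" + daemon_var]
        return '{}="{}"'.format(var, " ".join(kept))
    return re.sub(r'(HADOOP_\w+_DAEMONS)="([^"]*)"', strip_token, script_text)

line = 'HADOOP_YARN_DAEMONS="$HADOOP_YARN_RESOURCEMANAGER_CLASS $HADOOP_YARN_NODEMANAGER_CLASS"'
print(disable_daemon(line, "HADOOP_YARN_NODEMANAGER_CLASS"))
```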
      
  • Disable Apache on nodes where it is not necessary: Apache is only used as a proxy to the Yarn ResourceManager. If a node does not run the Yarn ResourceManager, Apache can be disabled on it:

    $ sudo mv /opt/bitnami/apache2/scripts/ctl.sh /opt/bitnami/apache2/scripts/ctl.sh.disabled
    
  • On each node, start the services:

    $ sudo /opt/bitnami/ctlscript.sh start
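After starting the services, you can confirm that each daemon is listening on its expected port. A minimal sketch using Python's socket module (the hosts and ports match the examples above; adjust them to your own layout):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example checks, matching the addresses used in this guide
checks = [
    ("192.168.1.2", 8020),   # NameNode RPC
    ("192.168.1.2", 9870),   # NameNode web UI
    ("192.168.1.4", 9866),   # DataNode data transfer
    ("192.168.1.5", 19888),  # JobHistory web UI
]
for host, port in checks:
    print(host, port, "open" if port_open(host, port, timeout=0.5) else "CLOSED")
```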
    
Last modification September 6, 2018