
Hadoop for Bitnami Cloud Hosting

Description

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.

First steps with the Bitnami Hadoop Stack

Welcome to your new Bitnami application running on Bitnami Cloud Hosting! Here are a few questions (and answers!) you might need when first starting with your application.

What credentials do I need?

You need two sets of credentials:

  • The application credentials that allow you to log in to your new Bitnami application. These credentials consist of a username and password.
  • The server credentials that allow you to log in to your Bitnami Cloud Hosting server using an SSH client and execute commands on the server using the command line. These credentials consist of an SSH username and key.

What is the administrator username set for me to log in to the application for the first time?

Username: user

What SSH username should I use for secure shell access to my application?

SSH username: bitnami

What are the default ports?

A port is an endpoint of communication in an operating system that identifies a specific process or a type of service. Bitnami stacks include several services or servers that require a port.

IMPORTANT: Making this application's network ports public is a significant security risk. You are strongly advised to only allow access to those ports from trusted networks. If, for development purposes, you need to access from outside of a trusted network, please do not allow access to those ports via a public IP address. Instead, use a secure channel such as a VPN or an SSH tunnel. Follow these instructions to remotely connect safely and reliably.

Port 22 is the default port for SSH connections.

Bitnami opens some ports for the main servers. These are the ports opened by default: 80, 443.

What is the default configuration?

The stack provides a web panel to control the status of Hadoop. To access it, browse to the administration panel (see "How to access the administration panel?" below).

Once authenticated, you will see a page like this:

WebUI (screenshot)

Hadoop configuration files

The Hadoop configuration files are located at /opt/bitnami/hadoop/etc/hadoop, and the most relevant ones are:

  • /opt/bitnami/hadoop/etc/hadoop/hadoop-env.sh and /opt/bitnami/hadoop/etc/hadoop/yarn-env.sh: Configuration options for the scripts found in /opt/bitnami/hadoop/bin.
  • /opt/bitnami/hadoop/etc/hadoop/APP-site.xml: Site-specific configuration for each Hadoop service, where APP is the service name (for example, core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml).

Find more details about how to configure Hadoop settings in Hadoop's official documentation.

Hadoop log files

The Hadoop log files for the specific services are created in the following directories:

  • /opt/bitnami/hadoop/logs: NameNode, Secondary NameNode, DataNode, Timeline server, History server, NodeManager and Resource Manager.
  • /opt/bitnami/hadoop/hive/logs: Derby, HiveServer2, Metastore and WebHCat.
  • /opt/bitnami/hadoop/hive/hcatalog/var/log: HCatalog.
  • /opt/bitnami/hadoop/pig/logs: Pig.
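
For example, to inspect recent NameNode activity you can tail its log file. The exact file name includes the user and hostname, so the wildcard below is an assumption about the naming pattern:

$ tail -n 50 /opt/bitnami/hadoop/logs/hadoop-*-namenode-*.log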

Hadoop ports

Each daemon in Hadoop listens to a different port. The most relevant ones are:

  • ResourceManager:
    • Scheduler: 8030.
    • Resource Tracker: 8031.
    • Service: 8032.
    • Web UI: 8088.
  • NodeManager:
    • Localizer: 8040.
    • Web UI: 8042.
  • Timeline Server:
    • Service: 10200.
    • Web UI: 8188.
  • History Server:
    • Service: 10020
    • Admin: 10033.
    • Web UI: 19888.
  • NameNode:
    • Service: 8020.
    • Web UI: 9870.
  • Secondary NameNode:
    • Web UI: 9868.
  • DataNode:
    • Data Transfer: 9866.
    • Service: 9867.
    • Web UI: 9864.
  • Hive:
    • Derby DB: 1527.
    • HCat/Metastore: 9083.
    • Hiveserver2 Thrift: 10000.
    • Hiveserver2 Web UI: 10002.
    • WebHCat: 50111.

All of these Hadoop ports are closed by default. In order to access any of them, you have two options:

  • (Recommended) Create an SSH tunnel for accessing the port (refer to the FAQ for more information about SSH tunnels).
  • Open the port for remote access (refer to the FAQ for more information about opening ports).
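
For example, a minimal SSH tunnel to the NameNode Web UI (port 9870) could look like the following, where KEYFILE and SERVER-IP are placeholders for your SSH private key and your server's address. You can then browse to http://localhost:9870 on your local machine:

$ ssh -N -L 9870:127.0.0.1:9870 -i KEYFILE bitnami@SERVER-IP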

How to start or stop the services?

Each Bitnami stack includes a control script that lets you easily stop, start and restart services. The script is located at /opt/bitnami/ctlscript.sh. Call it without any service name arguments to start all services:

$ sudo /opt/bitnami/ctlscript.sh start

Or use it to restart a single service, such as Apache only, by passing the service name as an argument:

$ sudo /opt/bitnami/ctlscript.sh restart apache

Use this script to stop all services:

$ sudo /opt/bitnami/ctlscript.sh stop

Restart all services by running the script with the restart argument and no service name:

$ sudo /opt/bitnami/ctlscript.sh restart

Obtain a list of available services and operations by running the script without any arguments:

$ sudo /opt/bitnami/ctlscript.sh

How to access the administration panel?

Access the administration panel by browsing to http://SERVER-IP/cluster/.
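
This panel is served by Apache, which acts as a proxy to the YARN ResourceManager Web UI (see the Hadoop ports section). As a quick reachability check from your own machine (SERVER-IP is a placeholder), you could run:

$ curl -sI http://SERVER-IP/cluster/ | head -n 1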

How to create a full backup of Hadoop?

Backup

The Bitnami Hadoop Stack is self-contained and the simplest option for performing a backup is to copy or compress the Bitnami stack installation directory. To do so in a safe manner, you will need to stop all servers, so this method may not be appropriate if you have people accessing the application continuously.

Follow these steps:

  • Change to the directory in which you wish to save your backup:

      $ cd /your/directory
    
  • Stop all servers:

      $ sudo /opt/bitnami/ctlscript.sh stop
    
  • Create a compressed file with the stack contents:

      $ sudo tar -pczvf application-backup.tar.gz /opt/bitnami
    
  • Restart all servers:

      $ sudo /opt/bitnami/ctlscript.sh start
    

You should now download or transfer the application-backup.tar.gz file to a safe location.
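
For example, to copy the backup to your local machine over SCP (KEYFILE and SERVER-IP are placeholders for your SSH private key and server address):

$ scp -i KEYFILE bitnami@SERVER-IP:/your/directory/application-backup.tar.gz .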

Restore

Follow these steps:

  • Change to the directory containing your backup:

      $ cd /your/directory
    
  • Stop all servers:

      $ sudo /opt/bitnami/ctlscript.sh stop
    
  • Move the current stack to a different location:

      $ sudo mv /opt/bitnami /tmp/bitnami-backup
    
  • Uncompress the backup file to the original directory:

      $ sudo tar -pxzvf application-backup.tar.gz -C /
    
  • Start all servers:

      $ sudo /opt/bitnami/ctlscript.sh start
    

If you want to create only a database backup, refer to these instructions for MySQL and PostgreSQL.

How to create an SSL certificate?

OpenSSL is required to create an SSL certificate. A certificate request can then be sent to a certificate authority (CA) to be signed into a certificate. Alternatively, if you have your own certificate authority, you may sign it yourself, or you can use a self-signed certificate (for example, if you just want a test certificate or are setting up your own CA).

Follow the steps below:

  • Generate a new private key:

     $ sudo openssl genrsa -out /opt/bitnami/apache2/conf/server.key 2048
    
  • Create a certificate:

     $ sudo openssl req -new -key /opt/bitnami/apache2/conf/server.key -out /opt/bitnami/apache2/conf/cert.csr
    
    IMPORTANT: Enter the server domain name when the above command asks for the "Common Name".
  • Send cert.csr to the certificate authority. Once the certificate authority completes its checks (and receives payment, if applicable), it will issue your new certificate.

  • Until the certificate is received, create a temporary self-signed certificate:

     $ sudo openssl x509 -in /opt/bitnami/apache2/conf/cert.csr -out /opt/bitnami/apache2/conf/server.crt -req -signkey /opt/bitnami/apache2/conf/server.key -days 365
    
  • Back up your private key in a safe location after generating a password-protected version as follows:

     $ sudo openssl rsa -des3 -in /opt/bitnami/apache2/conf/server.key -out privkey.pem
    

    Note that if you use this encrypted key in the Apache configuration file, it will be necessary to enter the password manually every time Apache starts. Regenerate the key without password protection from this file as follows:

     $ sudo openssl rsa -in privkey.pem -out /opt/bitnami/apache2/conf/server.key
    

Find more information about certificates at http://www.openssl.org.
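
As a quick sanity check after generating or receiving a certificate, you can inspect its subject (Common Name) and validity dates; for example:

$ sudo openssl x509 -in /opt/bitnami/apache2/conf/server.crt -noout -subject -dates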

How to enable HTTPS support with SSL certificates?

TIP: If you wish to use a Let's Encrypt certificate, you will find specific instructions for enabling HTTPS support with Let's Encrypt SSL certificates in our Let's Encrypt guide.
NOTE: The steps below assume that you are using a custom domain name and that you have already configured the custom domain name to point to your cloud server.

Bitnami images come with SSL support already pre-configured and with a dummy certificate in place. Although this dummy certificate is fine for testing and development purposes, you will usually want to use a valid SSL certificate for production use. You can either generate this on your own (explained here) or you can purchase one from a commercial certificate authority.

Once you obtain the certificate and certificate key files, you will need to update your server to use them. Follow these steps to activate SSL support:

  • Use the table below to identify the correct locations for your certificate and configuration files.

    Current application URL: https://[custom-domain]/
      (Example: https://my-domain.com/ or https://my-domain.com/appname)
    Apache configuration file: /opt/bitnami/apache2/conf/bitnami/bitnami.conf
    Certificate file: /opt/bitnami/apache2/conf/server.crt
    Certificate key file: /opt/bitnami/apache2/conf/server.key
    CA certificate bundle file (if present): /opt/bitnami/apache2/conf/server-ca.crt
  • Copy your SSL certificate and certificate key file to the specified locations.

    NOTE: If you use different names for your certificate and key files, you should reconfigure the SSLCertificateFile and SSLCertificateKeyFile directives in the corresponding Apache configuration file to reflect the correct file names.
  • If your certificate authority has also provided you with a PEM-encoded Certificate Authority (CA) bundle, you must copy it to the location given in the table above. Then, modify the Apache configuration file to include the following line below the SSLCertificateKeyFile directive. Choose the correct directive based on your Apache version:

    Apache configuration file: /opt/bitnami/apache2/conf/bitnami/bitnami.conf
    Directive to include (Apache v2.4.8+): SSLCACertificateFile "/opt/bitnami/apache2/conf/server-ca.crt"
    Directive to include (Apache < v2.4.8): SSLCertificateChainFile "/opt/bitnami/apache2/conf/server-ca.crt"
    NOTE: If you use a different name for your CA certificate bundle, you should reconfigure the SSLCertificateChainFile or SSLCACertificateFile directives in the corresponding Apache configuration file to reflect the correct file name.
  • Once you have copied all the server certificate files, you may make them readable by the root user only with the following commands:

     $ sudo chown root:root /opt/bitnami/apache2/conf/server*
    
     $ sudo chmod 600 /opt/bitnami/apache2/conf/server*
    
  • Open port 443 in the server firewall. Refer to the FAQ for more information.

  • Restart the Apache server.

You should now be able to access your application using an HTTPS URL.

How to force HTTPS redirection with Apache?

Add the following to the top of the /opt/bitnami/hadoop/conf/httpd-prefix.conf file:

RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteRule ^/(.*) https://%{SERVER_NAME}/$1 [R,L]

After modifying the Apache configuration files, restart Apache to apply the changes.
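
For example, to restart only Apache using the control script described above:

$ sudo /opt/bitnami/ctlscript.sh restart apache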

How to debug Apache errors?

Once Apache starts, it creates two log files: /opt/bitnami/apache2/logs/access_log and /opt/bitnami/apache2/logs/error_log.

  • The access_log file is used to track client requests. When a client requests a document from the server, Apache records several parameters associated with the request in this file, such as: the IP address of the client, the document requested, the HTTP status code, and the current time.

  • The error_log file is used to record important events. This file includes error messages, startup messages, and any other significant events in the life cycle of the server. This is the first place to look when you run into a problem when using Apache.

You can also check the Apache configuration files for syntax errors. If no error is found, you will see a message similar to:

Syntax OK
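
Assuming the stack's bundled apachectl binary is in its usual location (an assumption about this stack's layout), one way to run this check is:

$ sudo /opt/bitnami/apache2/bin/apachectl configtest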

How to upload files to the server with SFTP?

Although you can use any SFTP/SCP client to transfer files to your server, the link below explains how to configure FileZilla (Windows, Linux and Mac OS X), WinSCP (Windows) and Cyberduck (Mac OS X). You will need your server's private SSH key to configure the SFTP client properly. Choose your preferred application and follow the steps in the link below to connect to the server through SFTP.

How to upload files to the server

How to connect to Hadoop from a different machine?

For security reasons, ports used by Hadoop cannot be accessed over a public IP address. To connect to Hadoop from a different machine, you must open the port of the service you want to access remotely. Refer to the FAQ for more information on this.

Check the Hadoop ports section to see the complete list of the most relevant ports in Hadoop.

IMPORTANT: Making this application's network ports public is a significant security risk. You are strongly advised to only allow access to those ports from trusted networks. If, for development purposes, you need to access from outside of a trusted network, please do not allow access to those ports via a public IP address. Instead, use a secure channel such as a VPN or an SSH tunnel. Follow these instructions to remotely connect safely and reliably.
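
For example, instead of opening a port publicly, you can reach the ResourceManager Web UI (port 8088) through an SSH tunnel, as in the earlier tunnel example (KEYFILE and SERVER-IP are placeholders), and then browse to http://localhost:8088 on your local machine:

$ ssh -N -L 8088:127.0.0.1:8088 -i KEYFILE bitnami@SERVER-IP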

How to create a Hadoop cluster with several servers?

It is possible to create a Hadoop cluster with several instances of the Bitnami Hadoop Stack, as long as the Hadoop daemons are properly configured.

Typical Hadoop clusters are divided into the following node roles:

  • Master nodes: NameNodes and ResourceManager servers, usually running one of these services per node.
  • Worker nodes: Acting as both DataNode and NodeManager on the same node.
  • Service nodes: Services such as the Application Timeline server, Web App Proxy server and MapReduce Job History server running on the same node.
  • Client: The node from which Hadoop jobs are submitted; it also has Hive installed.

Once you have decided on an architecture for your cluster, the Hadoop services running on each node must be able to communicate with each other. Each service operates on a different port, so when creating the cluster, ensure that you open the service ports on each node.

IMPORTANT: Hadoop requires you to use hostnames or IP addresses that are actually configured on the server's network interfaces. This typically means that you will not be able to use a public IP address and must use a private IP address instead.

  • Stop all services by running the following command on each node:

      $ sudo /opt/bitnami/ctlscript.sh stop
    
  • NameNode: Save the IP address of the node that will act as the NameNode. In this example, we will suppose that the IP address of the chosen NameNode is 192.168.1.2.
    • Change the fs.defaultFS property in the /opt/bitnami/hadoop/etc/hadoop/core-site.xml file, and set its value to the full HDFS URI to the node which will act as the NameNode:

      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.1.2:8020</value>
      </property>
      
    • Change the value of the dfs.namenode.http-address property in /opt/bitnami/hadoop/etc/hadoop/hdfs-site.xml to include the proper IP address:

      <property>
        <name>dfs.namenode.http-address</name>
        <value>192.168.1.2:9870</value>
      </property>
      
  • Secondary NameNode: Change the dfs.namenode.secondary.http-address property in the /opt/bitnami/hadoop/etc/hadoop/hdfs-site.xml file. Set the value to the appropriate IP address for the Secondary NameNode. For example, if the Secondary NameNode runs on the same node as the NameNode, the configuration file will contain the following:

    <property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>192.168.1.2:9868</value>
    </property>
    
  • ResourceManager: Add the property yarn.resourcemanager.hostname to the /opt/bitnami/hadoop/etc/hadoop/yarn-site.xml file. Set the value to the IP address of the node which will act as the ResourceManager. For example, if the IP address of the chosen ResourceManager is 192.168.1.3, the configuration file will contain the following:

    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>192.168.1.3</value>
    </property>
    
  • DataNode: Edit the dfs.datanode.address, dfs.datanode.http.address and dfs.datanode.ipc.address properties in the /opt/bitnami/hadoop/etc/hadoop/hdfs-site.xml file. Set the values to the IP address and port of the node which will act as the DataNode. For example, if the IP address of the chosen DataNode server is 192.168.1.4 and it listens to the default ports, the configuration file will contain the following:

    <property>
      <name>dfs.datanode.address</name>
      <value>192.168.1.4:9866</value>
    </property>
    <property>
      <name>dfs.datanode.http.address</name>
      <value>192.168.1.4:9864</value>
    </property>
    <property>
      <name>dfs.datanode.ipc.address</name>
      <value>192.168.1.4:9867</value>
    </property>
    
  • NodeManager: Add the property yarn.nodemanager.hostname to the /opt/bitnami/hadoop/etc/hadoop/yarn-site.xml file. Set the value to the IP address of the node which will act as the NodeManager. For example, if the IP address of the chosen NodeManager is 192.168.1.4, the same as the DataNode, the configuration file will contain the following:

    <property>
      <name>yarn.nodemanager.hostname</name>
      <value>192.168.1.4</value>
    </property>
    
  • Timeline server: Add the property yarn.timeline-service.hostname to the /opt/bitnami/hadoop/etc/hadoop/yarn-site.xml file. Set the value to the IP address of the node which will act as the Timeline server. For example, if the IP address of the chosen Timeline server is 192.168.1.5 and it listens to the default port, the configuration file will contain the following:

    <property>
      <name>yarn.timeline-service.hostname</name>
      <value>192.168.1.5</value>
    </property>
    
  • JobHistory server: Edit the mapreduce.jobhistory.address, mapreduce.jobhistory.admin.address and mapreduce.jobhistory.webapp.address properties in the /opt/bitnami/hadoop/etc/hadoop/mapred-site.xml file. Set the values to the IP address of the node which will act as the JobHistory server, and the corresponding ports for each service. For example, if the IP address of the chosen JobHistory server is 192.168.1.5 and the services listen to the default ports, the configuration file will contain the following:

    <property>
      <name>mapreduce.jobhistory.address</name>
      <value>192.168.1.5:10020</value>
    </property>
    <property>
      <name>mapreduce.jobhistory.admin.address</name>
      <value>192.168.1.5:10033</value>
    </property>
    <property>
      <name>mapreduce.jobhistory.webapp.address</name>
      <value>192.168.1.5:19888</value>
    </property>
    
  • Copy these configuration files to every node in the cluster. The configuration must be the same in all of them.
  • Configure start/stop scripts: Once you have decided on the architecture and applied the configuration files, you must disable the unnecessary services on each node. To do so:
    • Navigate to /opt/bitnami/hadoop/scripts on each server and determine if any of the startup scripts are not needed. If that is the case, rename them to something different. For instance, in order to disable all Yarn services, run the following command:

      $ sudo mv ctl-yarn.sh ctl-yarn.sh.disabled
      
    • If you want to selectively disable some daemons for a specific service, edit the appropriate start/stop script, look for the HADOOP_SERVICE_DAEMONS line and remove the daemons you want to disable from the list. For instance, to selectively disable YARN's NodeManager, you would change it from this:

      HADOOP_YARN_DAEMONS="$HADOOP_YARN_RESURCENODE_CLASS $HADOOP_YARN_NODEMANAGER_CLASS"
      

      To this:

      HADOOP_YARN_DAEMONS="$HADOOP_YARN_RESURCENODE_CLASS"
      
  • Disable Apache on the nodes where it is not needed: Apache is only used as a proxy to the YARN ResourceManager. If a node does not run the YARN ResourceManager, Apache can be disabled on it:

      $ sudo mv /opt/bitnami/apache2/scripts/ctl.sh /opt/bitnami/apache2/scripts/ctl.sh.disabled
    
  • On each node, start the services:

      $ sudo /opt/bitnami/ctlscript.sh start
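
Once all services are running, you can optionally verify from the NameNode (or any node with the Hadoop client tools on its PATH, as elsewhere in this guide) that the DataNodes have registered with HDFS. A minimal check:

$ hdfs dfsadmin -report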
    

How to run a test job in Hadoop?

You can run jobs in Hadoop from the same computer where it is installed.

Hadoop bundles many examples that you can try. For instance, there is an example that estimates the value of Pi. You can try it by running the following command:

$ hadoop jar /opt/bitnami/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 10 100

When the job finishes, you will see an output similar to:

Estimated value of Pi is 3.14800000000000000000

To see a MapReduce example involving HDFS, you can run the following commands, which count how many words beginning with the letter "c" appear in a simple tongue twister:

$ echo "can you can a can as a canner can can a can" | hadoop fs -put - /tmp/hdfs-example-input
$ hadoop jar /opt/bitnami/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep /tmp/hdfs-example-input /tmp/hdfs-example-output 'c[a-z]+'
$ hadoop fs -cat /tmp/hdfs-example-output/part-r-00000

You will see that the job produced the following output:

6       can
1       canner
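
When you are done, you can remove the example input and output from HDFS:

$ hadoop fs -rm -r /tmp/hdfs-example-input /tmp/hdfs-example-output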

How to connect to Hive?

The Bitnami Hadoop Stack includes Hive, Pig and Spark, and starts HiveServer2, Metastore and WebHCat by default.

How to connect to HiveServer2?

HiveServer2 is a server interface that enables remote clients to execute queries against Hive and retrieve the results. It listens to port 10000 by default.

In order to connect to HiveServer2, you have two options:

  • (Recommended): Connect to the HiveServer2 Thrift server (running on port 10000) through an SSH tunnel (refer to the FAQ for more information about SSH tunnels).
  • Open the HiveServer2 Thrift server's port 10000 for remote access (refer to the FAQ for more information about opening ports).

Once you have connected to the server through an SSH tunnel or opened the port for remote access, you can use the Beeline command-line utility. To connect to HiveServer2 using Beeline, run one of the following:

  • Connecting to HiveServer2 through an SSH tunnel:

    $ beeline -u jdbc:hive2://localhost:10000 -n hadoop
    
  • Connecting to HiveServer2 by opening the port 10000. (SERVER-IP is a placeholder, please replace it with the right value).

    $ beeline -u jdbc:hive2://SERVER-IP:10000
    

After a few seconds, you will see the prompt:

0: jdbc:hive2://localhost:10000>
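
From this prompt you can issue HiveQL statements interactively. Alternatively, Beeline can run a single statement non-interactively with its -e option; for example, assuming the SSH tunnel from the previous step is still open:

$ beeline -u jdbc:hive2://localhost:10000 -n hadoop -e "SHOW DATABASES;"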
How to access the HiveServer2 Web UI?

HiveServer2 has a Web UI which provides different features, such as logging, metrics and configuration information. It listens on port 10002. In order to access it, you have two options:

  • (Recommended) Access the HiveServer2 Web UI (running on port 10002) through an SSH tunnel (refer to the FAQ for more information about SSH tunnels).
  • Open the HiveServer2 port 10002 for remote access (refer to the FAQ for more information about opening ports).
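
For example, an SSH tunnel to the HiveServer2 Web UI could look like the following (KEYFILE and SERVER-IP are placeholders); you can then browse to http://localhost:10002 on your local machine:

$ ssh -N -L 10002:127.0.0.1:10002 -i KEYFILE bitnami@SERVER-IP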

How to access WebHCat?

HCatalog is a table and storage management layer for Hadoop. HCatalog is built on top of Metastore, another component of Hadoop. WebHCat is the REST API for HCatalog, and listens to port 50111 by default.

In order to access HCatalog, you have two options:

  • (Recommended): Access the WebHCat server (running on port 50111) through an SSH tunnel (refer to the FAQ for more information about SSH tunnels).
  • Open the WebHCat port 50111 for remote access (refer to the FAQ for more information about opening ports).

You can access WebHCat with the following commands:

  • Connecting to HCatalog through an SSH tunnel:

    $ curl -s 'http://localhost:50111/templeton/v1/status?user.name=hadoop'
    
  • Connecting to HCatalog by opening the port 50111 (SERVER-IP is a placeholder, please replace it with the right value).

    $ curl -s 'http://SERVER-IP:50111/templeton/v1/status?user.name=hadoop'
    

You should see the following output:

{"version":"v1","status":"ok"}'

How to connect to Pig?

The Bitnami Hadoop Stack includes Pig, a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

To use Pig, simply run:

$ pig

After a few moments, you will see the grunt prompt:

grunt>
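
You can also run a single Grunt shell command non-interactively with Pig's -e option; for example, as a quick check that Pig can reach HDFS, list the HDFS root directory:

$ pig -e 'fs -ls /'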

How to run Pig tutorial scripts?

In order to run the Pig tutorial scripts, you will first need to upload a file to HDFS:

$ hadoop fs -copyFromLocal /opt/bitnami/hadoop/pig/tutorial/data/excite.log.bz2 .

In this case we will run script1-hadoop.pig, as follows:

$ cd /opt/bitnami/hadoop/pig/tutorial
$ pig ./scripts/script1-hadoop.pig

The process takes a few minutes but, once it finishes, you will see output similar to the following, indicating success:

Input(s):
Successfully read 944954 records (10409092 bytes) from: "hdfs://localhost:8020/user/hadoop/excite.log.bz2"

Output(s):
Successfully stored 13530 records (659954 bytes) in: "hdfs://localhost:8020/user/hadoop/script1-hadoop-results"

(...)

2018-02-20 09:11:36,947 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2018-02-20 09:11:36,976 [main] INFO  org.apache.pig.Main - Pig script completed in 3 minutes, 21 seconds and 433 milliseconds (201433 ms)

How to connect to Spark shell?

Spark provides a shell, which allows you to run Spark interactively. You can access the Spark shell with the following command:

$ spark-shell

After some seconds, you will see the prompt:

scala>

How to run Spark examples?

The Bitnami Hadoop Stack includes Spark, a fast and general-purpose cluster computing system. Spark includes several example programs.

To run Spark examples, use the run-example program. For instance, to estimate a value for Pi, run the following test:

$ run-example SparkPi 10

After some seconds, Spark will output the result:

2018-02-15 12:37:29,410 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.697757 s
Pi is roughly 3.1426351426351427