
Bitnami Hadoop for 1&1 Cloud Platform

Description

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.

First steps with the Bitnami Hadoop Stack

Welcome to your new Bitnami application running on 1&1! Here are a few questions (and answers!) you might need when first starting with your application.

What credentials do I need?

You need two sets of credentials:

  • The application credentials, consisting of a username and password. These credentials allow you to log in to your new Bitnami application.

  • The server credentials, consisting of an SSH username and password. These credentials allow you to log in to your 1&1 Cloud Platform server using an SSH client and execute commands on the server using the command line.

What is the administrator username set for me to log in to the application for the first time?

Username: user

What is the administrator password?

What SSH username should I use for secure shell access to my application?

SSH username: root

How do I get my SSH key or password?

What are the default ports?

A port is an endpoint of communication in an operating system that identifies a specific process or a type of service. Bitnami stacks include several services or servers that require a port.

If you need to open additional ports, follow the instructions in the FAQ to learn how to open server ports for remote access.

Port 22 is the default port for SSH connections.

Bitnami opens some ports for the main servers. These are the ports opened by default: 80, 443.
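
To quickly confirm that the web server responds on one of these ports, you can send a request from your own machine. This is only a minimal check; SERVER-IP is a placeholder for your server's public IP address:

$ # Fetch only the HTTP response headers from the server on port 80
$ curl -I http://SERVER-IP/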

What is the default configuration?

The stack provides a web panel to control the status of Hadoop. To access it, you must authenticate with the username user. Bitnami images are built so that a random password is configured during the first boot.

WebUI

The default data directory is located at /opt/bitnami/hadoop/data.

Hadoop version

In order to see which Hadoop version is running, you can execute the following:

$ hadoop version

Hadoop configuration files

The Hadoop configuration files are located at /opt/bitnami/hadoop/etc/hadoop.

Hadoop has several global and per-service configuration files. The most relevant ones are:

  • /opt/bitnami/hadoop/etc/hadoop/hadoop-env.sh and /opt/bitnami/hadoop/etc/hadoop/yarn-env.sh: configuration options for the scripts found in /opt/bitnami/hadoop/bin.
  • /opt/bitnami/hadoop/etc/hadoop/*-site.xml: site-specific configuration for each Hadoop service.

Find more details about how to configure these settings in Hadoop's official documentation.
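
As a quick sanity check, you can ask Hadoop which value a given property resolves to with the hdfs getconf tool. A minimal example, using the standard fs.defaultFS property that defines the HDFS endpoint:

$ # Print the HDFS endpoint configured in core-site.xml
$ hdfs getconf -confKey fs.defaultFS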

Hadoop ports

Each daemon in Hadoop listens to a different port. The most relevant ones are:

  • ResourceManager:
    • Service: 8032
    • Web UI: 8088
  • NameNode:
    • Metadata: 9000
    • Web UI: 50070
  • Secondary NameNode:
    • Metadata: 50090
  • DataNode:
    • Data transfer: 50010
    • Metadata: 50020
    • Web UI: 50075
  • Timeline Server:
    • Service: 10200
    • Web UI: 8188
  • Hive:
    • Hiveserver2 binary: 10000
    • Hiveserver2 HTTP: 10001
    • Metastore: 9083
    • WebHCat: 50111
    • Derby DB: 1527

All Hadoop ports are closed to remote access by default.
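
To see which daemons are actually listening on the server, you can list the open TCP sockets locally. This is a quick check using the standard ss utility; the exact set of ports depends on which services are running:

$ # List listening TCP sockets and the processes that own them
$ sudo ss -ltnp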

Hadoop log file

The Hadoop log files for each service are created at /opt/bitnami/hadoop/logs.
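
For example, to follow a service's log while reproducing a problem, you can tail the corresponding file in that directory. This is only a sketch; the exact log file names depend on the host name and on which services are enabled:

$ # List the available log files, then follow one of them
$ ls /opt/bitnami/hadoop/logs
$ sudo tail -f /opt/bitnami/hadoop/logs/*namenode*.log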

How to start or stop the services?

Each Bitnami stack includes a control script that lets you easily stop, start and restart services. The script is located at /opt/bitnami/ctlscript.sh. Call it without any service name arguments to start all services:

$ sudo /opt/bitnami/ctlscript.sh start

Or use it to restart a single service, such as Apache only, by passing the service name as argument:

$ sudo /opt/bitnami/ctlscript.sh restart apache

Use this script to stop all services:

$ sudo /opt/bitnami/ctlscript.sh stop

Restart the services by running the script without any service name arguments:

$ sudo /opt/bitnami/ctlscript.sh restart

Obtain a list of available services and operations by running the script without any arguments:

$ sudo /opt/bitnami/ctlscript.sh
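
The script also accepts a status operation, which is useful for checking whether all services are running (assuming your version of ctlscript.sh supports it, as most Bitnami stacks do):

$ # Show the current status of all managed services
$ sudo /opt/bitnami/ctlscript.sh status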

How to access the administration panel?

Access the administration panel by browsing to http://SERVER-IP/cluster/.

How to create a full backup of Hadoop?

Backup

The Bitnami Hadoop Stack is self-contained and the simplest option for performing a backup is to copy or compress the Bitnami stack installation directory. To do so in a safe manner, you will need to stop all servers, so this method may not be appropriate if you have people accessing the application continuously.

Follow these steps:

  • Change to the directory in which you wish to save your backup:

      $ cd /your/directory
    
  • Stop all servers:

      $ sudo /opt/bitnami/ctlscript.sh stop
    
  • Create a compressed file with the stack contents:

      $ sudo tar -pczvf application-backup.tar.gz /opt/bitnami
    
  • Restart all servers:

      $ sudo /opt/bitnami/ctlscript.sh start
    

You should now download or transfer the application-backup.tar.gz file to a safe location.
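
Before relying on the backup, it is worth verifying that the archive was created correctly. A minimal check that lists the first few entries stored in it:

$ # List the first entries of the backup archive
$ tar -tzf application-backup.tar.gz | head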

Restore

Follow these steps:

  • Change to the directory containing your backup:

      $ cd /your/directory
    
  • Stop all servers:

      $ sudo /opt/bitnami/ctlscript.sh stop
    
  • Move the current stack to a different location:

      $ sudo mv /opt/bitnami /tmp/bitnami-backup
    
  • Uncompress the backup file to the original directory:

      $ sudo tar -pxzvf application-backup.tar.gz -C /
    
  • Start all servers:

      $ sudo /opt/bitnami/ctlscript.sh start
    

If you want to create only a database backup, refer to these instructions for MySQL and PostgreSQL.

How to enable HTTPS support with SSL certificates?

NOTE: The steps below assume that you are using a custom domain name and that you have already configured the custom domain name to point to your cloud server.

Bitnami images come with SSL support already pre-configured and with a dummy certificate in place. Although this dummy certificate is fine for testing and development purposes, you will usually want to use a valid SSL certificate for production use. You can either generate this on your own (explained here) or you can purchase one from a commercial certificate authority.

Once you obtain the certificate and certificate key files, you will need to update your server to use them. Follow these steps to activate SSL support:

  • Use the table below to identify the correct locations for your certificate and configuration files.

    Current application URL: https://[custom-domain]/
      Example: https://my-domain.com/ or https://my-domain.com/appname
    Apache configuration file: /opt/bitnami/apache2/conf/bitnami/bitnami.conf
    Certificate file: /opt/bitnami/apache2/conf/server.crt
    Certificate key file: /opt/bitnami/apache2/conf/server.key
    CA certificate bundle file (if present): /opt/bitnami/apache2/conf/server-ca.crt
  • Copy your SSL certificate and certificate key file to the specified locations.

    NOTE: If you use different names for your certificate and key files, you should reconfigure the SSLCertificateFile and SSLCertificateKeyFile directives in the corresponding Apache configuration file to reflect the correct file names.
  • If your certificate authority has also provided you with a PEM-encoded Certificate Authority (CA) bundle, you must copy it to the correct location in the previous table. Then, modify the Apache configuration file to include the following line below the SSLCertificateKeyFile directive. Choose the correct directive based on your scenario and Apache version:

    Apache configuration file: /opt/bitnami/apache2/conf/bitnami/bitnami.conf
    Directive to include (Apache v2.4.8+): SSLCACertificateFile "/opt/bitnami/apache2/conf/server-ca.crt"
    Directive to include (Apache < v2.4.8): SSLCertificateChainFile "/opt/bitnami/apache2/conf/server-ca.crt"
    NOTE: If you use a different name for your CA certificate bundle, you should reconfigure the SSLCertificateChainFile or SSLCACertificateFile directives in the corresponding Apache configuration file to reflect the correct file name.
  • Once you have copied all the server certificate files, you may make them readable by the root user only with the following commands:

     $ sudo chown root:root /opt/bitnami/apache2/conf/server*
    
     $ sudo chmod 600 /opt/bitnami/apache2/conf/server*
    
  • Open port 443 in the server firewall. Refer to the FAQ for more information.

  • Restart the Apache server.

You should now be able to access your application using an HTTPS URL.
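
As a quick verification, you can inspect the certificate that the server is now presenting. A sketch using the openssl client; replace my-domain.com with your actual domain name:

$ # Print the subject and validity dates of the certificate served on port 443
$ echo | openssl s_client -connect my-domain.com:443 -servername my-domain.com 2>/dev/null | openssl x509 -noout -subject -dates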

How to create an SSL certificate?

OpenSSL is required to create an SSL certificate. You can send the resulting certificate request to a certificate authority (CA) to have it signed into a certificate, sign it yourself if you run your own CA, or use a self-signed certificate (for example, for testing or while setting up your own CA).

Follow the steps below:

  • Generate a new private key:

     $ sudo openssl genrsa -out /opt/bitnami/apache2/conf/server.key 2048
    
  • Create a certificate:

     $ sudo openssl req -new -key /opt/bitnami/apache2/conf/server.key -out /opt/bitnami/apache2/conf/cert.csr
    
    IMPORTANT: Enter the server domain name when the above command asks for the "Common Name".
  • Send cert.csr to the certificate authority. Once the certificate authority completes its checks (and receives payment, if applicable), it will issue your new certificate.

  • Until the certificate is received, create a temporary self-signed certificate:

     $ sudo openssl x509 -in /opt/bitnami/apache2/conf/cert.csr -out /opt/bitnami/apache2/conf/server.crt -req -signkey /opt/bitnami/apache2/conf/server.key -days 365
    
  • Back up your private key in a safe location after generating a password-protected version as follows:

     $ sudo openssl rsa -des3 -in /opt/bitnami/apache2/conf/server.key -out privkey.pem
    

    Note that if you use this encrypted key in the Apache configuration file, it will be necessary to enter the password manually every time Apache starts. Regenerate the key without password protection from this file as follows:

     $ sudo openssl rsa -in privkey.pem -out /opt/bitnami/apache2/conf/server.key
    

Find more information about certificates at http://www.openssl.org.
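
Before configuring Apache with the new files, you can also check that the certificate and the private key belong together. Both commands should print the same digest:

$ # The two MD5 digests must match for the certificate/key pair to be valid
$ sudo openssl x509 -noout -modulus -in /opt/bitnami/apache2/conf/server.crt | openssl md5
$ sudo openssl rsa -noout -modulus -in /opt/bitnami/apache2/conf/server.key | openssl md5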

How to force HTTPS redirection?

Add the following to the top of the /opt/bitnami/hadoop/conf/httpd-prefix.conf file:

RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteRule ^/(.*) https://%{SERVER_NAME}/$1 [R,L]

After modifying the Apache configuration files, restart Apache to apply the changes.
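
For example, using the control script described earlier:

$ # Restart only the Apache service so the new rewrite rules take effect
$ sudo /opt/bitnami/ctlscript.sh restart apache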

How to debug Apache errors?

Once Apache starts, it will create two log files at /opt/bitnami/apache2/logs/access_log and /opt/bitnami/apache2/logs/error_log respectively.

  • The access_log file is used to track client requests. When a client requests a document from the server, Apache records several parameters associated with the request in this file, such as: the IP address of the client, the document requested, the HTTP status code, and the current time.

  • The error_log file is used to record important events. This file includes error messages, startup messages, and any other significant events in the life cycle of the server. This is the first place to look when you run into a problem when using Apache.
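
You can also check the syntax of the Apache configuration files before restarting the server. This is a minimal check, assuming the bundled apachectl binary lives in the stack's apache2/bin directory:

$ # Validate the Apache configuration without restarting the server
$ sudo /opt/bitnami/apache2/bin/apachectl configtest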

If no error is found, you will see a message similar to:

Syntax OK

How to upload files to the server with SFTP?

Although you can use any SFTP/SCP client to transfer files to your server, the link below explains how to configure FileZilla (Windows, Linux and Mac OS X), WinSCP (Windows) and Cyberduck (Mac OS X). You must use your server's private SSH key to configure the SFTP client properly. Choose your preferred application and follow the steps in the link below to connect to the server through SFTP.

How to upload files to the server

How to connect to Hadoop from a different machine?

For security reasons, the Hadoop ports in this solution cannot be accessed over a public IP address. To connect to Hadoop from a different machine, you must open the port of the service you want to access remotely. Refer to the FAQ for more information on this.

Check the What is the default configuration section to see the complete list of the most relevant ports in Hadoop.

IMPORTANT: Making this application's network ports public is a significant security risk. You are strongly advised to only allow access to those ports from trusted networks. If, for development purposes, you need to access from outside of a trusted network, please do not allow access to those ports via a public IP address. Instead, use a secure channel such as a VPN or an SSH tunnel. Follow these instructions to remotely connect safely and reliably.
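
For example, to reach the ResourceManager web UI (port 8088) without opening the port publicly, you can forward it to your local machine over SSH. A sketch; SERVER-IP is a placeholder and root is the SSH username listed above:

$ # Forward local port 8088 to the ResourceManager web UI on the server
$ ssh -N -L 8088:127.0.0.1:8088 root@SERVER-IP

While the tunnel is open, the UI is available at http://127.0.0.1:8088/ in your local browser.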

How to run a test job in Hadoop?

You can run jobs in Hadoop from the same computer where it is installed. For example, to run the pi example bundled with Hadoop, run the following command:

$ hadoop jar /opt/bitnami/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 10 100

When the job finishes, you will see an output similar to:

Estimated value of Pi is 3.14800000000000000000

To run an example involving HDFS, you can run the following commands:

$ hadoop fs -mkdir /user
$ hadoop fs -mkdir /user/hadoop
$ hadoop fs -put /opt/bitnami/hadoop/etc/hadoop /user/hadoop/input
$ hadoop jar /opt/bitnami/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep /user/hadoop/input /user/hadoop/output 'dfs[a-z.]+'
$ hadoop fs -cat /user/hadoop/output/part-r-00000

How to test Hive?

The Bitnami Hadoop Stack includes Hive, Pig and Spark, and starts Hiveserver2, Metastore and WebHCat by default.

Hiveserver2

Hiveserver2 is a server interface that enables remote clients to execute queries against Hive and retrieve the results. It listens on port 10000 by default.

In order to connect to Hiveserver2, you have two options:

  • (Recommended): Access Hiveserver2 through an SSH tunnel (refer to the FAQ for more information about SSH tunnels).
  • Open the port 10000 for remote access (refer to the FAQ for more information about opening ports).

Once you have connected to the server through an SSH tunnel or opened the port for remote access, you can use the Beeline command-line utility. To connect to Hiveserver2 using Beeline, run the following:

  • Connecting to Hiveserver2 through an SSH tunnel:

    $ beeline -u jdbc:hive2://127.0.0.1:10000 -n hadoop
    

    You will see an output similar to this:

    Connecting to jdbc:hive2://localhost:10000
    16/02/22 18:57:13 INFO jdbc.Utils: Supplied authorities: localhost:10000
    16/02/22 18:57:13 INFO jdbc.Utils: Resolved authority: localhost:10000
    16/02/22 18:57:14 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10000
    Connected to: Apache Hive (version 1.2.1)
    Driver: Spark Project Core (version 1.6.0)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Beeline version 1.6.0 by Apache Hive
    0: jdbc:hive2://localhost:10000>
    
  • Connecting to Hiveserver2 by opening the port 10000. (SERVER-IP is a placeholder, please replace it with the right value).

    $ beeline -u jdbc:hive2://SERVER-IP:10000
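
Once connected, you can also run a single query non-interactively with Beeline's -e option. A minimal example over the SSH tunnel:

$ # Run a single HiveQL statement and exit
$ beeline -u jdbc:hive2://127.0.0.1:10000 -n hadoop -e 'SHOW DATABASES;'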
    

Metastore and WebHCat

HCatalog is a table and storage management layer for Hadoop, built on top of the Hive Metastore. WebHCat is the REST API for HCatalog and listens on port 50111 by default.

In order to connect to HCatalog, you have two options:

  • (Recommended): Access HCatalog through an SSH tunnel (refer to the FAQ for more information about SSH tunnels).
  • Open the port 50111 for remote access (refer to the FAQ for more information about opening ports).

You can test WebHCat with the following commands:

  • Connecting to HCatalog through an SSH tunnel:

    $ curl -s 'http://127.0.0.1:50111/templeton/v1/status?user.name=hadoop'
    
  • Connecting to HCatalog by opening the port 50111 (SERVER-IP is a placeholder, please replace it with the right value).

    $ curl -s 'http://SERVER-IP:50111/templeton/v1/status?user.name=hadoop'
    

You will see a reply similar to the following:

{"version":"v1","status":"ok"}

How to test Pig?

The Bitnami Hadoop Stack includes Pig, a platform for analyzing large data sets that consist of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

To use Pig, simply run:

$ pig

After a few moments, you will see the grunt prompt:

grunt>

You can test your Pig installation by running the following command:

$ pig /opt/bitnami/hadoop/pig/tutorial/scripts/script1-hadoop.pig

You will see an output similar to the following:

16/02/22 18:58:21 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
16/02/22 18:58:21 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
16/02/22 18:58:21 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2016-02-22 18:58:21,844 [main] INFO  org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2016-02-22 18:58:21,844 [main] INFO  org.apache.pig.Main - Logging error messages to: /opt/hadoop-2.7.1-3/hadoop/pig/logs/pig_1456167501843.log
2016-02-22 18:58:23,964 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2016-02-22 18:58:24,770 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2016-02-22 18:58:24,770 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-02-22 18:58:24,770 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2016-02-22 18:58:28,513 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-02-22 18:58:28,519 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 101: file './tutorial.jar' does not exist.
Details at logfile: /opt/hadoop-2.7.1-3/hadoop/pig/logs/pig_1456167501843.log
2016-02-22 18:58:29,578 [main] INFO  org.apache.pig.Main - Pig script completed in 7 seconds and 466 milliseconds (7466 ms)

How to test Spark?

The Bitnami Hadoop Stack includes Spark, a fast and general-purpose cluster computing system. Spark includes several example programs.

You can test your Spark installation with the following command:

$ run-example SparkPi 10

You will see some messages:

16/02/22 19:05:19 INFO spark.SparkContext: Running Spark version 1.6.0
16/02/22 19:05:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/22 19:05:21 INFO spark.SecurityManager: Changing view acls to: root
16/02/22 19:05:21 INFO spark.SecurityManager: Changing modify acls to: root
16/02/22 19:05:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/02/22 19:05:22 INFO util.Utils: Successfully started service 'sparkDriver' on port 51306.
16/02/22 19:05:23 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/02/22 19:05:23 INFO Remoting: Starting remoting
16/02/22 19:05:24 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@172.17.0.2:48109]
16/02/22 19:05:24 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 48109.
16/02/22 19:05:24 INFO spark.SparkEnv: Registering MapOutputTracker
16/02/22 19:05:24 INFO spark.SparkEnv: Registering BlockManagerMaster
16/02/22 19:05:24 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-7078d4a0-86bb-42c2-88a7-235f5861e6f8
16/02/22 19:05:24 INFO storage.MemoryStore: MemoryStore started with capacity 511.1 MB
16/02/22 19:05:24 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/02/22 19:05:24 INFO server.Server: jetty-8.y.z-SNAPSHOT

After a while, Spark will output the result:

16/02/22 19:05:30 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 2.396073 s
Pi is roughly 3.141816