Process Data with a Scalable Apache Spark Cluster on Kubernetes

Introduction

Apache Spark is a popular open source analytics engine for processing large data sets. It plays well with Python, Java, Scala and R, all of which are commonly used languages in the field of data science, and it integrates with a wide range of data sources, including Apache Cassandra and HDFS.

Apache Spark works best as a distributed system, so scalability is an important consideration when deploying an Apache Spark cluster. Kubernetes, which comes with easy scalability and scheduling features, makes an ideal companion for Apache Spark. By leveraging Kubernetes to scale out Apache Spark clusters, data scientists and engineers can achieve faster processing times and better resource utilization.

This guide walks you through the process of deploying an Apache Spark cluster on Kubernetes and submitting jobs to it. Bitnami's Apache Spark Helm chart makes this a quick and error-free process. By relying on Kubernetes for the cluster infrastructure, this approach avoids a single point of failure and makes it easier to scale out the cluster as more computing resources become necessary.

Assumptions and prerequisites

This guide makes the following assumptions:

  • You have a running Kubernetes cluster and the kubectl command-line tool installed and configured to communicate with it.
  • You have the Helm package manager installed.
  • You have Docker installed on your local machine (it is used in Step 1 to extract the example JAR file).

Step 1: Create a shared volume for applications and data

Typically, your Apache Spark application and its dependencies are packaged in a Java ARchive (JAR) file. This file, together with any other related data, needs to be available to the Apache Spark worker nodes. One simple approach is to make the necessary files available to the worker nodes using a shared volume such as a Kubernetes PersistentVolumeClaim (PVC).

Therefore, the first step is to create a PVC and copy the required files into it. Further, since each worker node of the cluster will access the PVC, it is important to create the PVC using a storage class that supports ReadWriteMany access, such as NFS.

  • Begin by installing the NFS Server Provisioner. The easiest way to get this running on any platform is with the stable Helm chart. Use the command below, remembering to adjust the storage size to reflect your cluster's settings:

    helm repo add stable https://charts.helm.sh/stable
    helm install nfs stable/nfs-server-provisioner \
      --set persistence.enabled=true,persistence.size=5Gi
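
    To confirm that the provisioner registered its storage class (the stable chart creates one named nfs by default, which is what the PVC below references), list it with the command below:

    kubectl get storageclass nfs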
    
  • Create a Kubernetes manifest file named spark-pvc.yml to configure an NFS-backed PVC and a pod that uses it, as below:

    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: spark-data-pvc
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 2Gi
      storageClassName: nfs
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: spark-data-pod
    spec:
      volumes:
        - name: spark-data-pv
          persistentVolumeClaim:
            claimName: spark-data-pvc
      containers:
        - name: inspector
          image: bitnami/minideb
          command:
            - sleep
            - infinity
          volumeMounts:
            - mountPath: "/data"
              name: spark-data-pv
    
  • Apply the manifest to the Kubernetes cluster:

    kubectl apply -f spark-pvc.yml
    

    This will create a pod named spark-data-pod with an attached PVC named spark-data-pvc. The PVC will be mounted at the /data mount point of the pod.
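
    Optionally, confirm that the PVC was bound to a volume before copying data into it. This check is not strictly required, but it catches storage provisioning problems early:

    kubectl get pvc spark-data-pvc

    The STATUS column of the output should show the value Bound.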

  • Create and prepare your application JAR file. This tutorial uses Apache Spark to count the occurrences of words in a text file. For this purpose, it uses the JavaWordCount application, which is one of the example applications included with Apache Spark.

    The JAR file containing the Apache Spark example applications can be obtained from the Apache Spark binary distribution, or from the Bitnami Apache Spark Docker image using the command below.

    docker run -v /tmp:/tmp -it bitnami/spark find /opt/bitnami/spark/examples/jars/ -name 'spark-examples*' -exec cp {} /tmp/my.jar \;
    

    This command mounts the /tmp directory on the Docker host within the running container and copies the JAR file with the Apache Spark example applications (including the JavaWordCount application) from the Docker container image to it, renaming it to my.jar.
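
    If you prefer to obtain the JAR file from the Apache Spark binary distribution instead, the sketch below shows one way to do it. The release version and archive URL are assumptions; adjust them to match the Apache Spark version of your deployment:

    # download and unpack the Apache Spark binary distribution
    wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
    tar -xzf spark-3.0.1-bin-hadoop2.7.tgz
    # copy the bundled examples JAR, renaming it to my.jar
    cp spark-3.0.1-bin-hadoop2.7/examples/jars/spark-examples_2.12-3.0.1.jar /tmp/my.jar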

  • If your application requires additional input or other files, create and prepare those files. The JavaWordCount example application used in this tutorial requires an input file containing one or more words. Create this file using the command below:

    echo "how much wood could a woodpecker chuck if a woodpecker could chuck wood" > /tmp/test.txt
    
  • Copy the JAR file containing the application, and any other required files, to the PVC using the mount point:

    kubectl cp /tmp/my.jar spark-data-pod:/data/my.jar
    kubectl cp /tmp/test.txt spark-data-pod:/data/test.txt
    
  • Verify that the data exists in the PVC by connecting to the pod's command-line shell and inspecting the /data directory:

    kubectl exec -it spark-data-pod -- ls -al /data
    

    The command output should display a directory listing containing the JAR file and the text file, as shown below:

PVC contents

  • Delete the pod, as it is no longer required. The PVC and the data stored on it remain intact and available to the Apache Spark cluster:

    kubectl delete pod spark-data-pod
    

Step 2: Deploy Apache Spark on Kubernetes using the shared volume

The next step is to deploy Apache Spark on your Kubernetes cluster and configure it to use the PVC created in the previous step. Bitnami's Apache Spark Helm chart gives you a ready-to-use deployment with minimal effort. This chart includes additional configuration parameters that make it easy to mount and share a persistent volume between worker nodes.

  • Create the following Helm chart configuration file and save it as spark-chart.yml:

    service:
      type: LoadBalancer
    worker:
      replicaCount: 3
      extraVolumes:
        - name: spark-data
          persistentVolumeClaim:
            claimName: spark-data-pvc
      extraVolumeMounts:
        - name: spark-data
          mountPath: /data
    

    The worker.extraVolumes parameter specifies the extra volumes to be added to the Apache Spark worker deployment, while the worker.extraVolumeMounts parameter specifies the mount points for the volumes in each worker node.

  • Deploy Apache Spark on the Kubernetes cluster using the Bitnami Apache Spark Helm chart and supply it with the configuration file above:

    helm repo add bitnami https://charts.bitnami.com/bitnami
    helm install spark bitnami/spark -f spark-chart.yml
    

    This command creates a four-node Apache Spark cluster with one master node and three worker nodes. The Apache Spark cluster will be available for external access through the load balancer IP address and each worker node in the cluster will be able to access the PVC created in Step 1 using the /data mount point.
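
    To verify that the workers mounted the shared volume, list its contents from one of the worker pods once they are running. The pod name spark-worker-0 below assumes a release named spark; run kubectl get pods to check the exact names in your deployment:

    kubectl exec spark-worker-0 -- ls /data

    The listing should include the my.jar and test.txt files copied in Step 1.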

    Warning

    Using a LoadBalancer service type will typically assign a static IP address for the Apache Spark master. Depending on your cloud provider's policies, you may incur additional charges for this static IP address.
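
    Because the worker deployment is managed by the Helm chart, scaling the cluster out later requires only a single upgrade command. The sketch below uses an example replica count of 5; values passed with --set override those in the configuration file:

    helm upgrade spark bitnami/spark -f spark-chart.yml --set worker.replicaCount=5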

  • Wait for the deployment to complete and then run the command below to obtain the external IP address for use with Apache Spark:

    kubectl get svc -l "app.kubernetes.io/instance=spark,app.kubernetes.io/name=spark"
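
    If you prefer to capture only the IP address for use in later commands, a jsonpath query such as the one below works with any LoadBalancer service. The service name spark-master-svc is the name the Bitnami chart typically generates for a release named spark, so verify it against the output of the previous command:

    kubectl get svc spark-master-svc -o jsonpath='{.status.loadBalancer.ingress[0].ip}'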
    
  • Browse to the load balancer IP and you should see the Apache Spark Web dashboard, which shows the status of workers and jobs, as shown below:

Spark dashboard

Warning

By default, Apache Spark has all its security disabled and you must secure the application yourself. Learn more about securing Apache Spark.

Step 3: Submit the application to Apache Spark

The next step is to submit the application to Apache Spark for processing. To do this, launch an Apache Spark client pod from the Bitnami Apache Spark container image and use the spark-submit script included in it to submit the application.

Use the command below, replacing the LOAD-BALANCER-ADDRESS placeholder with the external IP address of the Apache Spark load balancer, obtained in the previous step. Note that the application is submitted in cluster deploy mode, which means the driver process runs on one of the worker nodes; this is why its output is retrieved from a worker pod in Step 4.

kubectl run --namespace default spark-client --rm --tty -i --restart='Never' \
    --image docker.io/bitnami/spark:3.0.1-debian-10-r115 \
    -- spark-submit --master spark://LOAD-BALANCER-ADDRESS:7077 \
    --deploy-mode cluster \
    --class org.apache.spark.examples.JavaWordCount \
    /data/my.jar /data/test.txt

Here is an example of the output you will see:

Spark output

Navigate to the Apache Spark Web dashboard and confirm that the application was executed successfully:

Spark status

Step 4: View the output of the completed job

The output of the completed application is available on the worker node that executed it. Follow the steps below to view the output:

  • Navigate to the Apache Spark Web dashboard.
  • In the list of completed drivers, note the IP address and submission ID of the worker node that executed the application. This IP address is suffixed to the worker node name, as shown below:
Spark worker node

  • Execute the command below to identify the corresponding worker pod name. Replace the WORKER-NODE-ADDRESS placeholder with the worker node IP address obtained above:

    kubectl get pods -o wide | grep WORKER-NODE-ADDRESS
    
  • Start a new console session on the worker pod. Replace the WORKER-POD-NAME placeholder with the name of the pod obtained from the previous command.

    kubectl exec -it WORKER-POD-NAME -- bash
    
  • The output file is stored at /opt/bitnami/spark/work/SUBMISSION-ID/stdout in the worker pod. View it using the commands below, replacing the SUBMISSION-ID placeholder with the value obtained from the Apache Spark Web dashboard:

    cd /opt/bitnami/spark/work
    cat SUBMISSION-ID/stdout
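
    Alternatively, you can read the output file in a single command without starting an interactive session, using the same placeholders as above:

    kubectl exec WORKER-POD-NAME -- cat /opt/bitnami/spark/work/SUBMISSION-ID/stdout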
    

Here is an example of what you should see:

Spark job results

This tutorial used one of Apache Spark's built-in examples to demonstrate how to submit and process a simple word counting task on an Apache Spark cluster deployed with the Bitnami Helm chart. As illustrated, Bitnami's Apache Spark Helm chart makes deploying the cluster a quick and simple process, allowing you to focus your efforts and time on business logic rather than deployment configuration.

Useful links

To learn more about the topics discussed in this article, use the links below: