Build a Kafka, Spark and Solr Data Analytics Platform using Deployment Blueprints

Introduction

Enterprise applications rely on large amounts of data that must be distributed, processed, and stored. Data platforms offer data management services via a combination of open source and commercially supported software stacks. These services accelerate the development and deployment of data-hungry business applications.

Building a containerized data analytics platform from different software stacks comes with several deployment challenges: for example, stitching together, sizing, and placing multiple data software runtimes across Kubernetes nodes. For a better understanding of these challenges and how deployment blueprints solve them, check out this guide.

The data platform deployment blueprints available in the Bitnami Application Catalog help you build optimal, resilient data platforms by covering the following:

  • Pod placement rules: Affinity rules to ensure placement diversity, preventing single points of failure and optimizing load distribution
  • Pod resource sizing rules: Optimized Pod and JVM sizing settings for optimal performance and efficient resource usage
  • Default settings: To ensure Pod access security
  • Optional Tanzu Observability framework configuration: Out of the box dashboards to monitor the data platform
  • Use of curated, trusted Bitnami images: To make the data platform secure

In addition, all the blueprints are validated and tested to provide Kubernetes node count and sizing recommendations to facilitate cloud platform capacity planning. The goal is to optimize the server and storage footprint to minimize infrastructure cost.

Bitnami's Data Platform Blueprint 1 with Kafka-Spark-Solr enables fully automated deployment of a multi-stack data platform on a multi-node Kubernetes cluster, covering the following software components:

  • Apache Kafka: Data distribution bus with buffering capabilities
  • Apache Spark: In-memory data analytics
  • Apache Solr: Data persistence and search

This guide walks you through the steps to deploy this blueprint on a Kubernetes Cluster.

Assumptions and prerequisites

This guide makes the following assumptions:

  • You have a multi-node Kubernetes cluster (Kubernetes 1.12+) with one control plane node (best-effort-small) and three worker nodes (best-effort-xlarge), and Helm 3.1.0 installed.
  • You have the kubectl command line (kubectl CLI) installed and configured to work with your cluster.
  • You have a Docker environment installed and configured. Learn more about installing Docker.
  • You have PersistentVolume (PV) provisioner support in the underlying infrastructure and support for ReadWriteMany volumes for deployment scaling.
  • You have a Tanzu Observability Cluster up and running (only required when enabling the Tanzu Observability framework for the data platform).
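
Before proceeding, a quick sanity check of these prerequisites can save debugging time later. The commands below are standard kubectl and Helm calls; the output will vary by environment:

# Verify cluster connectivity and that all nodes are Ready
kubectl get nodes -o wide

# Confirm the Helm client version (3.1.0 or later)
helm version --short

# Confirm that a StorageClass exists for dynamic PersistentVolume provisioning
kubectl get storageclass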

Deploy the Data Platform Blueprint 1 on Kubernetes

Tip

This guide uses the VMware Tanzu Kubernetes Guest Cluster as the underlying infrastructure for the deployment of the data platform.

This guide considers the use case of building a data platform with Kafka-Spark-Solr, which could be used for data and application evaluation, development, and functional testing.

In order to build this data platform, the Kubernetes cluster should have the following configuration:

  • 1 control plane node (2 CPU, 4Gi memory)
  • 3 worker nodes (4 CPU, 32Gi memory)
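
You can confirm that your nodes meet this sizing with a standard kubectl query (node names will differ in your environment):

# List allocatable CPU and memory per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory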

Deployment Architecture

The diagram below depicts the deployment architecture.

Blueprint architecture

Below is the list of Kubernetes objects deployed by this chart:

  • Apache ZooKeeper with 3 nodes to be used for both Apache Kafka and Apache Solr
  • Apache Kafka with 3 nodes using the Apache ZooKeeper deployment above
  • Apache Solr with 2 nodes using the Apache ZooKeeper deployment above
  • Apache Spark with 1 master and 2 worker nodes
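
These components and replica counts are the chart defaults. If you want to review or customize them before installing, you can inspect the chart's values after adding the Bitnami repository as shown below:

# Print the chart's configurable default values (requires the Bitnami repo added below)
helm show values bitnami/dataplatform-bp1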

Use the commands below to deploy the Kafka-Spark-Solr Data Platform Blueprint on Kubernetes:

# Add Bitnami Helm repo
helm repo add bitnami https://charts.bitnami.com/bitnami

# Update the repo index if you have already added the Bitnami Helm repo
helm repo update

# Install the chart in default namespace
helm install dp bitnami/dataplatform-bp1

A resilient, optimally sized small data platform comprising Kafka, Spark, and Solr is up and running on the Kubernetes cluster within minutes, using a single command!
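
To confirm the deployment, you can watch the pods come up and check the release status. This is a quick sketch, assuming the release name dp used above and the standard app.kubernetes.io/instance label that Helm-based Bitnami charts apply:

# Watch the release roll out; all pods should reach the Running state and become Ready
kubectl get pods -l app.kubernetes.io/instance=dp --watch

# Review the release status and the services it exposes
helm status dp
kubectl get svc -l app.kubernetes.io/instance=dp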

Deploy the Data Platform Blueprint 1 with Tanzu Observability Framework

Tanzu Observability is an enterprise-grade observability solution. Bitnami Data Platform Blueprint 1 with Kafka-Spark-Solr has an out-of-the-box integration with Tanzu Observability that is turned off by default, but can be enabled via certain parameters.

Once you have enabled the observability framework, you can use the out-of-the-box dashboards to monitor your data platform, from viewing the health and utilization of the data platform at a high level to taking an in-depth view of each software stack runtime.

Deployment Architecture with Tanzu Observability Framework

The diagram below depicts the Data Platform architecture with the Tanzu Observability framework.

Blueprint+Observability architecture

Below is the list of Kubernetes objects that this chart deploys:

  • Apache ZooKeeper with 3 nodes to be used for both Apache Kafka and Apache Solr
  • Apache Kafka with 3 nodes using the Apache ZooKeeper deployment above
  • Apache Solr with 2 nodes using the Apache ZooKeeper deployment above
  • Apache Spark with 1 master and 2 worker nodes
  • Wavefront Collector DaemonSet to feed runtime metrics into the Tanzu Observability service
  • Wavefront Proxy

Warning

Ensure that you have a Tanzu Observability cluster up and running before starting the deployment. The API token for your Tanzu Observability cluster user can be fetched from the "User -> Settings -> Username -> API Access" menu path in the Tanzu Observability user interface.

Proceed with the deployment using the commands shown below. Replace the CLUSTER-NAME, CLUSTER-URL-PREFIX, and API-TOKEN placeholders with your cluster name, your Tanzu Observability cluster URL prefix on wavefront.com, and your API token, respectively.

# Install the chart with TO framework.
helm install my-release bitnami/dataplatform-bp1 \
    --set kafka.metrics.kafka.enabled=true \
    --set kafka.metrics.jmx.enabled=true \
    --set spark.metrics.enabled=true \
    --set solr.exporter.enabled=true \
    --set solr.authentication.enabled=false \
    --set wavefront.enabled=true \
    --set wavefront.clusterName=CLUSTER-NAME \
    --set wavefront.wavefront.url=https://CLUSTER-URL-PREFIX.wavefront.com \
    --set wavefront.wavefront.token=API-TOKEN

Warning

The Solr exporter is not supported when deploying Apache Solr with authentication enabled, and is therefore disabled in the previous command.
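
Once the command completes, you can verify that the metrics pipeline components are running. This is a quick check, assuming the release name my-release used above and the standard Helm instance label:

# Confirm the Wavefront collector DaemonSet and proxy pods are up
kubectl get pods -l app.kubernetes.io/instance=my-release
kubectl get daemonset --all-namespaces | grep -i wavefront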

Once the data platform is up and running, log in to the Tanzu Observability user interface and navigate to the "Integrations -> Big Data" page to find the "Data Platforms" tile shown in the image below:

Wavefront dashboard

Click the "Data Platforms" tile and navigate to the "Dashboards" tab. Click the "Data Platform Blueprint1 Kafka-Spark-Solr Dashboard" link to view the dashboard.

Wavefront dashboard

Here are the important sections of the dashboard:

  • Overview: This section provides a single view of the data platform cluster, including the health and utilization of the applications that together form the cluster.
Overview section
  • Kubernetes platform: This section provides a detailed view of the underlying Kubernetes cluster, specifically the node-to-pod mapping so you can understand the placement of application pods on Kubernetes cluster nodes.
Kubernetes section
  • Individual applications: This section provides a detailed view of the individual application metrics. In this case, it displays detailed metrics for Apache Kafka, Apache Spark and Apache Solr.

Apache Kafka:

Kafka application details

Apache Spark:

Spark application details

Apache Solr:

Solr application details

Conclusion

As illustrated, you can build a data platform with Kafka-Spark-Solr, together with the Tanzu Observability framework, on a Kubernetes cluster within minutes using a single command.

The Bitnami Data Platform Blueprint 1 Helm chart makes this a quick, easy, and secure process, allowing you to focus your time and effort on business logic rather than on deployment configuration and multistep runbooks. Best of all, out-of-the-box dashboards are available in Tanzu Observability to monitor your data platform from day one.

This tutorial is part of the series

Build a Data Analytics Platform with Bitnami Deployment Blueprints

Learn how deployment blueprints are a practical solution to simplify the complexity that accompanies new data analytics platform deployments.