Index and Query Data with a Scalable Bitnami Apache Solr Deployment on Kubernetes

Apache Solr is a powerful and flexible open-source search engine based on Lucene. It supports full-text search and geospatial search, and it can accept input data for indexing using standard formats like XML, JSON and CSV. It is also highly configurable and scalable, making it ideal for use in enterprise environments.

The quickest way to deploy Apache Solr for enterprise use is with Bitnami's Apache Solr Helm chart for Kubernetes. This chart deploys the most recent and secure version of Apache Solr on a Kubernetes cluster using the Helm package manager. It pre-configures the deployment in line with current security and scalability best practices, and allows significant additional customizations through a large number of optional chart parameters.

This article walks you through the process of deploying Apache Solr on Kubernetes using the Bitnami Apache Solr Helm chart. It also shows you how to use the deployed Apache Solr installation to index and query a data set using Apache Solr's built-in tools.

Assumptions and prerequisites

This article assumes that:

Step 1: Deploy Apache Solr on Kubernetes

Follow the steps below to deploy Apache Solr on Kubernetes with Bitnami's Apache Solr Helm chart:

  • Add the Bitnami chart repository to Helm:

    helm repo add bitnami https://charts.bitnami.com/bitnami
    
  • Execute the following command to deploy Apache Solr. Replace the ADMIN-PASSWORD placeholder with a custom password for the Apache Solr administrator account.

    helm install solr bitnami/solr \
      --set authentication.adminPassword=ADMIN-PASSWORD
    

    Wait for a few minutes until the chart is deployed. You can also use the command below to check the status of the deployment:

    kubectl get pods | grep solr
    
  • By default, the Apache Solr service is exposed on a cluster IP address at port 8983. Forward this service port to port 8983 on the host, so that you can access the Apache Solr service via port 8983 of the kubectl host.

    kubectl port-forward --address=0.0.0.0 svc/solr 8983:8983 &
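
Before moving on, it can help to confirm that Apache Solr actually answers on the forwarded port. The following is a minimal sketch, not part of the chart or of Apache Solr: the wait_for_solr function name, the retry defaults and the choice of the /solr/admin/info/system endpoint as a health check are assumptions made for illustration.

```shell
# Poll a Solr URL until it answers with HTTP 200, or give up.
# Usage: wait_for_solr URL USER:PASSWORD [ATTEMPTS] [DELAY_SECONDS]
# (wait_for_solr is a hypothetical helper, not a Solr or Bitnami tool.)
wait_for_solr() {
  local url="$1"
  local creds="$2"
  local attempts="${3:-30}"
  local delay="${4:-5}"
  local i=1
  local code
  while [ "$i" -le "$attempts" ]; do
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code
    code="$(curl -s -o /dev/null -w '%{http_code}' -u "$creds" "$url" || true)"
    if [ "$code" = "200" ]; then
      echo "Solr is ready (attempt $i)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Solr did not respond with HTTP 200 after $attempts attempts" >&2
  return 1
}

# Example (not run here; replace ADMIN-PASSWORD as in the deployment command):
# wait_for_solr "http://localhost:8983/solr/admin/info/system" "admin:ADMIN-PASSWORD"
```

A non-zero return value suggests that either the pods are not ready yet or the port forward is not running.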
    

Step 2: Index the data

Once the Apache Solr service is running, the next step is to add data to the Apache Solr index. By default, Apache Solr accepts common document formats such as XML, JSON and CSV. Data can be added to Solr using the command-line post tool, via the Web interface or via a client.

To illustrate the process, this tutorial indexes the World Bank dataset for life expectancy at birth. This dataset is available in various formats under the CC BY 4.0 license. The Bitnami Apache Solr container image is used to execute the indexing process; however, if you already have an Apache Solr environment or the Apache Solr post command-line tool, you can use that instead and skip the Docker commands below.

  • Begin by requesting and downloading the dataset in CSV format:

    cd /tmp
    wget -O dataset.zip "https://api.worldbank.org/v2/en/indicator/SP.DYN.LE00.IN?downloadformat=csv"
    
  • The requested data is packaged as a ZIP archive. Uncompress this archive:

    unzip dataset.zip
    

    This produces a set of output files. One of them is a CSV file named (for example) API_SP.DYN.LE00.IN_DS2_en_csv_v2_2445326.csv. The filename may vary on your system.

  • Inspect the CSV file and note the following unique features of the data set. These features must be taken into account when indexing the data in the file.

    • The first four lines of the file do not contain indexable data.
    • The fifth line contains the field names for the CSV data.
    • The data includes 40+ data points for each country.
  • Use the following Docker commands to create and start a Bitnami Apache Solr container on your host:

    docker create -v "$(pwd)":/app -e SOLR_PORT_NUMBER=8984 -t --net="host" --name mysolr bitnami/solr:latest
    docker start mysolr
    

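    The -v argument to the first command tells Docker to mount the host's current directory into the container's /app path, so that the effects of commands run in the container are seen on the host. The --net="host" parameter tells Docker to use the host's network stack for the container. Because the container shares the host's network stack, the -e SOLR_PORT_NUMBER=8984 argument moves the container's own Apache Solr instance to port 8984, which keeps it from colliding with the Kubernetes service forwarded to host port 8983. The container is named mysolr.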

    Once the container starts, connect to the container console with the command below. This will give you a command shell and allow you to use the Apache Solr tools available in the image for subsequent tasks.

    docker exec -it mysolr /bin/bash
    
  • Use the post command to index the data in the CSV file:

    /opt/bitnami/solr/bin/post -host localhost -port 8983 -c my-collection -user admin:ADMIN-PASSWORD -params "keepEmpty=true&skipLines=5&fieldnames=CountryName,CountryCode,IndicatorName,IndicatorCode,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020," /app/API_SP.DYN.LE00.IN_DS2_en_csv_v2_2445326.csv
    

    The post command takes the following parameters:

    • The -host and -port parameters define the Apache Solr host and port to be used for the index. As described previously, the Apache Solr service in the Kubernetes cluster has been forwarded to port 8983 of the host, and the container has been configured to use the host's network stack.
    • The -user parameter specifies the credentials to use for the Apache Solr service. Replace the ADMIN-PASSWORD placeholder with the administrator account password defined in Step 1.
    • The -c parameter defines the collection which will host the indexed data.
    • The -params parameter contains a list of Solr-specific parameters that control how the data is indexed. In particular, the skipLines parameter tells the indexing tool to start indexing from line 6 of the CSV file (the first four lines contain no indexable data and the fifth line holds the column headers), and the fieldnames parameter manually specifies the field names.
    • The final parameter to the command is the CSV file to index. This file is available in the container at the /app mount point. Remember to update the file name as needed.
    Tip

    Learn more about the post command-line tool.
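
Typing out every year column by hand is error-prone, so the value passed to -params can also be derived from the header row of the CSV file. The sketch below is one possible way to do this, under the assumptions that the header sits on line 5, that header names are quoted and contain no embedded commas, and that removing spaces yields acceptable field names (for example, "Country Name" becomes CountryName). The post_params_from_header function name is hypothetical.

```shell
# Print a -params value for the post tool, taking field names from the
# CSV header row. Usage: post_params_from_header FILE [HEADER_LINE]
# (post_params_from_header is a hypothetical helper for illustration.)
post_params_from_header() {
  local file="$1"
  local header_line="${2:-5}"
  local fields
  # Take the header row and drop quotes, spaces and carriage returns,
  # so that "Country Name" becomes CountryName.
  fields="$(sed -n "${header_line}p" "$file" | tr -d '" \r')"
  [ -n "$fields" ] || { echo "no header found on line $header_line" >&2; return 1; }
  # Skip everything up to and including the header row.
  printf 'keepEmpty=true&skipLines=%s&fieldnames=%s\n' "$header_line" "$fields"
}

# Example (not run here; adjust the file name to match your download):
# post_params_from_header /app/API_SP.DYN.LE00.IN_DS2_en_csv_v2_2445326.csv
```

The printed string can then be passed to the post command as the value of -params. Any trailing comma in the header row is preserved, matching the hand-written field list in the command above.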

    Warning

    It is also possible to begin indexing from line 5 of the CSV file and have the indexing tool automatically identify and set field names based on the column headers in that line. In practice, this failed in some cases: the indexing tool would incorrectly assign a Long integer data type instead of a Double data type to some fields, causing a data type mismatch error. The approach shown above attempts to work around this problem by manually specifying the field names.
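
A different way to avoid type-guessing problems, not used in this article, is to declare the year fields explicitly through Apache Solr's Schema API before indexing. The sketch below only generates the JSON payload for such a request. It assumes that the collection's schema provides a pdouble field type (as Solr's default configset does) and that the year fields do not exist yet; the schema_add_year_fields function name is hypothetical.

```shell
# Print a Schema API payload that adds one double-typed field per year.
# Usage: schema_add_year_fields FIRST_YEAR LAST_YEAR
# (hypothetical helper; assumes the "pdouble" field type is available)
schema_add_year_fields() {
  local year="$1"
  local last="$2"
  local sep=""
  printf '{"add-field":['
  while [ "$year" -le "$last" ]; do
    printf '%s{"name":"%s","type":"pdouble","stored":true}' "$sep" "$year"
    sep=","
    year=$((year + 1))
  done
  printf ']}\n'
}

# Example (not run here): send the payload to the collection's schema endpoint.
# schema_add_year_fields 1960 2020 |
#   curl -u admin:ADMIN-PASSWORD -X POST -H 'Content-type:application/json' \
#     --data-binary @- http://localhost:8983/solr/my-collection/schema
```

With the fields declared up front, Apache Solr no longer has to infer a numeric type from the first value it encounters in each column.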

Once the indexing process is complete, Apache Solr will display a success message, as shown below:

Indexing results
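
To double-check the result from the command line, you can ask Apache Solr how many documents the collection now holds. The following sketch extracts the numFound value from a JSON query response with sed. The solr_num_found function name is hypothetical, and for anything beyond a quick check a real JSON parser such as jq would be more robust.

```shell
# Read a Solr JSON response on stdin and print its numFound value.
# (solr_num_found is a hypothetical helper for illustration.)
solr_num_found() {
  sed -n 's/.*"numFound" *: *\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example (not run here):
# curl -s -u admin:ADMIN-PASSWORD \
#   "http://localhost:8983/solr/my-collection/select?q=*:*&rows=0" | solr_num_found
```

If the file was posted exactly once, the number printed should be close to the number of data rows in the CSV file.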

Step 3: Query the indexed data

Once the data has been indexed, it is possible to execute queries on the collection to answer specific questions. This may be done either via the Apache Solr Web interface or with the curl command-line tool. This section primarily focuses on the curl tool.

Tip

Learn more about Apache Solr query syntax.

A number of queries are possible, but this tutorial restricts itself to two examples. The first is simple: obtain all the life expectancy data for a single country (for example, India) using its country code. Use the following command, replacing the ADMIN-PASSWORD placeholder with the correct password for your Apache Solr deployment.

curl "http://admin:ADMIN-PASSWORD@localhost:8983/solr/my-collection/select?indent=on&q=CountryCode:IND"

Here is an example of what you should see:

Query results
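
When you run several queries, a small wrapper can keep the credentials out of the URL and let curl take care of URL-encoding the query parameters. This is a minimal sketch: the solr_query function and the SOLR_URL, SOLR_USER and SOLR_PASSWORD environment variables are names invented for this example, not part of Apache Solr.

```shell
# Run a select query against a collection.
# Usage: solr_query COLLECTION PARAM=VALUE...
# (solr_query and the SOLR_* variables are hypothetical names.)
solr_query() {
  local collection="$1"
  shift
  local n=$#
  local param
  # Rewrite the argument list so that each PARAM=VALUE is preceded by
  # --data-urlencode, which makes curl URL-encode the value.
  while [ "$n" -gt 0 ]; do
    param="$1"
    shift
    set -- "$@" --data-urlencode "$param"
    n=$((n - 1))
  done
  # -G sends the encoded parameters as a query string on a GET request.
  curl -s -G -u "${SOLR_USER:-admin}:${SOLR_PASSWORD:?set SOLR_PASSWORD}" \
    "${SOLR_URL:-http://localhost:8983}/solr/$collection/select" "$@"
}

# Example (not run here):
# export SOLR_PASSWORD=ADMIN-PASSWORD
# solr_query my-collection "q=CountryCode:IND" "indent=on"
```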

For a slightly more advanced example, try to identify the country with the highest life expectancy in 2018. This is a two-step process, where the first step is to obtain the range of life expectancy values in 2018, as shown below:

curl "http://admin:ADMIN-PASSWORD@localhost:8983/solr/my-collection/select?indent=on&q=*:*&stats=true&stats.field=2018&rows=0"

Here is an example of the output:

Query results

The output shows that the highest life expectancy in 2018 was approximately 84.93 years (the exact value returned is 84.9341463414634). The second step, therefore, is to find the country with this exact value in 2018, as shown below:

curl "http://admin:ADMIN-PASSWORD@localhost:8983/solr/my-collection/select?indent=on&q=2018:84.9341463414634&fl=CountryName"

Here is the output of the query, identifying Hong Kong as the matching country:

Query results
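
The two steps can also be chained in a small script, so that the maximum value does not have to be copied by hand. The sketch below is one possible approach, not a definitive implementation: the highest_in_year function and the SOLR_URL, SOLR_USER and SOLR_PASSWORD variables are hypothetical names, and the sed expressions assume the indented JSON layout that Apache Solr returns with indent=on (a JSON parser such as jq would be more robust).

```shell
# Print the CountryName of the entries with the highest value in a year column.
# Usage: highest_in_year COLLECTION YEAR
# (highest_in_year and the SOLR_* variables are hypothetical names.)
highest_in_year() {
  local collection="$1"
  local year="$2"
  local base="${SOLR_URL:-http://localhost:8983}/solr/$collection/select"
  local creds="${SOLR_USER:-admin}:${SOLR_PASSWORD:?set SOLR_PASSWORD}"
  local max

  # Step 1: ask the stats component for the maximum of the year column.
  max="$(curl -s -G -u "$creds" "$base" \
      --data-urlencode "q=*:*" --data-urlencode "rows=0" \
      --data-urlencode "indent=on" --data-urlencode "stats=true" \
      --data-urlencode "stats.field=$year" |
    sed -n 's/.*"max" *: *\([0-9.][0-9.]*\).*/\1/p' | head -n 1)"
  [ -n "$max" ] || { echo "no maximum found for $year" >&2; return 1; }

  # Step 2: fetch the entries whose value equals that maximum and print
  # the first CountryName value of each matching document.
  curl -s -G -u "$creds" "$base" \
      --data-urlencode "q=$year:$max" --data-urlencode "fl=CountryName" \
      --data-urlencode "indent=on" |
    sed -n 's/.*"CountryName" *: *\[\{0,1\} *"\([^"]*\)".*/\1/p'
}

# Example (not run here):
# export SOLR_PASSWORD=ADMIN-PASSWORD
# highest_in_year my-collection 2018
```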

It is also possible to run the query through the Apache Solr Web interface: browse to http://localhost:8983/solr/, log in and execute the query there. The image below displays the results of the previous query, executed through the Apache Solr Web interface:

Query results in Web interface
