Apache Airflow is an open source workflow management tool used to author, schedule, and monitor ETL pipelines, machine learning workflows, and other processes. To make it easy to deploy a scalable Apache Airflow in production environments, Bitnami provides an Apache Airflow Helm chart composed, by default, of three synchronized nodes: web server, scheduler, and workers. You can add more nodes at deployment time or scale the solution once deployed.
Airflow users create Directed Acyclic Graph (DAG) files to define the processes and tasks that must be executed, in what order, and their relationships and dependencies. DAG files can be loaded into the Airflow chart. This tutorial shows how to deploy the Bitnami Helm chart for Apache Airflow, loading DAG files from a Git repository at deployment time. In addition, you will learn how to add new DAG files to your repository and upgrade the deployment to update your DAGs dashboard. Follow this same process whenever you want to make changes to your DAG files.
Assumptions and prerequisites
This guide assumes that:
- You have a Kubernetes cluster running with Helm (and Tiller if using Helm v2.x) installed.
- You have the kubectl command-line interface (kubectl CLI) installed.
- You have already created custom DAGs and have the source code in a GitHub repository.
If you don't have your DAGs in a GitHub repository yet, you can use this example repository, which contains a selection of the example DAGs referenced in the official Apache Airflow GitHub repository.
To successfully load your custom DAGs into the chart from a GitHub repository, the repository you synchronize with your deployment must contain only DAG files. If you use the example repository provided in this guide, clone or fork the whole repository and then move the contents of the airflow-dag-examples folder to a standalone repository.
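As an illustration, the commands below sketch one way to move the example DAGs into a standalone repository. The EXAMPLES_CHECKOUT and DAGS_REPO_URL variables are hypothetical placeholders for your local clone of the example repository and an empty repository you created for your DAGs:

```shell
# Hypothetical placeholders: set these to your own path and repository URL.
EXAMPLES_CHECKOUT=/path/to/your/examples/checkout
DAGS_REPO_URL=https://github.com/YOUR_USER/YOUR_DAGS_REPO.git

# Create a standalone repository containing only the DAG files.
mkdir my-airflow-dags && cd my-airflow-dags
git init
cp "$EXAMPLES_CHECKOUT"/airflow-dag-examples/* .
git add .
git commit -m "Add DAG files"
git remote add origin "$DAGS_REPO_URL"
git push -u origin master
```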
Step 1: Deploy Apache Airflow and load DAG files
The first step is to deploy Apache Airflow on your Kubernetes cluster using Bitnami's Helm chart. Follow these steps:
First, add the Bitnami charts repository to Helm:
helm repo add bitnami https://charts.bitnami.com/bitnami
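Optionally, refresh the local chart index and confirm that the chart is available (standard Helm v3 commands; the chart version shown in the output will vary):

```shell
# Refresh the local cache of the Bitnami repository.
helm repo update
# Confirm the Airflow chart can be found.
helm search repo bitnami/airflow
```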
Next, execute the following command to deploy Apache Airflow and to get your DAG files from a Git Repository at deployment time. Remember to replace the REPOSITORY_URL placeholder with the URL of the repository where the DAG files are stored.
helm install myrelease bitnami/airflow --set airflow.cloneDagFilesFromGit.enabled=true --set airflow.cloneDagFilesFromGit.repository=REPOSITORY_URL --set airflow.cloneDagFilesFromGit.branch=master --set airflow.baseUrl=http://127.0.0.1:8080
The Apache Airflow chart requires an external URL for the service. You could also configure this after the chart is deployed, by setting a resolvable host and then upgrading the chart, but an upgrade that omits the original parameters risks overwriting the settings you added to load your DAG files. To avoid this, the command above already includes the parameter that configures the URL of your service.
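Alternatively, the same options can be kept in a values file so that later upgrades reuse them without repeating every --set flag. This is a minimal sketch using the same parameters as the command above; the file name values.yaml is a convention, and REPOSITORY_URL must be replaced with your repository URL:

```yaml
# values.yaml (hypothetical file name)
airflow:
  baseUrl: "http://127.0.0.1:8080"
  cloneDagFilesFromGit:
    enabled: true
    repository: "REPOSITORY_URL"
    branch: master
```

You would then deploy with helm install myrelease bitnami/airflow -f values.yaml.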
Wait for a few minutes until the chart is deployed. The "Notes" section of the deployment output contains instructions to obtain the Airflow URL and login credentials. Run the commands shown there.
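For reference, the "Notes" output for this chart typically includes commands similar to the following. This is a sketch that assumes the release name myrelease in the default namespace and the chart's default secret and service names; verify the exact commands in your own deployment's notes:

```shell
# Assumption: release "myrelease" deployed in the "default" namespace.
# Retrieve the auto-generated password (the default username is "user"):
export AIRFLOW_PASSWORD=$(kubectl get secret --namespace default myrelease-airflow \
  -o jsonpath="{.data.airflow-password}" | base64 --decode)
echo "Password: $AIRFLOW_PASSWORD"

# Forward the web server port so that http://127.0.0.1:8080 resolves locally:
kubectl port-forward --namespace default svc/myrelease-airflow 8080:8080
```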
- Browse to http://127.0.0.1:8080/ and log in with the credentials obtained in the step above. You should see the list of imported DAGs in the dashboard:
There are three different ways to load your custom DAG files into the Airflow chart. Check out the Bitnami Apache Airflow Helm chart documentation to learn about the alternative ways to load your DAG files.
Step 2: Trigger, monitor, and check DAG runs
Once you access the Airflow home page, triggering a DAG is as easy as clicking the "Play" icon.
- To monitor your scheduler process, click on one of the circles in the "DAG Runs" section. You will be redirected to a screen with the details of the run:
- Click the "Dag Id" to check the pipeline process in the graph view:
- From this view, you can navigate through different tabs to see the tree view of the pipeline process, task duration, DAG details, DAG code, and so on.
Step 3: Update the source repository and synchronize it with your Airflow deployment
To check that your GitHub repository and your Airflow deployment are correctly synchronized, make a change to any of the DAG files and upgrade the deployment for the change to take effect. Follow these steps:
Make a change to your DAG files or repository. In this example, two new DAG files have been added to the repository.
Upgrade your deployment by executing the helm upgrade command, passing again all the parameters used in the initial deployment, including the URL of your service:
helm upgrade myrelease bitnami/airflow --set airflow.cloneDagFilesFromGit.enabled=true --set airflow.cloneDagFilesFromGit.repository=REPOSITORY_URL --set airflow.cloneDagFilesFromGit.branch=master --set airflow.baseUrl=http://127.0.0.1:8080
Return to the browser and refresh the application tab. You will see the two new DAGs in the DAG dashboard:
- Click the "Play" icon in any of the new DAGs and check its pipeline process as explained in Step 2.