Airflow on Kubernetes is quite popular, isn’t it? There is a good chance that you know Kubernetes, that you even have a Kubernetes cluster, and that you would like to deploy and run Airflow on it. However, Kubernetes is hard. There are so many things to deal with that just deploying an application can be really laborious. Fortunately for us, some super smart people created Helm. Helm allows you to deploy and configure Helm charts (applications) on Kubernetes. A Helm chart is a collection of Kubernetes YAML manifests describing every component of your application. That, my friend, makes your deployments easier than ever before. And guess what? Apache Airflow just got its own official Helm chart! In today’s tutorial, you are going to discover how to run Airflow on Kubernetes within just 10 minutes, thanks to the official Helm chart. Let’s get started!
Oops, I almost forgot: if you want to create powerful data pipelines with Airflow, go check out my courses here.
In this very hands-on tutorial, you are going to create a local multi-node Kubernetes cluster to deploy and run Airflow on. Therefore, there are some prerequisites before jumping into the practical part.
First, you need to install Docker and Docker Compose. Why? Because the Kubernetes cluster will be created in Docker. You are going to have multiple Docker containers where each container will represent a Kubernetes node. Now you may ask: “How can I create a Kubernetes cluster with Docker?” Well, say hello to KinD!
KinD is the second tool you need, as it allows you to set up and run a Kubernetes cluster using Docker container “nodes”. It was primarily designed for testing Kubernetes itself, but it is perfect as a quick local development environment to experiment with applications that run on top of Kubernetes. If you know Minikube, KinD is pretty similar now (it wasn’t the case a few years ago).
In addition to Docker and KinD, you obviously need to install Helm as well as Kubectl. As you know, Helm allows you to deploy applications on Kubernetes, whereas Kubectl allows you to run commands against your Kubernetes cluster. Without Kubectl, you won’t be able to get the logs of your pods, debug your errors, or check your nodes.
That’s it. To run Airflow on Kubernetes you need 5 tools: Docker, Docker Compose, KinD, Helm and Kubectl. Once you’re done, you’re ready to go!
If you want to follow this very hands-on beautiful tutorial, check out the repository here where you will find all the materials needed.
Create a Kubernetes Cluster with KinD
Before deploying Airflow on Kubernetes, the first step is to create and configure the local Kubernetes cluster with KinD. Fortunately, it’s a pretty easy task. If you take a look at the materials and open the file kind-cluster.yaml, you will find the following content:
```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "node=worker_1"
    extraMounts:
      - hostPath: ./data
        containerPath: /tmp/data
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "node=worker_2"
    extraMounts:
      - hostPath: ./data
        containerPath: /tmp/data
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "node=worker_3"
    extraMounts:
      - hostPath: ./data
        containerPath: /tmp/data
```
This file describes your local Kubernetes cluster. With it, you are going to create 4 nodes: 1 control plane and 3 workers. Notice that each worker node has a label attached to it. That’s useful if you want to run your tasks on a specific node based on its label. For example, with the KubernetesExecutor you could define a nodeSelector in the executor_config argument of your operator and run the task on a specific node 😉
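As a minimal sketch, here is what such a node pin could look like. The pod override is shown as a plain dict for readability; in a real DAG with the KubernetesExecutor you would build it with kubernetes.client.models.V1Pod, and the helper name below is hypothetical:

```python
# Hypothetical sketch: pin an Airflow task to the KinD node labeled "node=worker_1".
# With the KubernetesExecutor, you pass a pod override through the task's
# executor_config. Here a plain dict stands in for the real V1Pod object.
def pod_override_for(node_label: str) -> dict:
    """Build a minimal pod spec targeting a node by the label set in kind-cluster.yaml."""
    return {
        "spec": {
            # Matches the kubeadm node-labels defined above, e.g. node=worker_1.
            "nodeSelector": {"node": node_label},
        }
    }

# What you would pass to executor_config=... on an operator.
executor_config = {"pod_override": pod_override_for("worker_1")}
print(executor_config["pod_override"]["spec"]["nodeSelector"])
```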
In addition, each worker node has the extraMounts field defined. Why? It allows you to mount a hostPath persistent volume to store the logs of your tasks from your pods on your machine. Without this, you wouldn’t be able to access your logs once your tasks complete in Airflow. That means, next to kind-cluster.yaml, you should create a folder data; the logs of your tasks will be stored under /tmp/data within your Kubernetes cluster.
Ok, now execute the following command to create your Kubernetes cluster:

```shell
kind create cluster --name airflow-cluster --config kind-cluster.yaml
```
You can do some checks:

```shell
kubectl cluster-info
kubectl get nodes -o wide
```
Once your local Kubernetes cluster is running, you are ready to deploy Airflow on it.
Airflow on Kubernetes
To deploy Airflow on Kubernetes, the first step is to create a namespace:

```shell
kubectl create namespace airflow
kubectl get namespaces
```
Then, thanks to Helm, you need to fetch the official Helm chart of Apache Airflow, which will magically get deployed on your cluster. Or almost magically 😅
```shell
helm repo add apache-airflow https://airflow.apache.org
helm repo update
helm search repo airflow
helm install airflow apache-airflow/airflow --namespace airflow --debug
```
In the order of the commands above: first, you add the official repository of the Apache Airflow Helm chart. Then you update the repo to make sure you have the latest version of it. You can take a look at the current version with helm search repo. Finally, you deploy Airflow on Kubernetes with helm install. The application gets the name airflow, and the flag --debug lets you check if anything goes wrong during the deployment.
After a few minutes, you should be able to see your Pods running, corresponding to the different Airflow components.
```shell
kubectl get pods -n airflow
```
Don’t hesitate to check the current Helm release of your application with:

```shell
helm ls -n airflow
```
Basically, each time you deploy a new version of your Airflow chart (after a modification or an update), you obtain a new release. One of the most important fields to look at is REVISION. This number increases with every deployment; if you make a mistake, you can roll back to a previous revision with helm rollback.
Ok, at this point you have successfully deployed Airflow on Kubernetes 😎
To access the Airflow UI, open a new terminal and execute the following command:

```shell
kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow --context kind-airflow-cluster
```
Configure Airflow on Kubernetes
It’s great to have an Airflow instance running on Kubernetes, but it would be better to configure it for your needs, right? Let’s do it!
To configure an application that has been deployed with Helm, you need to modify values.yaml. This file describes all the configuration settings of your application such as the Airflow version to deploy, the executor to use, persistence volume mounts, secrets, environment variables and so on.
To get this file (remove the existing one), execute the following command:

```shell
helm show values apache-airflow/airflow > values.yaml
```
As you can see, you get a beautiful, large values.yaml file. Open it and change the version of Airflow so that you get the latest version deployed. At the time of this tutorial, the latest version is 2.1.1. Therefore, in values.yaml, modify the following settings:
```yaml
defaultAirflowTag: "2.1.1"
airflowVersion: "2.1.1"
```
In addition, you can set the executor to KubernetesExecutor, as the default executor is the CeleryExecutor. (By the way, I made a tutorial about the KubernetesExecutor here; it’s been a while, so just read the first two paragraphs 😉)
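In values.yaml, that should be a one-line change (the executor key of the chart):

```yaml
# In values.yaml -- switch from the default CeleryExecutor
executor: "KubernetesExecutor"
```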
Also, if you have some variables or connections that you want to export each time your Airflow instance gets deployed, you can define a ConfigMap. Open variables.yaml: this ConfigMap exports the environment variables listed under data. Great for bootstrapping connections and variables.
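For reference, such a ConfigMap might look something like this (the variable and connection values are purely illustrative; Airflow reads AIRFLOW_VAR_* and AIRFLOW_CONN_* environment variables as Variables and Connections):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-variables
  namespace: airflow
data:
  # AIRFLOW_VAR_<NAME> defines an Airflow Variable named <name>.
  AIRFLOW_VAR_MY_S3_BUCKET: "my-bucket"
  # AIRFLOW_CONN_<ID> defines an Airflow Connection with conn_id <id>, as a URI.
  AIRFLOW_CONN_MY_POSTGRES: "postgres://user:pass@host:5432/db"
```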
To add it to your Airflow deployments, modify extraEnvFrom in values.yaml:

```yaml
extraEnvFrom: |
  - configMapRef:
      name: 'airflow-variables'
```
Then add the ConfigMap to the cluster:

```shell
kubectl apply -f variables.yaml
```
And finally, deploy Airflow on Kubernetes again:

```shell
helm ls -n airflow
helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug
helm ls -n airflow
```
This time, you pass the file values.yaml to the command helm upgrade, and you check the Airflow release before and after the upgrade to make sure it gets correctly deployed.
Install dependencies with Airflow on Kubernetes
Let’s imagine one of your DAGs interacts with Spark. The problem is, the Spark provider is not installed in your Airflow instance by default. So how can you install it? What is the best way to install any provider in Airflow on Kubernetes?
The answer is… build your own custom Docker image!
Create a file requirements.txt and put the following line in it:

```
apache-airflow-providers-apache-spark==2.0.0
```
Then, create a Dockerfile with the following content:

```dockerfile
FROM apache/airflow:2.1.1
COPY requirements.txt .
RUN pip install -r requirements.txt
```
Build the custom Docker image and load it into your local Kubernetes cluster:

```shell
docker build -t airflow-custom:1.0.0 .
kind load docker-image airflow-custom:1.0.0 --name airflow-cluster
```
Notice that your custom Docker image of Airflow is based on the official 2.1.1 Docker image. It’s better to bake your requirements into a custom Docker image than to mount the requirements.txt file in each pod executing your tasks. Why? Because with the mounted file, the dependencies get installed every time a task runs, whereas with the custom Docker image they are already installed. You save both time and resources.
Modify the file values.yaml:

```yaml
defaultAirflowRepository: airflow-custom
defaultAirflowTag: "1.0.0"
```
And upgrade the chart:

```shell
helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug
helm ls -n airflow
```
You can check the installed providers with the following command:

```shell
kubectl exec <webserver_pod_id> -n airflow -- airflow providers list
```
Deploy your DAGs on Kubernetes with GitSync
There are different ways of deploying your DAGs in an Airflow instance running on Kubernetes but let’s focus on GitSync here.
GitSync acts as a sidecar container that runs along with your pods to synchronize the dags/ folder (in the pods) with the Git repository where your DAGs are stored.
Let’s say you have the following repository. To deploy the DAGs within your Kubernetes cluster you need a few things:
Generate a private key with ssh-keygen
Modify the file values.yaml as follows:

```yaml
gitSync:
  enabled: true
  repo: ssh://git@github.com/marclamberti/airflow-2-dags.git
  branch: main
  rev: HEAD
  root: "/git"
  dest: "repo"
  depth: 1
  subPath: ""
  sshKeySecret: airflow-ssh-git-secret
```
Create a secret with your private key in it. Notice that with Kubectl, the value is automatically encoded in Base64:

```shell
kubectl create secret generic airflow-ssh-git-secret --from-file=gitSshKey=/Your/path/.ssh/id_rsa -n airflow
```
Check your secret:

```shell
kubectl get secrets -n airflow
```
Deploy the public key on the Git repository (Settings -> Deploy Key)
Finally, upgrade your Airflow instance:

```shell
helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug
```
Now, if you wait up to 5 mins and refresh the page, you should be able to see the DAG 😁
Logs with Airflow on Kubernetes
With the KubernetesExecutor, when a task is triggered, a pod is created. Once the task completes, the corresponding pod gets deleted, and so do the logs.
Therefore, you need to find a way to store your logs somewhere so that you can still access them. For local development, the easiest way is to configure a HostPath PV. Let’s do it!
First things first: you should already have a folder data/ created next to the file kind-cluster.yaml.
Next, to provide a durable location to prevent data from being lost, you have to set up the Persistent Volume.
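The materials contain pv.yaml and pvc.yaml for this. As an indication of what they could look like (a minimal sketch; only the claim name airflow-logs is fixed by the values.yaml setting used below, the rest is illustrative):

```yaml
# pv.yaml -- a hostPath PersistentVolume pointing at the KinD extraMounts path
apiVersion: v1
kind: PersistentVolume
metadata:
  name: airflow-logs
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  hostPath:
    path: /tmp/data
---
# pvc.yaml -- claims the PV above so Airflow can write its logs to it
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-logs
  namespace: airflow
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 1Gi
  volumeName: airflow-logs
```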
```shell
kubectl apply -f pv.yaml
kubectl get pv -n airflow
```
Then, you create a Persistent Volume Claim to bind the PV with Airflow:

```shell
kubectl apply -f pvc.yaml
kubectl get pvc -n airflow
```
Finally, update the values.yaml:

```yaml
logs:
  persistence:
    enabled: true
    existingClaim: airflow-logs
```
and redeploy your Airflow instance:

```shell
helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug
```
If you run your DAGs and click on your tasks, you will be able to access the logs. Well done 😁
If you want to see how to quick start with the official Helm chart, check out the video below
In this tutorial, you have successfully deployed Airflow on a multi-node Kubernetes cluster. Even better, you now have a real local development environment where you can experiment and test without destroying your production environment 😁
PS: If you want to deep dive in Apache Airflow, check out my courses here