Airflow on Kubernetes: Get started in 10 mins

Airflow on Kubernetes is quite popular, isn’t it? There is a good chance that you know Kubernetes, that you even have a Kubernetes cluster, and that you would like to deploy and run Airflow on it. However, Kubernetes is hard. There are so many things to deal with that just deploying an application can be really laborious. Fortunately for us, some super smart people have created Helm. Helm allows you to deploy and configure Helm charts (applications) on Kubernetes. A Helm chart is a collection of Kubernetes YAML manifests describing every component of your application. That, my friend, makes your deployments much easier than ever before. And guess what? Apache Airflow just got its own official Helm chart! In today’s tutorial, you are going to discover how to run Airflow on Kubernetes within just 10 mins thanks to the official Helm chart. Let’s get started!

Oops, I almost forgot. If you want to create powerful data pipelines with Airflow, go check out my courses here.

Requirements

In this very hands-on tutorial, you are going to create a local multi-node Kubernetes cluster, then deploy and run Airflow on it. Therefore, there are some prerequisites before jumping into the practical part.

First, you need to install Docker and Docker Compose. Why? Because the Kubernetes cluster will be created in Docker. You are going to have multiple Docker containers where each container will represent a Kubernetes node. Now you may ask: “How can I create a Kubernetes cluster with Docker?” Well, say hello to KinD!

KinD is the second tool you need, as it allows you to set up and run a Kubernetes cluster using Docker container “nodes”. It was primarily designed for testing Kubernetes itself, but it is perfect for a quick local development environment to experiment with applications that run on top of Kubernetes. If you know Minikube, KinD is pretty similar now (it wasn’t the case a few years ago).

In addition to Docker and KinD, you obviously need to install Helm as well as kubectl. As you know, Helm allows you to deploy applications on Kubernetes, whereas kubectl allows you to run commands against your Kubernetes cluster. Without kubectl, you won’t be able to get the logs of your Pods, debug your errors or check your nodes.

That’s it. To run Airflow on Kubernetes you need 5 tools: Docker, Docker Compose, KinD, Helm and kubectl. Once they are installed, you’re ready to go!

Materials

If you want to follow along with this very hands-on tutorial, check out the repository here, where you will find all the materials needed.

Create a Kubernetes Cluster with KinD

Before deploying Airflow on Kubernetes, the first step is to create and configure the local Kubernetes cluster with KinD. Fortunately, it’s a pretty easy task. If you take a look at the materials and open the file kind-cluster.yaml, you will see the following content:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "node=worker_1"
  extraMounts:
    - hostPath: ./data
      containerPath: /tmp/data
- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "node=worker_2"
  extraMounts:
    - hostPath: ./data
      containerPath: /tmp/data
- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "node=worker_3"
  extraMounts:
    - hostPath: ./data
      containerPath: /tmp/data

This file describes your local Kubernetes cluster. With it, you are going to create 4 nodes: 1 control plane and 3 workers. Notice that each worker node has a label attached to it. It’s useful if you want to run your tasks on a specific node based on the label of that node. For example, with the KubernetesExecutor you could define a nodeSelector in the executor_config argument of your Operator and run the task on a specific node 😉
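To make the nodeSelector idea concrete, here is a minimal sketch that writes a small DAG file pinning one task to the node labelled node=worker_1. It assumes Airflow 2.x with the KubernetesExecutor, where executor_config accepts a pod_override built with the Kubernetes Python client. The file path, DAG id, task id and function name are all made up for this example.

```shell
# Write a hypothetical DAG that pins a task to the node labelled node=worker_1.
# Assumption: Airflow 2.x with the KubernetesExecutor (pod_override support).
mkdir -p dags
cat > dags/node_selector_example.py <<'EOF'
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s


def say_hello():
    print("hello from worker_1")


with DAG(
    dag_id="node_selector_example",   # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,           # triggered manually
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_on_worker_1",    # hypothetical task id
        python_callable=say_hello,
        executor_config={
            # The container must be named "base" so that the override is
            # merged with the worker pod generated by the executor.
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[k8s.V1Container(name="base")],
                    node_selector={"node": "worker_1"},
                )
            )
        },
    )
EOF
```

The node_selector value has to match the node-labels entry of one of the workers in kind-cluster.yaml, otherwise the pod stays pending.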

In addition, each worker node has the extraMounts field defined. Why? It mounts the folder ./data of your machine at /tmp/data inside the node, so that a hostPath persistent volume can store the logs of your tasks from your pods on your machine. Without this, you wouldn’t be able to access your logs once your tasks get completed in Airflow. That means, next to your kind-cluster.yaml, you should create a folder data, and the logs of your tasks will be stored in /tmp/data within your Kubernetes nodes.

Ok, now execute the following command to create your Kubernetes cluster:

kind create cluster --name airflow-cluster --config kind-cluster.yaml

You can run a few checks:

kubectl cluster-info
kubectl get nodes -o wide

Once your local Kubernetes cluster is running, you are ready to deploy Airflow on it.

Airflow on Kubernetes

To deploy Airflow on Kubernetes, the first step is to create a namespace.

kubectl create namespace airflow
kubectl get namespaces

Then, thanks to Helm, you need to fetch the official Helm chart of Apache Airflow that will magically get deployed on your cluster. Or, almost magically 😅

helm repo add apache-airflow https://airflow.apache.org
helm repo update
helm search repo airflow
helm install airflow apache-airflow/airflow --namespace airflow --debug

In the order of the commands above, you first add the official repository of the Apache Airflow Helm chart. Then you update the repo to make sure you get the latest version of it. You can take a look at the current version with helm search repo. Finally, you deploy Airflow on Kubernetes with helm install. The release gets the name airflow, and the flag --debug prints verbose output so you can check whether anything goes wrong during the deployment.

After a few minutes, you should be able to see your Pods running, corresponding to the different Airflow components.

kubectl get pods -n airflow

Don’t hesitate to check the current Helm release of your application with

helm ls -n airflow

Basically, each time you deploy a new version of your Airflow chart (after a modification or an update), you obtain a new revision of the release. One of the most important fields to look at is REVISION. This number increases with each deployment, and if you made a mistake you can roll back to a previous revision with helm rollback.
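As a rough sketch of such a rollback, the helper below prints the history of the release and then rolls back to the revision you pass in. The function name rollback_airflow is made up for this example; the release name airflow and the namespace airflow are the ones used in this tutorial.

```shell
# Hypothetical helper: show the release history, then roll back to a revision.
# Usage: rollback_airflow <revision>   (defaults to revision 1)
rollback_airflow() {
  revision="${1:-1}"
  # List all revisions of the release so you can see what you are going back to
  helm history airflow -n airflow
  # Roll the release back to the requested revision
  helm rollback airflow "$revision" -n airflow
}
```

For example, rollback_airflow 1 goes back to the very first deployment. Note that a rollback itself creates a new revision, so the REVISION number keeps increasing.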

Ok, at this point you have successfully deployed Airflow on Kubernetes as shown below 😎

Figure: Airflow running on your Kubernetes cluster

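To access the Airflow UI at http://localhost:8080, open a new terminal and execute the following command (keep it running while you use the UI; with the chart’s default values, the login is admin / admin):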

kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow --context kind-airflow-cluster

Configure Airflow on Kubernetes

It’s great to have an Airflow instance running on Kubernetes, but it would be better to configure it for your needs, right? Let’s do it!

To configure an application that has been deployed with Helm, you need to modify values.yaml. This file describes all the configuration settings of your application such as the Airflow version to deploy, the executor to use, persistence volume mounts, secrets, environment variables and so on.

To get this file (remove the existing one first, if any), execute the following command:

helm show values apache-airflow/airflow > values.yaml

As you can see, you get a beautiful, large values.yaml file. Open it and change the version of Airflow so that you get the latest version deployed. At the time of this tutorial, the latest version is 2.1.1. Therefore, in values.yaml modify the following settings:

defaultAirflowTag: "2.1.1"
airflowVersion: "2.1.1"

In addition, you can set the executor to KubernetesExecutor, as the default executor is the CeleryExecutor. (By the way, I made a tutorial about the KubernetesExecutor here; it’s been a while, so just read the first two paragraphs 😉)

executor: "KubernetesExecutor"

Also, if you have some variables or connections that you want to export each time your Airflow instance gets deployed, you can define a ConfigMap. Open variables.yaml. This ConfigMap exports the key/value pairs under data as environment variables, which is a handy way to bootstrap connections and variables.
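The repository ships its own variables.yaml. If you don’t have it at hand, a minimal sketch could look like the one below. The ConfigMap name airflow-variables is the one referenced by extraEnvFrom and the namespace is the one created earlier, but the two entries under data are made-up examples. They rely on Airflow reading environment variables prefixed with AIRFLOW_VAR_ as Variables and AIRFLOW_CONN_ as connection URIs.

```shell
# Write a hypothetical variables.yaml; the entries under "data" are illustrative only.
cat > variables.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-variables
  namespace: airflow
data:
  # Exposed in Airflow as the Variable "my_s3_bucket" (example value)
  AIRFLOW_VAR_MY_S3_BUCKET: "my-example-bucket"
  # Exposed in Airflow as the connection "my_postgres" (example URI)
  AIRFLOW_CONN_MY_POSTGRES: "postgres://user:password@my-host:5432/mydb"
EOF
```

Keep in mind that a ConfigMap is stored in plain text, so a real password would rather belong in a Kubernetes Secret.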

To add it to your Airflow deployments, in values.yaml modify extraEnvFrom (it expects a list, hence the leading dash):

extraEnvFrom: |
  - configMapRef:
      name: 'airflow-variables'

Then add the ConfigMap to the cluster:

kubectl apply -f variables.yaml

And finally, deploy Airflow on Kubernetes again.

helm ls -n airflow 
helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug 
helm ls -n airflow 

This time, you pass the file values.yaml to the command helm upgrade and check the Airflow release before and after the upgrade to make sure it is deployed correctly.

Install dependencies with Airflow on Kubernetes

Let’s imagine one of your DAGs interacts with Spark. The problem is, the Spark provider is not installed in your Airflow instance by default. So how can you install this provider? What is the best way to install any provider in Airflow on Kubernetes?

The answer is… Build your own custom Docker image!

Create a file requirements.txt and put the following line in it:

apache-airflow-providers-apache-spark==2.0.0

Then, create a Dockerfile with the following content

FROM apache/airflow:2.1.1
COPY requirements.txt .
RUN pip install -r requirements.txt

Build the custom Docker image and load it into your local Kubernetes cluster:

docker build -t airflow-custom:1.0.0 .
kind load docker-image airflow-custom:1.0.0 --name airflow-cluster 

Notice that your custom Docker image of Airflow is based on version 2.1.1 of the official Docker image. It’s better to put your requirements into a custom Docker image than to mount the requirements.txt file in each pod executing your task. Why? Because otherwise the dependencies would be installed each time a task gets executed, whereas with the custom Docker image they are already installed. You save both time and resources.

Modify the file values.yaml:

defaultAirflowRepository: airflow-custom
defaultAirflowTag: "1.0.0"

And upgrade the chart

helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug
helm ls -n airflow 

You can check the providers with the following command

kubectl exec <webserver_pod_id> -n airflow -- airflow providers list

Deploy your DAGs on Kubernetes with GitSync

There are different ways of deploying your DAGs in an Airflow instance running on Kubernetes but let’s focus on GitSync here.

GitSync acts as a sidecar container that runs alongside your Pods to synchronise the folder dags/ (in the Pods) with the Git repository where your DAGs are stored.

Let’s say you have the following repository. To deploy the DAGs within your Kubernetes cluster you need a few things:

Generate a private key with ssh-keygen
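For instance, assuming you want a dedicated key pair rather than reusing your personal one, something like the following would do. The file name airflow_gitsync_key and the comment string are arbitrary choices for this example, and the empty passphrase is an assumption made so that the key can be used non-interactively.

```shell
# Generate a dedicated RSA key pair in the current directory.
# -N "" sets an empty passphrase (assumption: the key is used without a prompt)
# -f    sets the output file (hypothetical name), -C adds a free-form comment
ssh-keygen -q -t rsa -b 4096 -C "airflow-git-sync" -N "" -f ./airflow_gitsync_key
```

The private key (airflow_gitsync_key) is what goes into the Kubernetes secret, and the public key (airflow_gitsync_key.pub) is what you add to the Git repository.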

Modify the gitSync section (under dags) of the file values.yaml as follows:

gitSync:
  enabled: true
  repo: ssh://git@github.com/marclamberti/airflow-2-dags.git
  branch: main
  rev: HEAD
  root: "/git"
  dest: "repo"
  depth: 1
  subPath: ""
  sshKeySecret: airflow-ssh-git-secret

Create a secret with your private key in it. Notice that with kubectl, the value is automatically encoded in Base64:

kubectl create secret generic airflow-ssh-git-secret --from-file=gitSshKey=/Your/path/.ssh/id_rsa -n airflow

Check your secret

kubectl get secrets -n airflow

Deploy the public key on the Git repository (Settings -> Deploy keys)

Finally, upgrade your Airflow instance

helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug

Now, if you wait up to 5 mins and refresh the page, you should be able to see the DAG 😁

Logs with Airflow on Kubernetes

With the KubernetesExecutor, a Pod is created when a task is triggered. Once the task is completed, the corresponding Pod gets deleted, and so do its logs.

Therefore, you need to find a way to store your logs somewhere so that you can still access them. For local development, the easiest way is to configure a HostPath PV. Let’s do it!

First things first, you should already have a folder data/ created next to the file kind-cluster.yaml.

Next, to provide a durable location to prevent data from being lost, you have to set up the Persistent Volume.

kubectl apply -f pv.yaml
kubectl get pv -n airflow

Then, you create a Persistent Volume Claim so that you bind the PV with Airflow.

kubectl apply -f pvc.yaml 
kubectl get pvc -n airflow
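The files pv.yaml and pvc.yaml come from the repository. If you need to write them yourself, here is a minimal sketch of what they could contain. Only the claim name airflow-logs (referenced in values.yaml) and the path /tmp/data (the containerPath of the KinD extraMounts) come from this tutorial; the PV name, the storage class manual, the 1Gi size and the ReadWriteMany access mode are assumptions.

```shell
# Hypothetical pv.yaml: a hostPath volume backed by /tmp/data on the KinD nodes.
cat > pv.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: airflow-logs          # assumed PV name
spec:
  storageClassName: manual    # assumed class; must match the claim below
  capacity:
    storage: 1Gi              # assumed size
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /tmp/data           # containerPath from kind-cluster.yaml
EOF

# Hypothetical pvc.yaml: the claim that the Helm chart uses for the task logs.
cat > pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-logs          # referenced by logs.persistence.existingClaim
  namespace: airflow
spec:
  storageClassName: manual    # same class as the PV so that the two get bound
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
EOF
```

Setting the same explicit storageClassName on both sides is a design choice: it keeps the claim from being served by the cluster’s default dynamic provisioner instead of your hostPath volume.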

Finally, update the values.yaml

logs:
  persistence:
    enabled: true
    existingClaim: airflow-logs

and redeploy your Airflow instance

helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug

If you run your DAGs and click on your tasks, you will be able to access the logs. Well done 😁

Conclusion

In this tutorial, you have successfully deployed Airflow on a multi-node Kubernetes cluster. Even better, you have a real local development environment where you can experiment and test without destroying your production environment 😁

PS: If you want to dive deeper into Apache Airflow, check out my courses here
