Airflow API: The guide to get started now!

The new Airflow API is here! After years of waiting impatiently, the Airflow API is finally stable and reliable enough to be used in production! I’m pretty sure you remember the Experimental API and its lack of endpoints and documentation. In fact, as it was “experimental”, you shouldn’t even have used it in production. Well, this comes to an end! Airflow 2.0 brings the new stable REST API and I can’t tell you how amazing this is. Now you are able to manage your connections, list your DAGs and even get a simplified representation of them, trigger a new DAG run, access your logs, XComs and so on. You can fully interact with your Airflow instance from other tools and even build your own application on top of it. So, without further ado, let’s jump into the new Airflow REST API!

PS: if you are new to Airflow, check my course here, you will get it with a special discount and learn all you need to master Airflow!

The Airflow Experimental REST API

As a picture is worth a thousand words, here is a pretty self-explanatory one:

Airflow Experimental REST API reference

I think we can agree that this is not the best documentation for an API. It is pretty messy, the endpoints are very limited, and for some of them you don’t even know exactly what you will get in return. I’m not blaming anyone here, building an API is long and hard work, but you definitely didn’t want to use it in production. I’m not going to spend any more time on it, let’s jump right into the Airflow STABLE REST API! 😍

The Airflow Stable REST API

First things first, I’m really surprised by the number of people who don’t know that the documentation of the new API is available on the Airflow website. This is the first thing you have to take a look at and here is the link: doc

The Stable REST API meets the following requirements:

  • Easy to maintain: most operations are CRUD and the endpoints are really similar to each other
  • Trustworthy: you can safely make requests to the API and get responses as defined in the API schema
  • Extensible and stable: the API won’t change at every Airflow version, you can rely on it
  • Secure: you can only request the API if you have the required permissions

Ultimately, the API allows you to perform most of the operations that a typical user can do through the UI, the experimental API and the CLI. Administrative actions are left out: for example, you can’t change the configuration of Airflow from the API.

To sum up, this new version makes for easy access by third-party tools and follows an industry standard interface with OpenAPI 3. Most of the endpoints are CRUD (Create, read, update, delete) operations to fully control Airflow resources. It is well documented thanks to a beautiful Swagger interface and includes different ways to authorise clients.

All right, enough theory, let’s move on to practice 🤓

Get started with the Airflow REST API

At this point, I strongly advise you to try what I’m going to show you on your computer. You need to have Docker and Docker Compose installed. Then, check the video I made right there, you will learn how to set up Airflow with Docker in only 5 minutes.

To get started with the API, you have to know a very important configuration setting called auth_backend, found in the [api] section of the Airflow configuration. It defines how to authenticate users of the API. By default, the value is set to “airflow.api.auth.backend.deny_all”, which means that all requests are denied. Notice that if you are running an Airflow version lower than 1.10.11, you might have security issues, as all API requests were allowed without authentication before 1.10.11. (By the way, you cannot disable authentication with the stable REST API: airflow.api.auth.backend.default DOESN’T WORK, and that’s great actually.)

Airflow brings you different backends to authenticate with the API:

  • Kerberos authentication. If you use Kerberos, you can set your backend value to “airflow.api.auth.backend.kerberos_auth”. There are other parameters you will need to define as described here.
  • Basic username and password authentication. The great thing is that it works with users created either through LDAP or within the Airflow DB. Set auth_backend to “airflow.api.auth.backend.basic_auth”. The username and password must be base64 encoded and sent through the HTTP header. You are going to see that in a minute.
  • Your own API authentication. Yes, you can create your own backend! That’s the beauty of Airflow, you can customize it as much as you need. If you want to learn more about this, check the documentation here.
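To make the base64 part of the basic_auth backend concrete, here is a minimal Python sketch of how the Authorization header is built. The function name basic_auth_header is my own, and the airflow / airflow credentials are only assumed values for a local test instance. curl does exactly this for you when you pass --user.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the HTTP Authorization header used by basic authentication."""
    # Basic auth joins the credentials with a colon, then base64-encodes the result.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Assumed credentials for a local test instance, not a recommendation.
headers = basic_auth_header("airflow", "airflow")
print(headers)
```

Keep in mind that base64 is an encoding, not encryption, so use HTTPS as soon as you leave your local machine.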

That being said, I strongly advise you to take a look at the source code of these different authentication backends. That will help you to truly understand how they work. You can find the code here.

Send your first request

First, make sure your auth_backend setting is set to “airflow.api.auth.backend.basic_auth”. By default, the official docker-compose file of Airflow creates a user airflow with the password airflow, which is what the request below uses. Once Airflow is up and running, you can execute the following request with curl:

curl --verbose 'http://localhost:8080/api/v1/dags' -H 'content-type: application/json' --user "airflow:airflow"

This request lists all your DAGs and you should obtain a JSON output similar to the one shown below:

{
  "dags": [
    {
      "dag_id": "example_bash_operator",
      "description": null,
      "file_token": ".eJw9yjEOgCAMAMC_sEsHE79DClYgFttQEuX3boyXnIMijQBrv1he8CwJGbhG0DmKPLs_wOqgTTHdmMlWpQ-bMoUTsy1EtBJEqeOQ7nW6H6ZgI_w.PjpiEYEF1Ph3O5aZCIvOz5qAOME",
      "fileloc": "/home/airflow/.local/lib/python3.6/site-packages/airflow/example_dags/example_bash_operator.py",
      "is_paused": true,
      "is_subdag": false,
      "owners": [
        "airflow"
      ],
      "root_dag_id": null,
      "schedule_interval": {
        "__type": "CronExpression",
        "value": "0 0 * * *"
      },
      "tags": [
        {
          "name": "example"
        },
        {
          "name": "example2"
        }
      ]
    },

Notice that you can limit the number of items to return by specifying limit in the request. By default, that value is set to 100, and the upper bound is controlled in the configuration settings of Airflow through maximum_page_limit. The goal of this parameter is to protect against requests that may put too much load on the web server and cause instability. The right value depends on how many requests you make and the kind of requests you make. There is no hard value here, it is based on trial and error as well as your use cases.
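If you want to page through your DAGs, a small sketch like the one below can help. It only builds the request and doesn’t send it. The helper name list_dags_request is mine, the base URL is the local instance used in this guide, and I’m assuming the offset query parameter described in the API reference next to limit. The Authorization header is left out to keep the sketch short.

```python
import urllib.parse
import urllib.request

# Same local instance as in the curl examples of this guide.
BASE_URL = "http://localhost:8080/api/v1"


def list_dags_request(limit: int = 100, offset: int = 0) -> urllib.request.Request:
    """Build (but do not send) a GET request for one page of DAGs."""
    query = urllib.parse.urlencode({"limit": limit, "offset": offset})
    return urllib.request.Request(f"{BASE_URL}/dags?{query}", method="GET")


# Third page when you ask for 10 DAGs per page.
req = list_dags_request(limit=10, offset=20)
print(req.get_method(), req.full_url)
# To actually send it against a running instance: urllib.request.urlopen(req)
```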

How to trigger a DAG from the API

It’s really simple. There is a resource called dagRuns in the API to interact with your DAG runs. You can list your DAG runs, trigger a new DAG run or delete one.

To trigger a new DAG run you can execute the following request:

curl -X POST -d '{"execution_date": "2021-01-01T15:00:00Z", "conf": {}}' 'http://localhost:8080/api/v1/dags/example_bash_operator/dagRuns' -H 'content-type: application/json' --user "airflow:airflow"

Some important points to keep in mind:

  • In the documentation you will see that state is defined in the payload. Don’t define it, otherwise you will get the error “state is a read only property”.
  • I recommend that you do not define the dag_run_id property. That way, you keep the format generated by Airflow (manual__ followed by the execution date) and your DAG run ids stay homogeneous.
  • The DAG run ids have the following format when triggered from the API: manual__2021-01-01T15:00:00+00:00.
  • You have to unpause your DAG before triggering it from the API, otherwise the DAG run will be shown as running in the UI but no tasks will actually run.
  • Keep in mind that you cannot trigger a DAG more than once on the same execution date. If you want to do this, you first have to delete the DAG run and then trigger it again. You can delete a DAG run with the following request:
curl -X DELETE 'http://localhost:8080/api/v1/dags/example_bash_operator/dagRuns/manual__2021-01-01T15:00:00+00:00' -H 'content-type: application/json' --user "airflow:airflow"

Notice that if this request works, you won’t receive any output except the status code 204.
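The points above can be sketched in Python as follows. This is only an illustration: trigger_payload and expected_run_id are names I made up, and the run id format is simply the one described in this guide (manual__ followed by the execution date), not something I pulled from the Airflow source.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def trigger_payload(execution_date: datetime, conf: Optional[dict] = None) -> str:
    """JSON body for POST /dags/{dag_id}/dagRuns.

    state and dag_run_id are deliberately left out, as recommended above.
    """
    return json.dumps({"execution_date": execution_date.isoformat(), "conf": conf or {}})


def expected_run_id(execution_date: datetime) -> str:
    """Run id you can expect, assuming the manual__<execution date> format."""
    return f"manual__{execution_date.isoformat()}"


when = datetime(2021, 1, 1, 15, 0, tzinfo=timezone.utc)
print(trigger_payload(when))
print(expected_run_id(when))
```

Always use a timezone-aware datetime here, so that the execution date you send is unambiguous.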

Now you may wonder where you can find the DAG run id to put in that request. Again, there is a request for that:

curl --verbose 'http://localhost:8080/api/v1/dags/example_bash_operator/dagRuns' -H 'content-type: application/json' --user "airflow:airflow"

To sum up, in order to run a DAG you need to unpause the DAG -> trigger the DAG run -> delete the DAG run (if you want to run it again on the same execution date).
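Here is a rough sketch of that sequence as an ordered list of requests, so you can plug it into the HTTP client of your choice. The function rerun_steps is hypothetical, and percent-encoding the run id in the DELETE path is my own precaution (the curl example above sends it as is).

```python
import json
from typing import Optional
from urllib.parse import quote

# Path prefix used by the curl examples in this guide.
BASE_PATH = "/api/v1"


def rerun_steps(dag_id: str, execution_date: str, existing_run_id: Optional[str] = None) -> list:
    """Return the ordered (method, path, body) steps to run a DAG on a given date."""
    steps = [
        # 1. Unpause the DAG, otherwise no task will actually run.
        ("PATCH", f"{BASE_PATH}/dags/{dag_id}?update_mask=is_paused", json.dumps({"is_paused": False})),
    ]
    if existing_run_id is not None:
        # 2. Only when a run already exists on that execution date: delete it first.
        run_path = quote(existing_run_id, safe="")
        steps.append(("DELETE", f"{BASE_PATH}/dags/{dag_id}/dagRuns/{run_path}", None))
    # 3. Trigger the new DAG run.
    body = json.dumps({"execution_date": execution_date, "conf": {}})
    steps.append(("POST", f"{BASE_PATH}/dags/{dag_id}/dagRuns", body))
    return steps


for method, path, body in rerun_steps(
    "example_bash_operator", "2021-01-01T15:00:00Z", "manual__2021-01-01T15:00:00+00:00"
):
    print(method, path, body)
```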

Last but not least, you can leverage the conf property to pass any data you want to the DAG run. In your DAG, you just have to access dag_run.conf, for example with the template engine {{ dag_run.conf }}, to get back your data.
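Besides templates, you can read the same data from a Python task. Below is a minimal sketch of a callable you could pass to a PythonOperator. The key path and its default value are hypothetical, and the SimpleNamespace object is only a stand-in for the context Airflow gives you at runtime, so that the sketch runs without Airflow installed.

```python
from types import SimpleNamespace


def print_conf(**context):
    """Task callable: read the data sent in the conf property of the trigger request."""
    # dag_run.conf can be empty when the run was not triggered with a conf.
    conf = context["dag_run"].conf or {}
    # "path" is a hypothetical key, used only for this illustration.
    return conf.get("path", "/tmp/default")


# Stand-in for the runtime context, just to exercise the function locally.
fake_context = {"dag_run": SimpleNamespace(conf={"path": "/data/in"})}
print(print_conf(**fake_context))
```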

How to delete a DAG from the API

You can’t, at least not in Airflow 2.0. It is as simple as that. You can’t delete a DAG from the API and that makes sense. The role of the API is to reflect what a typical user can do. Deleting a DAG is a critical action that you should not be able to do just by making a request. However, one thing you can do is pause your DAG so that it doesn’t get triggered anymore. To do this, you can make the following request:

curl -X PATCH -d '{"is_paused": true}' 'http://localhost:8080/api/v1/dags/example_bash_operator?update_mask=is_paused' -H 'content-type: application/json' --user "airflow:airflow"

Notice that, unlike the experimental API, the new Airflow REST API has no specific endpoints for this. Here you “patch” the DAG to pause or unpause it.
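The same PATCH can be sketched in Python with the standard library only. The request is built but not sent, set_paused_request is a name I chose, and the Authorization header is omitted for brevity.

```python
import json
import urllib.request

# Same local instance as in the curl examples of this guide.
BASE_URL = "http://localhost:8080/api/v1"


def set_paused_request(dag_id: str, paused: bool) -> urllib.request.Request:
    """Build (but do not send) the PATCH request that pauses or unpauses a DAG."""
    body = json.dumps({"is_paused": paused}).encode("utf-8")
    return urllib.request.Request(
        # update_mask restricts the update to the is_paused field.
        f"{BASE_URL}/dags/{dag_id}?update_mask=is_paused",
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


req = set_paused_request("example_bash_operator", paused=False)
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

One function covers both cases: pass paused=True to pause the DAG and paused=False to unpause it.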

How to monitor your Airflow instance

I think you agree with me that monitoring your Airflow instance is important. For example, you might want to check if your Airflow scheduler is healthy before triggering a DAG. Well you can do this with the following request:

curl --verbose 'http://localhost:8080/api/v1/health' -H 'content-type: application/json' --user "airflow:airflow"

This request tells you whether the metadatabase and the scheduler are healthy, that is, whether they are “on”. Keep in mind that it doesn’t tell you if they are under high pressure, or if too much CPU or memory is consumed. Bottom line, you can use that request as a first check but always rely on a real monitoring system as described here.
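As a first check before triggering a DAG, you could parse the response like in the sketch below. I’m assuming the response has a metadatabase and a scheduler entry, each with a status field equal to "healthy" when everything is fine. The sample body is made up for the illustration, so verify the exact shape against your own instance.

```python
import json


def is_healthy(health_body: str) -> bool:
    """Return True only if both the metadatabase and the scheduler report healthy."""
    health = json.loads(health_body)
    components = ("metadatabase", "scheduler")
    return all(health.get(name, {}).get("status") == "healthy" for name in components)


# Made-up sample body, following the assumed response shape.
sample = json.dumps({
    "metadatabase": {"status": "healthy"},
    "scheduler": {"status": "unhealthy", "latest_scheduler_heartbeat": None},
})
print(is_healthy(sample))
```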

In addition, you can check if your DAGs have any import issues with the following request:

curl --verbose 'http://localhost:8080/api/v1/importErrors' -H 'content-type: application/json' --user "airflow:airflow"

This can be pretty useful too if you have a lot of DAGs in your Airflow instance.
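To turn that response into something readable, here is a small sketch. It assumes the body contains an import_errors list whose items have filename and stack_trace fields. The sample data is invented for the example, and summarize_import_errors is my own helper name.

```python
import json


def summarize_import_errors(body: str) -> list:
    """Return one 'filename: last line of the stack trace' string per import error."""
    payload = json.loads(body)
    summary = []
    for error in payload.get("import_errors", []):
        lines = (error.get("stack_trace") or "").strip().splitlines()
        # The last line of a Python traceback usually carries the actual exception.
        last_line = lines[-1] if lines else ""
        summary.append(f"{error.get('filename')}: {last_line}")
    return summary


# Invented sample, following the assumed response shape.
sample = json.dumps({
    "import_errors": [
        {
            "filename": "/opt/airflow/dags/broken_dag.py",
            "stack_trace": "Traceback (most recent call last):\nModuleNotFoundError: No module named 'foo'",
        }
    ],
    "total_entries": 1,
})
print(summarize_import_errors(sample))
```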

Conclusion

The new Airflow REST API is a game changer. Now you are able to interact with most of the resources from third-party tools in a very easy and standardised way. You can build applications on top of Airflow to add features, you can think of new use cases and so much more. I strongly advise you to take a look at the documentation and please please please, forget about the experimental API, there is no reason to use it anymore.

PS: If you want to get started with Airflow now, take a look at the course I made for you here

See you! 😉
