Variables in Apache Airflow: The Guide

Wondering how to deal with variables in Apache Airflow? Well, you are in the right place. In this tutorial, you are going to learn everything you need to know about variables in Airflow: what they are, how they work, how to define them, how to get them, and more. If you followed my course “Apache Airflow: The Hands-On Guide”, variables should not sound unfamiliar, as we quickly manipulated them in a lesson. This time, I’m going to give you all I know about variables so that by the end, you will have solid knowledge and be ready to use them in your DAGs. Without further ado, let’s get started!

As you may already know, one of the greatest benefits of using Airflow is the ability to create dynamic DAGs. An example of a dynamic data pipeline could be the creation of N tasks based on a changing list of filenames: those N tasks are instantiated from the filenames. Notice that you could imagine doing the same with databases, for example. Now the question is, where would you keep this list of filenames to fetch? Hard coded in the DAG? Hell no! In a variable? Hmm, that seems like a better idea!

Let me give you another example. Let’s say you have configuration settings (not critical) required by your DAGs. I think the best way to illustrate this kind of need is to look at the KubernetesPodOperator. This operator expects many parameters such as resources to limit CPU and memory utilization, ports, volumes and so on. Again, instead of hard coding these different values, you could define a variable with a JSON dictionary describing these settings.

You can get or set variables from your DAGs, but also from the UI or the command line interface. Notice that it is also possible to use variables in Jinja templates and make your DAGs truly dynamic.

Bottom line: variables are useful for storing and retrieving data at runtime while avoiding hard-coded values or code repetition in your DAGs.

Let’s discover how variables work in Apache Airflow.

How do variables work?

Airflow is based on three main components: the web server, the scheduler, and the metadata database. Let’s focus on the metadata database. This database can be backed by any SQL database compatible with SQLAlchemy, such as Postgres, MySQL, SQLite and so on. After initialising Airflow, many tables populated with default data are created. One of those tables is “variable”, where the variables are stored. By looking more closely at the table, here is what we get:

                                       Table "public.variable"
    Column    |          Type          | Collation | Nullable |               Default
--------------+------------------------+-----------+----------+--------------------------------------
 id           | integer                |           | not null | nextval('variable_id_seq'::regclass)
 key          | character varying(250) |           |          |
 val          | text                   |           |          |
 is_encrypted | boolean                |           |          |
Indexes:
    "variable_pkey" PRIMARY KEY, btree (id)
    "variable_key_key" UNIQUE CONSTRAINT, btree (key)

Four columns are defined:

  • id: incrementing numeric value that will be automatically assigned to your variable
  • key: literal string used to retrieve your variable in the table. Must be UNIQUE.
  • val: literal string corresponding to the value of your variable.
  • is_encrypted: boolean indicating if the value of the variable is encrypted or not. As long as the FERNET_KEY parameter is set, your variables will be encrypted by default. If you don’t know what I’m talking about, check my course, where I show you how it works.

Once you create a variable in Airflow, here is what you get:

 id |     key      |                                                  val                                                 | is_encrypted
----+--------------+------------------------------------------------------------------------------------------------------+--------------
  1 | my_first_var | gAAAAABeoWewkbpjOJmhgaWx73VpCHPI858rq4e9kawYGxzrJNpgSM63mIouJsNaM15TRtqX4NNih-MiKSed9468ZLLCygwdfA== | t
(1 row)

As you can see, I created a variable with the key my_first_var. It was assigned the id 1, since it is the first variable in the table, and it has a very strange string as value, since the variable is encrypted. Don’t worry, I will come back to it in a minute.

All right, so we know where variables are stored. Now, what’s the catch?

Best practices with variables in Airflow

Something you should absolutely know about Airflow is how the scheduler works. The scheduler is the masterpiece of Airflow, and only by understanding its mechanism will you be able to avoid some gotchas that could drastically reduce the performance of your Airflow instance. One of the most common mistakes I see in DAGs is code outside of task definitions.

The Airflow scheduler parses all the DAGs in the background at a specific interval of time, defined by the parameter processor_poll_interval in airflow.cfg. This parameter is set to 1 by default, meaning your DAGs are parsed every second. Since variables open a connection to the metadata database each time they fetch a value, if you set or get a variable outside of tasks, you may end up with a lot of open connections. If you have many DAGs with many variables, that can be a big waste of resources and decrease performance.

Bottom line: Don’t write any code outside of tasks.
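
To make this concrete, here is a minimal sketch (the variable name is made up) contrasting the two approaches:

from airflow.models import Variable

# Anti-pattern: this line runs every time the scheduler parses the DAG file,
# opening a connection to the metadata database at each parsing cycle.
# my_param = Variable.get("my_param")

# Better: fetch the variable inside the callable of a task, so the database
# is only queried when the task actually runs.
def process():
    my_param = Variable.get("my_param")
    print(my_param)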

How to set a variable in Airflow?

There are three ways of defining variables in Apache Airflow. The most intuitive way is through the User Interface. 

User Interface
variables in airflow

Nothing much to say here: just go to Admin -> Variables -> Create and you will land on this beautiful interface.

Command line interface

The second way of creating variables is by using the command line interface.

variables in airflow cli

You can perform CRUD operations on variables with the command airflow variables. The command below allows you to set a variable my_second_var with the value my_value.

airflow variables -s my_second_var my_value

We can also export the variables to a JSON file:

airflow variables -e my_variables.json

Then, if we open the file my_variables.json we get:

{
    "my_first_var": "my_first_value",
    "my_second_var": "my_value"
}
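The same command can also import variables from such a JSON file, which is handy for moving them between environments (the -i flag should do the trick in this Airflow version):

airflow variables -i my_variables.json
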
Code

The last way of defining variables is by code. Indeed, you can create variables directly from your DAGs with the following code snippet:

from airflow.models import Variable
Variable.set("my_key", "my_value")

Remember, don’t put any get/set of variables outside of tasks.
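
For instance, here is a minimal sketch (task and variable names are mine) of setting a variable from within a task instead of at the top level of the DAG file:

from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

def save_state():
    # The connection to the metadata database is only opened when the task runs
    Variable.set("last_processed_file", "data_2020_04_23.csv")

save_task = PythonOperator(
    task_id="save_state",
    python_callable=save_state,
    dag=dag,  # assuming a DAG object named "dag" already exists
)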

All right, now that we have seen the different ways of defining variables, let’s discover how to get them.

How to get a variable in Airflow?

The ways of getting variables in Airflow are the same as the ways of setting them. One very important thing to keep in mind is where Airflow will look first for your variables. Let me show you with the little schema below:

airflow_order_variables

As you can see, there are two “components” that we haven’t seen yet: the secrets backend and Airflow environment variables. Don’t worry, I will come back to them later in this tutorial. Just keep in mind that Airflow goes through these two “layers” before reaching the metastore. If the variable is found in one of them, Airflow doesn’t need to open a connection to the metastore, which is more efficient. That being said, let’s discover the different ways of getting a variable in Airflow.

User Interface

Well, that’s pretty simple: just go to the UI, Admin -> Variables, and you will get access to your variables as shown below.

apache_airflow_variable_view
Command Line Interface

In order to get a variable through the command line interface, execute the following command:

airflow variables -g key_of_the_variable

and you will get the decrypted value of the variable.

Code

In your DAGs, there are two ways of getting your variables. Either by using the class “Variable” as shown below:

from airflow.models import Variable
my_var = Variable.get("my_key")
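
Note that Variable.get also accepts a default_var parameter, which avoids raising an error when the key does not exist (handy for optional settings):

from airflow.models import Variable

# Returns "default_value" instead of failing if "my_key" is missing
my_var = Variable.get("my_key", default_var="default_value")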

Or, by leveraging Jinja if you are trying to fetch a variable from a template: 

 {{ var.value.<variable_key> }}
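
For example, in a templated field such as the bash_command of the BashOperator, the sketch below (assuming a variable with the key my_key exists) prints the value at runtime:

from airflow.operators.bash_operator import BashOperator

display = BashOperator(
    task_id="display_variable",
    bash_command="echo {{ var.value.my_key }}",  # rendered by Jinja at runtime
    dag=dag,  # assuming a DAG object named "dag" already exists
)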

Now you might say, “What the hell is Jinja and a template?”, well you’re lucky because I made a tutorial about it right here 🙂

All right, at this point, you just learned the basics of dealing with variables in Airflow and a little more. In the next sections, we are going to dive a little deeper and discover other ways of using them.

How to hide the value of a variable?

One thing you may want is to hide the values of your sensitive variables from the UI. Indeed, if you store things like AWS keys or passwords, it would be better to prevent anyone from reading them. Hiding your variables in Airflow is pretty easy: you just need to include one of the following strings in the key of your variable:

apache_airflow_sensitive_variables

For example, if I have a variable with the key: aws_secret_access_key, this is what I will obtain from the UI:

apache_airflow_variable_hidden

As you can see, the value is hidden.

How to mix variables and templates in DAGs?

If you want to unleash the full power and flexibility of Airflow, you have to understand how the Jinja template engine works and what you can do by mixing templates with variables and macros. You will be able to modify and inject data into your DAGs at runtime, and act according to these values.

For example, let’s say you have a DAG fetching credit card movements from clients and processing them. Instead of having this kind of DAG:

DAG_example_1

You could have:

DAG_example_2

Here, the task “fetching_clients” fetches the list of clients and then, for each client, a task with their name is created. Why is it better? Because in the first DAG, if the processing fails for one client, you have to retry the task for all of your clients and figure out which client caused the error. Whereas, in the second DAG, you can identify the failing client very quickly and only retry the corresponding task. Better optimized, more robust and faster results.
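
To give you an idea, here is a minimal sketch of that second DAG (the variable key, client names and processing logic are all made up; note that the client list is read when the file is parsed, with the cost discussed earlier):

from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

# The variable "clients" is assumed to hold a JSON list, e.g. ["client_a", "client_b"]
clients = Variable.get("clients", deserialize_json=True)

def process_client(client_name):
    print("Processing credit card movements for {}".format(client_name))

for client in clients:
    PythonOperator(
        task_id="processing_{}".format(client),
        python_callable=process_client,
        op_kwargs={"client_name": client},
        dag=dag,  # assuming a DAG object named "dag" already exists
    )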

After this quick introduction, if you want to learn how to deal with templates, variables and macros in Airflow, check out my tutorial right here.

Optimizing variables with the JSON format

If you have multiple values with a possible hierarchy that you would like to store in a variable, like configuration settings, it would be more suitable to store this data in a friendly format. Well, Airflow allows you to set and get variables in JSON format. For example, let’s say we have the following JSON data:

{
    "login": "my_login",
    "password": "my_password",
    "config": {
          "role": "admin"
     }
}

Let’s say you store it in a variable named “settings”.

From your DAG, you could either get this JSON data with:

from airflow.models import Variable
settings = Variable.get("settings", deserialize_json=True)

# And be able to access the values like in a dictionary
print(settings['login'])
print(settings['config']['role'])

Or if you are in a template:

 {{ var.json.<variable_key> }}
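
You can also drill down into the JSON directly from a template. With the settings variable above, for instance:

 {{ var.json.settings.config.role }}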

When you have multiple values that can be logically grouped together, I strongly encourage you to store them in a single variable in JSON format. By doing so, you avoid making multiple requests to the metadata database, as only one is enough to get everything you need. Fewer connections is better.

Storing variables in environment variables

If you remember the schema showing the different layers at which Airflow tries to fetch variables, there is one layer just before the metastore corresponding to environment variables (labelled “AIRFLOW__ENV” in the schema).

Indeed, since Apache Airflow 1.10.10, it is possible to store and fetch variables from environment variables just by using a special naming convention. Any environment variable named AIRFLOW_VAR_<KEY_OF_THE_VAR> will be taken into account by Airflow.

Concretely, in your bash session, you could execute the following commands:

export AIRFLOW_VAR_AWS_ACCESS_KEY_ID="wejfhwfhwwner"

# Or in JSON
export AIRFLOW_VAR_SETTINGS='{"login":"marc", "password": "my_pass", "config": { "role": "admin" }}'

This creates two variables, AWS_ACCESS_KEY_ID and SETTINGS. Fetching them from your DAGs is done exactly like with any other variable.
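
For instance, the sketch below would resolve SETTINGS from the environment variable before ever touching the metastore:

from airflow.models import Variable

# Resolved from AIRFLOW_VAR_SETTINGS, no connection to the metadata database needed
settings = Variable.get("settings", deserialize_json=True)
print(settings["config"]["role"])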

If you want to see that in action, I made a special video just below:

How to get environment variables from your DAGs?

What if you would like to fetch an environment variable which isn’t prefixed by AIRFLOW_VAR_? Well, in your DAG you could do the following:

import os

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(...)

def print_env_var():
    print(os.environ["CASSANDRA_PASSWORD"])

print_context = PythonOperator(
    task_id="cassandra",
    python_callable=print_env_var,
    dag=dag,
)

Nothing specific to Airflow here, it’s what you would do in any Python code. Here, we print the value of the environment variable CASSANDRA_PASSWORD (which is not really smart btw ;p ) using Python’s os module.
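
By the way, if you are not sure the environment variable is set, os.environ.get lets you provide a fallback instead of raising a KeyError:

import os

# Returns the fallback value when CASSANDRA_PASSWORD is not defined
password = os.environ.get("CASSANDRA_PASSWORD", "not_set")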

Conclusion

All right, pretty dense tutorial, isn’t it? I hope you really enjoyed what you’ve learned. Airflow is a really powerful orchestrator with many features to discover. If you want to discover Airflow, go check my course The Complete Hands-On Introduction to Apache Airflow right here. Or if you already know Airflow and want to go much further, enrol in my 12-hour course here.

You may have noticed that I didn’t talk about the secrets backend yet. Well, that’s for another video/tutorial coming very soon. If you want to stay in touch, fill in the form below.

Have a great day! 🙂 

