Wondering how to deal with variables in Apache Airflow? Well, you are in the right place. In this tutorial, you are going to learn everything you need about variables in Airflow: what they are, how they work, how to define them, how to get them, and more. If you followed my course “Apache Airflow: The Hands-On Guide”, variables should not sound unfamiliar to you, as we quickly manipulated them in a lesson. This time, I’m going to give you all I know about variables so that at the end, you will have solid knowledge and be ready to use them in your DAGs. Without further ado, let’s get started!
As you may already know, one of the greatest benefits of using Airflow is the ability to create dynamic DAGs. An example of a dynamic data pipeline could be the creation of N tasks based on a changing list of filenames. These N tasks will be instantiated based on those filenames. Notice that you could imagine doing the same with databases, for example. Now the question is, where would you create this list of filenames to fetch? Hard coded in the DAG? Hell no! In a variable? Hmm, that seems like a better idea!
Let me give you another example. Let’s say you have configuration settings (not critical) required by your DAGs. I think the best way to illustrate this kind of need is by looking at the KubernetesPodOperator. This operator expects many parameters, such as resources to limit CPU and memory utilization, ports, volumes and so on. Again, instead of hard coding these different values, you could define a variable with a JSON dictionary describing these settings.
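For instance, such a variable could hold a JSON dictionary like the one below (the keys are illustrative, loosely based on the parameters the KubernetesPodOperator accepts; adapt them to your operator):

```json
{
    "resources": {
        "request_memory": "512Mi",
        "request_cpu": "500m",
        "limit_memory": "1Gi",
        "limit_cpu": "1"
    },
    "ports": [8080]
}
```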
You can either get or set variables from your DAGs but also from the UI or the Command Line Interface. Notice that it is also possible to use variables in Jinja templates and make your DAGs truly dynamic.
Bottom line: Variables are useful for storing and retrieving data at runtime while avoiding hard-coded values and code repetition in your DAGs.
Let’s discover how variables work in Apache Airflow.
How do variables work?
Airflow is based on three main components: the web server, the scheduler, and the metadata database. Let’s focus on the metadata database. This database can be backed by any SQL database compatible with SQLAlchemy, such as Postgres, MySQL, SQLite and so on. After initialising Airflow, many tables populated with default data are created. One of these tables is “variable”, where the variables are stored. By looking more closely at the table, here is what we get:
4 columns are defined.
- id: incrementing numeric value that will be automatically assigned to your variable
- key: literal string used to retrieve your variable in the table. Must be UNIQUE.
- val: literal string corresponding to the value of your variable.
- is_encrypted: boolean indicating if the value of the variable is encrypted or not. As long as the FERNET_KEY parameter is set, your variables will be encrypted by default. If you don’t know what I’m talking about, check my course, where I show you how it works.
Once you create a variable in Airflow, here is what you get:
As you can see, I created a variable with the key my_first_var. The id 1 was assigned since it is the first variable in the table, and the value is a very strange string since the variable is encrypted. Don’t worry, I will come back to it in a minute.
All right, so we know where variables are stored. Now, what’s the catch?
Best practices with variables in Airflow
Something you should absolutely know about Airflow is how the scheduler works. The scheduler is the masterpiece, and only by understanding its mechanism will you be able to avoid some gotchas that could drastically reduce the performance of your Airflow instance. One of the most common mistakes I see in DAGs is code outside of task definitions.
The Airflow scheduler parses all the DAGs in the background at a specific interval of time defined by the parameter processor_poll_interval in airflow.cfg. This parameter is set to 1 by default, meaning your DAGs may be parsed as often as every second. Since variables open a connection to the metadata database each time they fetch a value, if you either set or get a variable outside of tasks, you may end up with a lot of open connections. If you have many DAGs with many variables, that can be a big waste of resources and degrade performance.
Bottom line: Don’t write any code outside of tasks.
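To make the point concrete, here is a minimal sketch (assuming a variable my_first_var exists, as created earlier):

```python
from airflow.models import Variable

# BAD: this line would run at every parse of the DAG file,
# opening a connection to the metadata database each time.
# my_var = Variable.get("my_first_var")

def my_callable():
    # GOOD: this runs only when the task is actually executed.
    my_var = Variable.get("my_first_var")
    print(my_var)
```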
How to set a variable in Airflow?
There are three ways of defining variables in Apache Airflow. The most intuitive way is through the User Interface.
Nothing much to say, just go to Admin -> Variables -> Create and you will land on this beautiful interface.
Command Line Interface
The second way of creating variables is by using the command line interface.
You can perform CRUD operations on variables with the command airflow variables. The command below allows you to set a variable my_second_var with the value my_value.
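A hedged example; the flag-based form is the Airflow 1.10 CLI, while Airflow 2.0+ uses subcommands instead:

```shell
# Airflow 1.10
airflow variables --set my_second_var my_value

# Airflow 2.0+
# airflow variables set my_second_var my_value
```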
We can also export the variables to a JSON file:
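For example (Airflow 1.10 syntax; Airflow 2.0+ uses `airflow variables export`):

```shell
airflow variables --export my_variables.json
```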
Then, if we open the file my_variables.json we get:
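Something like this, assuming the two variables created so far (the values shown are illustrative; note that values appear decrypted in the export):

```json
{
    "my_first_var": "my_value",
    "my_second_var": "my_value"
}
```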
The last way of defining variables is by code. Indeed, you can create variables directly from your DAGs with the following code snippet:
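A minimal sketch (my_third_var is a hypothetical key):

```python
from airflow.models import Variable

def my_callable():
    # Creates the variable if it doesn't exist, updates it otherwise.
    Variable.set("my_third_var", "my_value")
```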
Remember, don’t put any get/set of variables outside of tasks.
All right, now we have seen the different ways of defining variables, let’s discover how to get them.
How to get a variable in Airflow?
You can get variables in the same ways you set them. One very important thing to keep in mind is where Airflow will look first for your variables. Let me show you this little schema below:
As you can see, there are two “components” that we haven’t seen yet: backend secrets and Airflow environment variables. Don’t worry, I will come back to them later in the tutorial. Just keep in mind that Airflow goes through two “layers” before reaching the metastore. If the variable is found in one of these two layers, Airflow doesn’t need to open a connection to the database, which is more efficient. That being said, let’s discover the different ways of getting a variable in Airflow.
Well, that’s pretty simple: just go to the UI, Admin -> Variables, and you will get access to your variables as shown below.
Command Line Interface
In order to get a variable through the command line interface, execute the following command:
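For example, to get my_first_var (Airflow 1.10 syntax; Airflow 2.0+ uses `airflow variables get`):

```shell
airflow variables --get my_first_var
```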
and you will get the decrypted value of the variable.
In your DAGs, there are two ways of getting your variables. Either by using the class “Variable” as shown below:
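A minimal sketch, reusing my_first_var from earlier:

```python
from airflow.models import Variable

def my_callable():
    # Returns the decrypted value; default_var is returned
    # if the key does not exist in any layer.
    my_var = Variable.get("my_first_var", default_var=None)
    print(my_var)
```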
Or, by leveraging Jinja if you are trying to fetch a variable from a template:
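For example, with the BashOperator (shown with its Airflow 1.10 import path). The template is rendered at runtime, so nothing hits the metastore at parse time:

```python
from airflow.operators.bash_operator import BashOperator

display = BashOperator(
    task_id="display",
    bash_command="echo {{ var.value.my_first_var }}",
)
```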
Now you might say, “What the hell is Jinja and a template?”, well you’re lucky because I made a tutorial about it right here 🙂
All right, at this point, you just learned the basics of dealing with variables in Airflow and a little more. In the next sections, we are going to dive a little deeper and discover other ways of using them.
How to hide the value of a variable?
One thing you may want is to hide the values of your sensitive variables from the UI. Indeed, if you store things like AWS keys or passwords, it would be better to prevent anyone from reading them. Hiding your variables in Airflow is pretty easy. You just need to include one of the following strings in the key of your variable:
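Here is a small plain-Python sketch of the masking logic, using the default keywords at the time of writing (the exact list may vary between Airflow versions):

```python
# Default keywords that trigger masking of a variable's value in the UI
# (taken from Airflow's defaults; check your version to be sure).
SENSITIVE_KEYWORDS = [
    "password", "secret", "passwd",
    "authorization", "api_key", "apikey", "access_token",
]

def is_hidden(key):
    """Return True if the variable's value would be masked in the UI."""
    return any(word in key.lower() for word in SENSITIVE_KEYWORDS)

print(is_hidden("aws_secret_access_key"))  # True: the key contains "secret"
print(is_hidden("my_first_var"))           # False
```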
For example, if I have a variable with the key: aws_secret_access_key, this is what I will obtain from the UI:
As you can see, the password is hidden.
How to mix variables and templates in DAGs?
If you want to unleash the full power and flexibility of Airflow, you have to understand how the Jinja template engine works and what you can do by mixing templates with variables and macros. You will be able to modify and insert data into your DAGs at runtime, and act according to these values.
For example, let’s say you have a DAG fetching credit card movements from clients and processing them. Instead of having this kind of DAG:
You could have:
Where the task “fetching_clients” will fetch the list of clients and then, for each client, a task with their name will be created. Why is it better? Because in the first DAG, if the processing fails for one client, you have to retry the task for all of your clients and figure out which client caused the error. Whereas in the second DAG, you can identify the client very quickly and retry only the corresponding task. Better optimized, more robust, and faster results.
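A hedged sketch of such a dynamic DAG, with hypothetical names (the “clients” variable, the DAG id and the callable are all mine for illustration). Note that the lookup happens at parse time here, which is exactly the kind of pattern the environment-variable and secret backend layers make cheaper:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

def process_client(client):
    print("Processing movements for {}".format(client))

with DAG("movements", start_date=datetime(2020, 1, 1), schedule_interval="@daily") as dag:
    # e.g. the variable "clients" stores ["client_a", "client_b", ...]
    clients = Variable.get("clients", default_var=[], deserialize_json=True)
    for client in clients:
        PythonOperator(
            task_id="processing_{}".format(client),
            python_callable=process_client,
            op_args=[client],
        )
```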
After this quick introduction, if you want to learn how to deal with templates, variables and macros in Airflow, check out my tutorial right here.
Optimizing variables with the JSON format
If you have multiple values with a possible hierarchy that you would like to store in a variable, like configuration settings, it would be more suitable to store this data in a friendly format. Well, Airflow allows you to set and get variables in JSON format. For example, let’s say we have the following JSON data:
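For example (illustrative values):

```json
{
    "aws_access_key_id": "my_key",
    "aws_secret_access_key": "my_secret",
    "region": "eu-west-1"
}
```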
Suppose you store it in a variable named “settings”.
From your DAG, you could either get this JSON data with:
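A minimal sketch, assuming the “settings” variable from above:

```python
from airflow.models import Variable

def my_callable():
    # deserialize_json=True turns the stored JSON string into a Python dict,
    # with a single request to the metadata database.
    settings = Variable.get("settings", deserialize_json=True)
    print(settings["region"])
```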
Or if you are in a template:
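With var.json, the variable is deserialized and its keys become accessible as attributes (BashOperator shown with its Airflow 1.10 import path):

```python
from airflow.operators.bash_operator import BashOperator

display_region = BashOperator(
    task_id="display_region",
    bash_command="echo {{ var.json.settings.region }}",
)
```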
When you have multiple values that can be logically regrouped, I strongly encourage you to store them in a variable in JSON format. By doing this, you will avoid making multiple requests to the metadata database, as a single one will be enough to get everything you need. Fewer connections are better.
Storing variables in environment variables
If you remember the schema showing the different layers at which Airflow tries to fetch variables, there is one layer before the metastore called “AIRFLOW__ENV”.
Indeed, since Apache Airflow 1.10.10, it is possible to store and fetch variables from environment variables just by using a special naming convention. Any environment variable named AIRFLOW_VAR_<KEY_OF_THE_VAR> will be taken into account by Airflow.
Concretely, in your bash session, you could execute the following commands:
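For example (hypothetical values):

```shell
export AIRFLOW_VAR_AWS_ACCESS_KEY_ID="my_key"
export AIRFLOW_VAR_SETTINGS='{"aws_access_key_id": "my_key", "region": "eu-west-1"}'
```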
To create two variables AWS_ACCESS_KEY_ID and SETTINGS. Fetching them from your DAGs is done exactly like with any other variables.
If you want to see that in action, I made a special video just below:
How to get environment variables from your DAGs?
What if you would like to fetch an environment variable that isn’t prefixed by AIRFLOW_VAR_? Well, in your DAG you could do the following:
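A minimal sketch, nothing Airflow-specific (the second argument of os.environ.get is a fallback returned if the variable is not set):

```python
import os

# CASSANDRA_PASSWORD is just an example name; any environment
# variable can be read this way from within a task.
password = os.environ.get("CASSANDRA_PASSWORD", "not_set")
print(password)
```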
Nothing specific to Airflow here, it’s what you would do in any Python code. Here, we print the value of the environment variable CASSANDRA_PASSWORD (which is not really smart btw ;p ) by using Python’s os module.
All right, pretty dense tutorial, isn’t it? I hope you really enjoyed what you’ve learned. Airflow is a really powerful orchestrator with many features to discover. If you want to discover Airflow, go check my course The Complete Hands-On Introduction to Apache Airflow right here. Or if you already know Airflow and want to go much further, enroll in my 12-hour course here.
You may have noticed that I didn’t talk about the secret backend yet. Well, that’s for another video/tutorial coming very soon. If you want to stay in touch, fill in the form below.
Have a great day! 🙂