Airflow Variables

Airflow Variables are easy to use but also easy to misuse. In this tutorial, you will learn everything you need to know about Variables in Apache Airflow: what they are, how they work, how to define one, how to get their values, and more. If you followed my course “Apache Airflow: The Hands-On Guide”, Variables shouldn’t sound unfamiliar. This time, I will give you everything I know about Variables so that, by the end, you will be ready to use them properly in your DAGs. Without further ado, let’s get started!

Why Airflow Variables?

Imagine you have different tasks or DAGs using the same API key. Are you going to hardcode this key everywhere? No. One way is to define it in a Python file and import that file wherever you need the key. Another way is to store it in an Airflow Variable. Why? The key doesn’t appear in the code, so there is no risk of pushing credentials or other sensitive values to your repository. Variables add a security layer (a relative one 🥹) and are a great way to share common values between tasks.

Another example is when you need configuration settings for operators like the KubernetesPodOperator. This operator expects parameters such as resource limits for CPU and memory, ports, volumes, and so on. Instead of hard-coding these values, you can define a Variable with a JSON dictionary that contains them. Even better, create a custom operator with those values as a wrapper around the KubernetesPodOperator.

Variables are useful for storing and retrieving data at runtime while avoiding hard-coding values and duplicating code in your DAGs.

What is a Variable?

An Airflow Variable is a key-value pair used to store information within Airflow. Variables commonly store instance-level information that rarely changes, such as an API key or the path to a configuration file. A Variable has five attributes:

  • The id: Primary key (only in the DB)
  • The key: The unique identifier of the variable. It is required.
  • The val: The value to store. It expects a string (JSON works too).
  • The description: Useful to give context around a variable.
  • is_encrypted: Boolean that tells if the Variable’s value is encrypted. By default, Variable values are encrypted in the DB using the Fernet key. (only in the DB)

Here is the representation of a Variable in the DB:

 id |      key      |                              val                              | description | is_encrypted
----+---------------+---------------------------------------------------------------+-------------+--------------
  3 | activity_file | gAAAAABlGYhn9ADRnSL9d9A7ctyfA2PUt97OvTP9kaPSvRcdUAmX85nfk37YnmyyXLHCiI3Oz_3_ljK0szeHoyEq1-iIYx0-VTB2SYq7AJNfzONB91SWCnQ= |             | t
  4 | secret_var    | gAAAAABlGYh6-Btm9yZITn3Qu43A2j6C00XUW4lmPAaCFYxYzkCW3NZwlKiFl4kTf9vj9wOMHIqCZMvt6Mpnajmewgri0i25kA==                     |             | t

How to create a variable in Airflow?

You can create a Variable using the UI, CLI, API, environment variables, and secret backends.

With the User interface (Admin -> Variables):

[Screenshot: creating a Variable from the Airflow UI]

With the Command line interface:

airflow variables set my_var my_value
airflow variables set -j my_var '{"key": "value"}'

Or you can use the API:

[Screenshot: creating a Variable with the Airflow REST API]
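For instance, with the stable REST API of Airflow 2, a request like the following creates a Variable (the host, port, and basic-auth credentials below are assumptions for a local setup; adapt them to your deployment):

```shell
# Create a Variable through the Airflow 2 stable REST API.
# Host, port, and credentials are placeholders for a local installation.
curl -X POST "http://localhost:8080/api/v1/variables" \
  --user "admin:admin" \
  -H "Content-Type: application/json" \
  -d '{"key": "my_var", "value": "my_value"}'
```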

Quick exercise: Create a variable using the UI.

I won’t cover programmatically creating variables in depth, as I rarely see a good reason to do it, but if you need it, here is the way:

from airflow.models import Variable
Variable.set(key="my_regular_var", value="Hello!")

Regardless of how you create a Variable, you will always see the same fields: key, value, and description (except with the CLI, which has no description field). The Variable will be stored in the DB unless… you use environment variables!

Airflow variables with Environment Variables

Remember: An environment variable is a user-definable value that can affect how running processes behave on a computer. Environment variables are part of the environment in which a process runs (cf Wikipedia).

The last sentence is important. When you create a Variable with an environment variable, Airflow doesn’t store the value in the metadatabase; it stays in the environment in which Airflow runs. That means you won’t open a connection to the database each time you fetch the Variable. If you have many Variables, that can improve scheduler performance. However, they may not be accessible everywhere, especially if you run Airflow in a cluster with different workers (computers).

To create a Variable with this method, export the following environment variable:

export AIRFLOW_VAR_MY_VAR='my_value'
export AIRFLOW_VAR_MY_VAR='{"key": "value"}'

Again, if you use a distributed Airflow environment, ensure the variable is exported on the different nodes.
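Under the hood, Airflow derives the Variable key from the environment variable name: the key is uppercased and prefixed with AIRFLOW_VAR_. Here is a minimal Python sketch of that lookup (an illustration, not actual Airflow code):

```python
import os

def get_variable_from_env(key):
    # Mirror of the lookup convention: uppercase the key and
    # prepend "AIRFLOW_VAR_". Returns None when nothing matches.
    return os.environ.get("AIRFLOW_VAR_" + key.upper())

os.environ["AIRFLOW_VAR_MY_VAR"] = "my_value"
print(get_variable_from_env("my_var"))  # my_value
```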

Airflow variables with Secret Backends

If you need to store sensitive information or secrets in your Variables, use Secret Backends.

A secret backend allows the storage of arbitrary values as key-value pairs using a third-party service that adds a security layer. The AWS Secrets Manager, AWS SSM Parameter Store, Google Cloud Secrets Manager, and Hashicorp Vault are some examples.

To use the AWS Secrets Manager as a secret backend, you need to set the Airflow configuration as follows:

[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs = {"variables_prefix": "airflow/variables"}

backend defines the secret manager to use, and backend_kwargs lets you pass options to it. With this configuration, when you create a variable in the Secrets Manager, you must add the prefix airflow/variables to its name, for example airflow/variables/my_var. However, you fetch the variable with the key alone, without the prefix.
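As an illustration, creating that secret with the AWS CLI could look like this (the secret name and value are examples matching the prefix configured above; adjust them to your setup):

```shell
# Store the Variable "my_var" in AWS Secrets Manager, under the
# prefix expected by the secrets backend configuration shown above.
aws secretsmanager create-secret \
  --name "airflow/variables/my_var" \
  --secret-string "my_value"
```

In a DAG, `Variable.get("my_var")` would then resolve through the secret backend.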

How to get a variable in Airflow?

As for creating variables, there are different ways of getting variables.

With the command line interface:

airflow variables get my_var

You can export your variables:

airflow variables export my_export.json 

Be careful: anyone who can run this command can view the variables’ values in plain text. If your variables contain secrets, that can be a significant security breach.

You can retrieve a variable with the API:

[Screenshot: retrieving a Variable with the Airflow REST API]

Or programmatically in your tasks:

from airflow.decorators import dag, task
from airflow.models import Variable

@dag(...)
def my_dag():

  @task
  def my_task():
    my_var = Variable.get("my_var", default_var="default_value")

  my_task()

my_dag()

If the Variable holds a serialized JSON value, make sure to set deserialize_json=True as shown below:

from airflow.decorators import dag, task
from airflow.models import Variable

@dag(...)
def my_dag():

  @task
  def my_task():
    # default_var should match the deserialized type, hence a dict here
    my_var = Variable.get("my_var", deserialize_json=True, default_var={"key": "default_value"})
    print(my_var["key"])

  my_task()

my_dag()

Performance considerations

Here is what happens when you fetch a variable:

[Diagram: Variable lookup order — secret backend, then environment variables, then metadatabase. Credits: Vincent Beck]

Airflow checks the secret backend first, then the environment variables, then the database. So, unless the variable comes from an environment variable, retrieving it makes a request over the network. That can have dramatic performance implications if you fetch a variable outside of a task, like at the top level of your DAG file:

from airflow.decorators import dag, task
from airflow.models import Variable

# Fetched at top level: this runs at every DAG parsing loop!
my_var = Variable.get("my_var", default_var="default_value")

@dag(...)
def my_dag():

  @task
  def my_task():
    print(my_var)

  my_task()

my_dag()

Each time the Airflow scheduler parses the DAG file (every 30 seconds by default), it creates a connection, even if the DAG isn’t running! Imagine what happens with hundreds of DAGs. The parsing time keeps increasing, you put unnecessary load on your database and scheduler, and it can become costly if you rely on a third-party database or secret manager.

Avoid at all costs fetching variables outside of tasks. If you can’t, enable this configuration setting, introduced in Airflow 2.7:

AIRFLOW__SECRETS__USE_CACHE=True

This enables local caching of Variables during DAG parsing only. It can make DAG parsing faster when Variables are used in top-level code, at the expense of longer change propagation times. Note that this cache only concerns the DAG parsing step.

You can control the duration for which to consider an entry in the cache to be valid with:

AIRFLOW__SECRETS__CACHE_TTL_SECONDS=900

Entries are refreshed if they are older than 900 seconds. This is the maximum time you must wait to see a Variable change take effect.

Variables and Jinja templating

Airflow uses Jinja templating to pass dynamic information into task instances at runtime. For example:

BashOperator(
    task_id="print_logical_date",
    bash_command="echo Today is {{ ds }}",
)

prints the DAG run’s logical date as YYYY-MM-DD. If you don’t know what a logical date is, take a look at the video here.

The point is that you can use Jinja with variables:

BashOperator(
    task_id="print_my_var",
    bash_command="echo My var is {{ var.value.my_var }}",
)

And if the Variable has a JSON serialized value, you can use:

BashOperator(
    task_id="print_my_var",
    bash_command="echo My var is {{ var.json.my_var.my_key }}",
)

The interesting thing about getting variables this way is that no request is made until Airflow runs the task. That isn’t the case if you do something like this (without Jinja):

BashOperator(
    task_id="print_my_var",
    bash_command=f"echo My var is {Variable.get('my_var')}",
)

That task connects to the database even if it doesn’t run!
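To see why, here is a tiny Python sketch; the fetch function is a hypothetical stand-in for Variable.get. An f-string is evaluated as soon as the DAG file is parsed, while a Jinja template stays a plain string until the task actually runs:

```python
calls = []

def fetch():
    # Hypothetical stand-in for Variable.get: records that a request happened.
    calls.append("db request")
    return "my_value"

# f-string: fetch() runs immediately, when this file is parsed.
command_fstring = f"echo My var is {fetch()}"

# Jinja template: just a string, nothing runs until task execution.
command_jinja = "echo My var is {{ var.value.my_var }}"

print(calls)          # ['db request'] — one request made at parse time
print(command_jinja)  # the template is still unrendered
```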

Hide sensitive values

Airflow uses Fernet encryption to secure the Variables stored in its metadatabase. However, that doesn’t hide a Variable’s value on the UI or in the logs. By default, the hide_sensitive_var_conn_fields configuration is set to True, which automatically masks the values of all Airflow Variables whose key contains one of the following strings:

  • access_token
  • api_key
  • apikey
  • authorization
  • passphrase
  • passwd
  • password
  • private_key
  • secret
  • token

You can extend that list by adding comma-separated strings to the sensitive_var_conn_names configuration.
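For example, using the environment variable equivalent of that [core] setting (the keywords credential and seed are hypothetical additions for illustration):

```shell
AIRFLOW__CORE__SENSITIVE_VAR_CONN_NAMES='credential,seed'
```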

Here is an example:

[Screenshot: a Variable with a sensitive key masked in the UI]

Best practices for Airflow variables

Here are a few best practices with Airflow variables:

  • Use Variables for runtime-dependent information that does not change too frequently.
  • Create a Variable with a JSON value if you need to fetch multiple values at once. That way, you make a single request.
  • Store sensitive information in a secret backend instead of the metadatabase or environment variables.
  • Avoid fetching Variables in top-level DAG code; that creates a connection at each DAG parsing loop (every 30 seconds by default).
  • Enable AIRFLOW__SECRETS__USE_CACHE if you can’t avoid top-level fetches.
  • Use the Jinja template syntax so Variables are only rendered when tasks run.
  • Hide Variable values on the UI and in the logs by including sensitive keywords (secret, password, etc.) in their keys.

Conclusion

Airflow Variables help you avoid hardcoding the same values everywhere in your DAGs and tasks. If you need to update a value, you do it once, in the Variable, and you can easily keep track of your constant values. Don’t hesitate to use them, but keep in mind the best practices shown at the end of this tutorial.
