How to use timezones in Apache Airflow

Dealing with timezones can become a real nightmare if they are not used correctly. Understanding how timezones work in Apache Airflow is important because you may want to schedule your DAGs according to your local time zone, which can lead to surprises when DST (Daylight Saving Time) kicks in. There are some subtle notions to grasp, such as aware versus naive datetime objects and timedelta objects versus cron expressions. After reading this article, you should be able to trigger your DAGs at the time you expect, whatever the time zone used. Let’s begin!

According to Wikipedia: A time zone is a region of the globe that observes a uniform standard time for legal, commercial and social purposes. Basically, it represents the local time of a region or a country. For example, in Paris where I’m living it is 23:00 whereas in New York it is 17:00. The difference between both hours is expressed by the offset from Coordinated Universal Time (UTC), the world’s time standard. For example, Paris is at UTC+2 hours (in summer) and New York is at UTC-4 hours (in summer), so 23:00 in Paris and 17:00 in New York are the same instant, 21:00 UTC. If I specified “in summer” it’s because those offsets, +2 and -4, change with DST or Daylight Saving Time. DST is the practice of setting the clocks forward one hour from standard time during the summer months, and back again in the fall. For example, in France the offset is UTC+1 in winter and UTC+2 in summer. Concretely, the UTC instant that reads 17:00 on a French clock in late October reads 18:00 in August. There are many timezones in the world, so let me show you a representation of them with the following map.
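To make the offsets concrete, here is a minimal sketch in plain Python. It assumes the fixed summer offsets quoted above (UTC+2 for Paris, UTC-4 for New York) and an arbitrary August date; the constant names are made up for illustration and are not real tz database zones.

```python
from datetime import datetime, timedelta, timezone

# Fixed summer offsets taken from the text (illustrative names, assumed values).
PARIS_SUMMER = timezone(timedelta(hours=2), "UTC+2")
NEW_YORK_SUMMER = timezone(timedelta(hours=-4), "UTC-4")

# 23:00 on a Paris clock on an arbitrary summer day.
paris_time = datetime(2019, 8, 1, 23, 0, tzinfo=PARIS_SUMMER)

# The same instant, read on a New York clock and in UTC.
new_york_time = paris_time.astimezone(NEW_YORK_SUMMER)
utc_time = paris_time.astimezone(timezone.utc)

print("Paris:   ", paris_time.isoformat())
print("New York:", new_york_time.isoformat())
print("UTC:     ", utc_time.isoformat())
```

The three values describe one single instant; only the offset used to display it changes.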

The map above is divided into slices where each slice is associated with a colour and a time zone. You can find the legend at the bottom of the map, with UTC +1, +2, +3 on the right side and UTC -1, -2, -3 on the left side. For example, the time zone for France is either UTC+1 or UTC+2, depending on whether DST is in place. Alright, now that we know the basics, let’s move to the Apache Airflow part.

Timezones in Apache Airflow

What's the problem with timezones in Apache Airflow?

Let’s imagine you have a DAG used to process data. You set it to be triggered every day of the week at 2 AM in Amsterdam local time (UTC+1). You could end up with a DAG like this:

from datetime import datetime

import pendulum
from airflow import DAG
from airflow.utils import timezone
from airflow.operators.dummy_operator import DummyOperator

# Only used to display the next execution date in local time.
# The DAG itself stays in UTC because start_date carries no tzinfo.
local_tz = pendulum.timezone("Europe/Amsterdam")

default_args = {
    'start_date': datetime(2019, 3, 29, 1),  # naive datetime, interpreted as UTC (tzinfo=local_tz not set)
    'owner': 'Airflow'
}

# Cron expression: every day at 1 AM UTC, i.e. 2 AM in Amsterdam while the offset is UTC+1
with DAG(dag_id='tz_dag', schedule_interval='0 1 * * *', default_args=default_args) as dag:
    dummy_task = DummyOperator(task_id='dummy_task')
    run_dates = dag.get_run_dates(start_date=dag.start_date)
    next_execution_date = run_dates[-1] if len(run_dates) != 0 else None
    # Debug output, visible in the web server logs
    print('datetime from Python is Naive: {0}'.format(timezone.is_naive(datetime(2019, 9, 19))))
    print('datetime from Airflow is Aware: {0}'.format(not timezone.is_naive(timezone.datetime(2019, 9, 19))))
    print('DAG timezone: {0} - start_date: {1} - schedule_interval: {2} - Last execution_date: {3} - next execution_date {4} in UTC - next execution_date {5} in local time'.format(
        dag.timezone,
        dag.default_args['start_date'],
        dag._schedule_interval,
        dag.latest_execution_date,
        next_execution_date,
        local_tz.convert(next_execution_date) if next_execution_date is not None else None
        ))

By default the DAG is configured in UTC, meaning that the start_date and the schedule_interval are interpreted in UTC. Basically, tz_dag starts on March 29, 2019 at 1 AM UTC, which is 2 AM in UTC+1 (Amsterdam). The schedule_interval is set with a cron expression indicating that the DAG must be triggered every day at 1 AM UTC and so, at 2 AM in Amsterdam. I also added three print statements at the bottom of the DAG in order to give some useful information that will be printed out in the web server logs. Let’s run the DAG to see what we get. As you may know, Airflow triggers a DAG AFTER the start_date + schedule_interval period has elapsed. So, in order to execute the DAG without waiting, we should set the current date (of your computer/VM) to March 30, 2019 at 2:01 AM as shown below:

from datetime import datetime

import pendulum
from airflow import DAG
from airflow.utils import timezone
from airflow.operators.dummy_operator import DummyOperator

# Only used to display the next execution date in local time.
local_tz = pendulum.timezone("Europe/Amsterdam")

default_args = {
    'start_date': datetime(2019, 3, 29, 1),  # naive datetime, interpreted as UTC (tzinfo=local_tz not set)
    'owner': 'Airflow'
}

# Same DAG as before, this time with catchup=False
with DAG(dag_id='tz_dag', schedule_interval='0 1 * * *', default_args=default_args, catchup=False) as dag:
    dummy_task = DummyOperator(task_id='dummy_task')
    run_dates = dag.get_run_dates(start_date=dag.start_date)
    next_execution_date = run_dates[-1] if len(run_dates) != 0 else None
    print('datetime from Python is Naive: {0}'.format(timezone.is_naive(datetime(2019, 9, 19))))
    print('datetime from Airflow is Aware: {0}'.format(not timezone.is_naive(timezone.datetime(2019, 9, 19))))
    print('DAG timezone: {0} - start_date: {1} - schedule_interval: {2} - Last execution_date: {3} - next execution_date {4} in UTC - next execution_date {5} in local time'.format(
        dag.timezone,
        dag.default_args['start_date'],
        dag._schedule_interval,
        dag.latest_execution_date,
        next_execution_date,
        local_tz.convert(next_execution_date) if next_execution_date is not None else None
        ))
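The date of March 30 at 2:01 AM follows from the rule that a run is triggered once start_date plus one schedule interval has elapsed. Here is a minimal sketch of that arithmetic in plain Python, assuming a one-day interval (what the cron expression '0 1 * * *' amounts to) and a fixed UTC+1 offset for Amsterdam before DST; the variable names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

# Amsterdam before the DST switch: fixed UTC+1 offset (assumption for this sketch).
AMSTERDAM_WINTER = timezone(timedelta(hours=1), "UTC+1")

start_date = datetime(2019, 3, 29, 1, 0, tzinfo=timezone.utc)
schedule_interval = timedelta(days=1)  # daily schedule, like '0 1 * * *'

# The first run is triggered once the first interval is over.
first_trigger = start_date + schedule_interval
first_trigger_local = first_trigger.astimezone(AMSTERDAM_WINTER)

print("first trigger (UTC):  ", first_trigger.isoformat())
print("first trigger (local):", first_trigger_local.isoformat())
```

Setting the machine clock to one minute after that local time is enough for the scheduler to consider the first interval complete.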

Then, after having turned on the toggle of the DAG tz_dag from the Airflow UI, if you refresh the page, you should obtain the following view:

airflow_ui_time_zones_airflow

As you can see, the DAG has been triggered as expected. Now, we are going to change the date again, but this time we set it just before Daylight Saving Time starts in Europe, as shown below:

airflow_ui_time_zones_airflow

When DST kicks in, clocks in Europe jump from 2 AM straight to 3 AM, so the time zone shifts from UTC+1 to UTC+2. Once it’s done, if you refresh the page of the Airflow UI, you will get the following screen:

ui_time_zones_in_apache_airflow

Notice that the date shown in UTC in the Airflow UI is broken once the shift from UTC+1 to UTC+2 has happened. Indeed, since it is 3:01 in local time, the Airflow UI should show 01:01 in UTC. By the way, you can see from the execution_date of the DAG that it got triggered on March 31, 2019 at 1 AM UTC as expected. Alright. Now the question is, is the DAG still going to be triggered at 2 AM in the local time zone (UTC+2 now)? Well, let’s change the current date to April 1, 2019 at 2 AM and see what happens.

clock_airflow_time_zones

I can already tell you, the DAG won’t be executed. Why? Because, if you remember, I told you that by default the DAG is configured in UTC, and so is the schedule_interval. You can see this from the logs given by the web server below:

airflow_logs_time_zones

Pay attention to the time zone of the DAG as well as the difference of 2 hours between the next execution_date in UTC and the next execution_date in local time. Since we set the cron expression to run every day at 1 AM UTC, the DAG will always be scheduled at 1 AM UTC, whatever the current time zone. Because the local time zone has shifted to UTC+2, the DAG will run at 3 AM in local time (1 AM UTC), which is wrong. If we set the DAG to run at 2 AM in local time, it should stay at 2 AM even when DST is in place. We have diagnosed the problem, so let’s see how we can fix it.
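The drift can be reproduced without Airflow. The sketch below is plain Python and rests on one stated assumption: Amsterdam follows the EU rule where DST starts on the last Sunday of March and ends on the last Sunday of October, both at 01:00 UTC. The helper names last_sunday_01utc and amsterdam_local are made up for this illustration.

```python
from datetime import datetime, timedelta, timezone


def last_sunday_01utc(year, month):
    """Last Sunday of a 31-day month (March or October) at 01:00 UTC."""
    day = datetime(year, month, 31, 1, 0, tzinfo=timezone.utc)
    # weekday(): Monday is 0 and Sunday is 6, so step back to the previous Sunday.
    return day - timedelta(days=(day.weekday() + 1) % 7)


def amsterdam_local(utc_dt):
    """Convert an aware UTC datetime to Amsterdam wall-clock time (simplified EU rule)."""
    dst_start = last_sunday_01utc(utc_dt.year, 3)
    dst_end = last_sunday_01utc(utc_dt.year, 10)
    hours = 2 if dst_start <= utc_dt < dst_end else 1
    return utc_dt.astimezone(timezone(timedelta(hours=hours)))


# The cron expression '0 1 * * *' fires at 01:00 UTC every day.
utc_runs = [
    datetime(2019, 3, 30, 1, 0, tzinfo=timezone.utc),
    datetime(2019, 3, 31, 1, 0, tzinfo=timezone.utc),
    datetime(2019, 4, 1, 1, 0, tzinfo=timezone.utc),
]
local_runs = [amsterdam_local(run) for run in utc_runs]

for utc_run, local_run in zip(utc_runs, local_runs):
    print(utc_run.isoformat(), "->", local_run.isoformat())
```

The UTC time never moves, but the local wall-clock time goes from 2 AM to 3 AM once the offset becomes UTC+2.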

Naive vs Aware datetime objects

Before moving to the solution, you have to know the difference between naive datetime objects and aware datetime objects. By default, a datetime object created with the Python datetime module is naive: no time zone is attached to it, not even UTC. We can do a simple check with Python:

Naive datetime objects with timezones in Apache Airflow
Naive datetime objects from the Python library datetime

As you can see from the screenshot above, none of these results has timezone information. The problem is, as soon as you use one of these objects, there is no way to know what the time zone is and they can be misinterpreted. Conversely, an aware datetime object is one where the timezone is specified within the object. For example, if we use the timezone module of Airflow we obtain the following result:

Aware datetime objects with timezones in Apache Airflow
Aware datetime objects from airflow.utils

Can you see the difference? This time, we have an additional attribute named tzinfo indicating the timezone of the object.
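The same check can be sketched with the standard library alone. The is_naive helper below is a hypothetical stand-in written for this example (it is not imported from Airflow); it treats a datetime as naive when it carries no UTC offset.

```python
from datetime import datetime, timezone


def is_naive(dt):
    """Return True when the datetime carries no UTC offset."""
    return dt.utcoffset() is None


naive = datetime(2019, 9, 19)                       # no tzinfo at all
aware = datetime(2019, 9, 19, tzinfo=timezone.utc)  # tzinfo explicitly set to UTC

print(repr(naive), "naive:", is_naive(naive))
print(repr(aware), "naive:", is_naive(aware))

# Ordering a naive and an aware datetime is refused by Python.
try:
    naive < aware
    comparison_allowed = True
except TypeError:
    comparison_allowed = False
print("naive < aware allowed:", comparison_allowed)
```

Python refuses to order the two kinds of objects against each other, which is a good reminder that a naive datetime has no defined position on the UTC timeline.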

The Python package Pendulum is used to deal with timezones in Apache Airflow. By default, all the datetime objects you create, even naive ones, are converted into aware datetime objects by Airflow. It automatically converts the start_date and the end_date into UTC aware datetime objects, as you can see from the source code below:

start_date_end_date_converted_in_airflow_utc
Source code of Airflow where start_date and end_date are converted in UTC
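As a rough illustration of what such a conversion amounts to, here is a sketch in plain Python. It is not Airflow's source code: to_utc_aware is a hypothetical helper, and it assumes the default configuration where naive datetimes are interpreted as UTC.

```python
from datetime import datetime, timedelta, timezone


def to_utc_aware(dt):
    """Return an aware UTC datetime (hypothetical helper, not Airflow's API)."""
    if dt.utcoffset() is None:
        # Naive input: assume it was meant as UTC and simply attach the zone.
        return dt.replace(tzinfo=timezone.utc)
    # Aware input: keep the same instant, expressed in UTC.
    return dt.astimezone(timezone.utc)


naive_start = datetime(2019, 3, 29, 1)
local_start = datetime(2019, 3, 29, 2, tzinfo=timezone(timedelta(hours=1)))

print(to_utc_aware(naive_start).isoformat())
print(to_utc_aware(local_start).isoformat())
```

Both inputs end up as the same UTC instant, which is why a naive start_date of 1 AM and an aware start_date of 2 AM at UTC+1 describe the same first interval.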

Alright, now you know the difference, let’s see how to make our DAG time zone dependent.


Make your DAGs time zone dependent

Rather than a long text, let me show you how to do it in the following video:
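For readers who prefer text, the goal can be sketched in plain Python. This is only an illustration, assuming the fix is what the commented-out tzinfo=local_tz in the DAG above hints at: a start_date that carries the Europe/Amsterdam time zone, so that the schedule follows local time. The fixed offsets CET and CEST below are assumptions standing in for the real time zone rules.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets standing in for Amsterdam before and after the DST switch (assumption).
CET = timezone(timedelta(hours=1), "CET")    # winter time, UTC+1
CEST = timezone(timedelta(hours=2), "CEST")  # summer time, UTC+2

# A time zone dependent DAG keeps the same wall-clock time: 2 AM local.
# March 31 is left out: that day the clock jumps from 2 AM straight to 3 AM.
local_runs = [
    datetime(2019, 3, 30, 2, 0, tzinfo=CET),
    datetime(2019, 4, 1, 2, 0, tzinfo=CEST),
]

# What moves is the UTC time at which the run happens.
utc_runs = [run.astimezone(timezone.utc) for run in local_runs]

for local_run, utc_run in zip(local_runs, utc_runs):
    print(local_run.isoformat(), "->", utc_run.isoformat())
```

Compared with the UTC-only DAG, the situation is reversed: the local time stays at 2 AM and the UTC time moves from 1 AM to midnight.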

 

Cron expressions VS Timedelta objects

The last thing I want to tell you is the difference between cron expressions and timedelta objects. In Airflow, in order to schedule your DAGs, you can either use a cron expression or a timedelta object. So which one should you choose? If you refer to the documentation, it is explicitly indicated as a best practice to stick with cron expressions. But is there any impact with timezones? The answer is yes. If you use a cron expression on a time zone aware DAG and set your schedule_interval to 5 AM local time (UTC+1), the DAG will keep running at 5 AM local time after the Daylight Saving Time switch. On the other hand, if you use a timedelta object of 1 day for example, the interval of exactly 24 hours is kept, so the local time of the run shifts with DST. Rather than running at 5 AM (UTC+1), it will run at 6 AM (UTC+2).
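The difference can be checked with a small calculation in plain Python. This is a sketch under stated assumptions: fixed offsets CET (UTC+1) and CEST (UTC+2) stand in for Amsterdam around the switch of March 31, 2019, and the variable names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

CET = timezone(timedelta(hours=1), "CET")    # before the switch (assumed fixed offset)
CEST = timezone(timedelta(hours=2), "CEST")  # after the switch (assumed fixed offset)

# Last run before the DST switch: 5 AM local time on March 30.
previous_run = datetime(2019, 3, 30, 5, 0, tzinfo=CET)

# timedelta schedule: exactly 24 hours later on the UTC timeline.
timedelta_run = (previous_run.astimezone(timezone.utc) + timedelta(days=1)).astimezone(CEST)

# cron schedule on a time zone aware DAG: same wall-clock time the next day.
cron_run = datetime(2019, 3, 31, 5, 0, tzinfo=CEST)

print("timedelta run:", timedelta_run.isoformat())
print("cron run:     ", cron_run.isoformat())
print("real gap with cron:", cron_run - previous_run)
```

With the timedelta the gap is always 24 hours and the wall-clock time moves to 6 AM; with the cron expression the wall-clock time stays at 5 AM and the gap on that particular day shrinks to 23 hours.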

Now you could say “Ok, so I’m going to stick with cron expressions as shown in the video”. Yes and no. The example in the video works when the DST switch happens because the catchup parameter of the DAG is set to True by default. Indeed, if you try with catchup set to False, you will see that the DAG won’t be triggered at all on March 31, 2019. Why? Because the scheduler thinks we are already one hour past the schedule_interval we defined, since local time jumps from 2 AM to 3 AM (UTC+1 => UTC+2), which is not what we want. So what should you do? Well, it really depends on your use case. If it’s not a problem for you to see your DAG getting triggered 1 hour later in local time, then you can use a timedelta object. Conversely, if that is not acceptable, then you should use a cron expression with the catchup parameter activated. Be aware that setting catchup to True has an impact on your DAG execution: the scheduler also creates a run for every interval that was missed since the start_date.


Conclusion

As you have seen, timezones in Apache Airflow are easy to implement but can induce strange behaviours if you don’t understand how they work, especially when changes such as Daylight Saving Time happen. We have also discovered the difference between cron expressions and timedelta objects.

I hope you enjoyed this tutorial. Don’t hesitate to leave me a comment. Wish you the best. Take care.
