Timezones can quickly become a nightmare when they are handled incorrectly. Understanding how timezones work in Apache Airflow matters because you may want to schedule your DAGs according to your local time zone, which can lead to surprises when DST (Daylight Saving Time) kicks in. There are some subtle notions to grasp, such as aware and naive datetime objects, and timedelta objects versus cron expressions. After reading this article, you should be able to trigger your DAGs at the time you expect, whatever the time zone used. Let’s begin!
According to Wikipedia: A time zone is a region of the globe that observes a uniform standard time for legal, commercial and social purposes. Basically, it represents the local time of a region or a country. For example, in Paris where I’m living it is 23:00 whereas in New York it is 17:00. The difference between both hours is expressed by the offset from Coordinated Universal Time (UTC), the world’s time standard. For example, Paris is at UTC+2 (in summer) and New York is at UTC-4 (in summer), so 23:00 in Paris and 17:00 in New York are the same instant: 21:00 UTC. I specified “in summer” because those offsets, +2 and -4, change with DST, or Daylight Saving Time. DST is the practice of setting the clocks forward one hour from standard time during the summer months, and back again in the fall. For example, in France the offset is UTC+1 once the clocks have gone back in the fall, and UTC+2 once they have gone forward in spring. Concretely, the same UTC time that reads 17:00 on a French clock in winter reads 18:00 in summer. There are many timezones in the world, so let me show you a representation of them with the following map.
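To make the offsets concrete, here is a minimal sketch using only the Python standard library. It assumes the two fixed summer offsets quoted above (UTC+2 for Paris, UTC-4 for New York) and does not model DST rules; the constant names and the date are only illustrative.

```python
from datetime import datetime, timedelta, timezone

# Fixed summer offsets taken from the text (assumption: DST rules are not modelled).
PARIS_SUMMER = timezone(timedelta(hours=2), "UTC+2")
NEW_YORK_SUMMER = timezone(timedelta(hours=-4), "UTC-4")

# 23:00 in Paris on an arbitrary summer day.
paris_time = datetime(2019, 8, 1, 23, 0, tzinfo=PARIS_SUMMER)

# The same instant expressed in UTC and in New York local time.
utc_time = paris_time.astimezone(timezone.utc)
new_york_time = paris_time.astimezone(NEW_YORK_SUMMER)

print(utc_time.strftime("%H:%M"))       # 21:00
print(new_york_time.strftime("%H:%M"))  # 17:00
```

The three values describe one single instant; only the way it is displayed changes with the offset.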
Timezones in Apache Airflow
What's the problem with timezones in Apache Airflow?
Let’s imagine you have a DAG used to process data. You set it to be triggered every day of the week at 2 AM in Amsterdam local time (UTC+1). You could end up with a DAG like this:
By default the DAG is configured in UTC, meaning that the start_date and the schedule_interval are interpreted in UTC. Basically, tz_dag starts on the 29 of March 2019 at 1 AM in UTC, which is 2 AM in UTC+1 (Amsterdam). The schedule_interval is set with a cron expression indicating that the DAG must be triggered every day at 1 AM in UTC, and so at 2 AM in Amsterdam. I also added three lines at the bottom of the DAG to print some useful information in the web server logs. Let’s run the DAG to see what we get. As you may know, Airflow triggers a DAG only AFTER the start_date + schedule_interval period has elapsed. So, in order to execute the DAG without waiting, we should set the current date (of your computer/VM) to the 30 of March 2019 at 2:01 AM as shown below:
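The reason for picking the 30 of March at 2:01 AM can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, not Airflow code: the daily cron expression is approximated by a one-day timedelta, and Amsterdam is modelled as a fixed UTC+1 offset, which holds before the DST switch.

```python
from datetime import datetime, timedelta, timezone

# Amsterdam before the DST switch, modelled as a fixed offset (assumption).
AMSTERDAM_WINTER = timezone(timedelta(hours=1), "UTC+1")

# Values from the text: start on the 29 of March 2019 at 1 AM UTC, daily schedule.
start_date = datetime(2019, 3, 29, 1, 0, tzinfo=timezone.utc)
schedule_interval = timedelta(days=1)  # stands in for the daily cron expression

# The first run is triggered once start_date + schedule_interval has elapsed.
first_trigger_utc = start_date + schedule_interval
first_trigger_local = first_trigger_utc.astimezone(AMSTERDAM_WINTER)

print(first_trigger_utc.isoformat())    # 2019-03-30T01:00:00+00:00
print(first_trigger_local.isoformat())  # 2019-03-30T02:00:00+01:00
```

Setting the clock to 2:01 AM local time on the 30 of March is therefore one minute after the first trigger time.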
Then, after having turned on the toggle of the DAG tz_dag from the Airflow UI, if you refresh the page, you should obtain the following view:
As you can see, the DAG has been triggered as expected. Now, we are going to change the date again, but this time we set it just before Daylight Saving Time starts in Europe, as shown below:
When DST takes effect, 2 AM becomes 3 AM in Europe, so the local offset shifts from UTC+1 to UTC+2. Once that is done, if you refresh the page of the Airflow UI, you will get the following screen:
Notice that the date shown in UTC in the Airflow UI is off once the shift from UTC+1 to UTC+2 has happened. Indeed, since it is 3:01 in local time, the Airflow UI should show 01:01 in UTC. By the way, you can see from the execution_date of the DAG that it was triggered on the 31 of March 2019 at 1 AM in UTC, as expected. Alright. Now the question is: is the DAG still going to be triggered at 2 AM in the local time zone (UTC+2 now)? Well, if we change the current date to the 1st of April 2019 at 2 AM, let’s see what happens.
I can already tell you, the DAG won’t be executed. Why? Because, if you remember, I told you that by default the DAG is configured in UTC, and so is the schedule_interval. You can see this from the logs given by the web server below:
Pay attention to the time zone of the DAG as well as the difference of 2 hours between the next execution_date in UTC and the next execution_date in local time. Since we set the cron expression to run every day at 1 AM in UTC, the DAG will always be scheduled at 1 AM in UTC, whatever the current local time zone. Because the local offset has changed to UTC+2, the DAG will run at 3 AM in local time (1 AM in UTC), which is wrong. If we want the DAG to run at 2 AM in local time, it should stay at 2 AM even when DST is in place. We have diagnosed the problem, so let’s see how we can fix it.
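The drift described above is easy to reproduce outside Airflow. The following sketch assumes two fixed offsets for Amsterdam (UTC+1 before the switch, UTC+2 after it) and a hypothetical helper, local_hour, that is not part of any library; it only shows how the same 1 AM UTC run time is read on the local clock.

```python
from datetime import datetime, timedelta, timezone

CET = timezone(timedelta(hours=1), "UTC+1")   # Amsterdam before the DST switch
CEST = timezone(timedelta(hours=2), "UTC+2")  # Amsterdam after the DST switch


def local_hour(run_utc, local_offset):
    """Return the hour shown on the local clock for a run time given in UTC."""
    return run_utc.astimezone(local_offset).hour


# The cron expression fires at 1 AM UTC every day, before and after the switch.
before_dst = datetime(2019, 3, 30, 1, 0, tzinfo=timezone.utc)
after_dst = datetime(2019, 4, 1, 1, 0, tzinfo=timezone.utc)

print(local_hour(before_dst, CET))   # 2
print(local_hour(after_dst, CEST))   # 3
```

The schedule itself never moved in UTC; what moved is the local clock, which is why the run appears one hour late.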
Naive vs Aware datetime objects
Before moving to the solution, you have to know the difference between naive datetime objects and aware datetime objects. By default, a datetime object created with the Python package datetime is naive: no time zone is attached to it, not even UTC. We can do a simple check with Python:
As you can see from the screenshot above, none of these results has timezone information. The problem is that as soon as you use one of these objects, there is no way to know which time zone it refers to, so it can be misinterpreted. Conversely, an aware datetime object is one where the timezone is specified within the object. For example, if we use the timezone module of Airflow we obtain the following result:
Can you see the difference? This time, we have an additional attribute named tzinfo indicating the timezone of the object.
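The same distinction can be reproduced with the standard library alone. The sketch below deliberately uses datetime.timezone.utc instead of Airflow's timezone module, and is_aware is a hypothetical helper written for this example; it follows the definition of an aware object given in the Python datetime documentation.

```python
from datetime import datetime, timezone


def is_aware(dt):
    """Return True when dt carries usable time zone information."""
    return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None


# Naive: no tzinfo, so the time zone is left to the reader's interpretation.
naive = datetime(2019, 3, 29, 1, 0)

# Aware: the time zone travels with the object.
aware = datetime(2019, 3, 29, 1, 0, tzinfo=timezone.utc)

print(is_aware(naive))  # False
print(is_aware(aware))  # True
print(repr(aware))      # datetime.datetime(2019, 3, 29, 1, 0, tzinfo=datetime.timezone.utc)
```

Python also refuses to compare the two kinds of objects with < or >, which is a good hint that mixing them in a DAG definition is asking for trouble.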
Apache Airflow relies on the Python package Pendulum to deal with timezones. By default, all the datetime objects you create, even naive ones, are converted into aware datetime objects by Airflow. It automatically converts the start_date and the end_date into UTC aware datetime objects, as you can see from the source code below:
Alright, now that you know the difference, let’s see how to make our DAG time zone dependent.
Make your DAGs time zone dependent
Rather than a long text, let me show you how to do it in the following video:
Cron expressions VS Timedelta objects
The last thing I want to tell you about is the difference between cron expressions and timedelta objects. In Airflow, in order to schedule your DAGs, you can use either a cron expression or a timedelta object. So which one should you choose? If you refer to the documentation, it is explicitly indicated as a best practice to stick with cron expressions. But is there any impact with timezones? The answer is yes. If you use a cron expression and set your schedule_interval to 5 AM in a time zone that is at UTC+1, the DAG will always run at 5 AM in that time zone, even after Daylight Saving Time has shifted it to UTC+2. On the other hand, if you use a timedelta object of 1 day for example, the interval itself is kept, so the DST shift shows up in the local run time. Rather than running at 5 AM (UTC+1), the DAG will run at 6 AM (UTC+2).
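The difference between the two behaviours comes down to simple arithmetic, sketched here without Airflow. The local time zone is modelled as two fixed offsets (UTC+1 before the switch of the 31 of March 2019, UTC+2 after it), and the variable names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

CET = timezone(timedelta(hours=1), "UTC+1")   # local offset before the DST switch
CEST = timezone(timedelta(hours=2), "UTC+2")  # local offset after the DST switch

# Last run before the switch: 5 AM local time at UTC+1.
last_run = datetime(2019, 3, 30, 5, 0, tzinfo=CET)

# timedelta-style schedule: exactly 24 hours later, read on the new local clock.
timedelta_next = (last_run + timedelta(days=1)).astimezone(CEST)

# cron-style schedule: same wall-clock hour on the next day, at the new offset.
cron_next = datetime(2019, 3, 31, 5, 0, tzinfo=CEST)

print(timedelta_next.strftime("%H:%M"))  # 06:00
print(cron_next.strftime("%H:%M"))       # 05:00
print(cron_next - last_run)              # 23:00:00
```

Keeping the wall-clock hour means that the interval across the switch lasts only 23 hours, whereas keeping the 24-hour interval moves the run to 6 AM on the local clock.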
Now you could say “Ok, so I’m going to stick with cron expressions as shown in the video”. Yes and no. The example in the video works when DST happens because the catchup parameter of the DAG is set to True by default. Indeed, if you try with catchup set to False, you will see that the DAG is not triggered at all on the 31 of March 2019. Why? Because the scheduler considers that we are already one hour past the schedule_interval we defined: 2 AM -> 3 AM (UTC+1 => UTC+2), which is actually wrong. So what should you do? Well, it really depends on your use case. If it’s not a problem for you to see your DAG triggered 1 hour later in local time, then you can use a timedelta object. Conversely, if that is not acceptable, you should use a cron expression with catchup activated. Keep in mind that setting catchup to True can have an impact on your DAG executions, since Airflow will also create runs for past intervals that have not been executed yet.
As you have seen, timezones in Apache Airflow are easy to implement but can induce strange behaviours if you don’t understand how they work, especially when changes such as Daylight Saving Time happen. We have also seen the difference between cron expressions and timedelta objects.
I hope you enjoyed this tutorial. Don’t hesitate to leave me a comment. Wish you the best. Take care.