DuckDB: Getting started for Beginners

DuckDB is an in-process OLAP DBMS written in C++ blah blah blah, too complicated. Let’s start simple, shall we? DuckDB is the SQLite for Analytics. It has no dependencies, is extremely easy to set up, and is optimized to perform queries on data. In this hands-on tutorial, you will learn what DuckDB is, how to use it, and why it is essential for you. Ready? Let’s go 👇

DuckDB Use Cases

Why DuckDB? As a data analyst or a data scientist, a typical workflow is to load data from a CSV file or an S3 bucket, perform preprocessing steps and run your analysis. To achieve those things, you pull up a Jupyter notebook, import Numpy and Pandas, then execute your queries. What’s interesting is that if you look at your operations, you usually perform database operations such as joins, aggregates, filters, etc. But, instead of using a relational database management system (RDBMS), you use Pandas and Numpy.

When you didn’t know about DuckDB

The question is, why? Setting up a database and loading data in it can be a painful, slow, and frustrating experience. Whereas if it were easy, you would have everything to gain by using an RDBMS.

Say “welcome” to DuckDB! 👋

DuckDB is the easiest and fastest way to analyze data with a DB.

Local

First, DuckDB is an in-process, single-file database with no external dependencies. What does that mean? Unlike Postgres, there is no client/server to set up or external dependencies to install. With a client/server database, data transfer to/from the client is also slower than it should be, especially for local installations. DuckDB, on the other hand, is embedded in the host application process and manages a database stored in a single file. There is no client/server, and it is straightforward to install. If you need to run a DB for local data analysis, it’s the way to go!
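Here is a minimal sketch of what “embedded” means in practice, assuming you have installed the package with pip install duckdb: the database runs inside your Python process, and there is nothing else to start.

import duckdb

# No server to start or connect to: the database lives inside the Python process.
conn = duckdb.connect()  # in-memory database
print(conn.execute("SELECT 42 AS answer").fetchall())  # [(42,)]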

Performance

Second, DuckDB is highly optimized for analytical query workloads (OLAP). Because it is a column-oriented DB (along with other optimizations), complex, long-running queries that aggregate, join, or read data become blazingly fast! In fact, it’s easy to claim that ~90% of big data workloads can be handled on a single machine without any setup or computation pains with DuckDB. That’s not the case with Pandas. Pandas cannot leverage multiple CPU cores to parallelize computations, which makes long-running operations slow to complete. It also operates entirely in memory, leading to out-of-memory errors if the dataset is too big.

SQL

Whether you like it or not, SQL is more alive than ever. DuckDB supports a fairly advanced SQL command set, such as window functions and aggregates. It provides transactional guarantees (ACID properties) and follows the PostgreSQL SQL dialect. Forget about ugly and slow Pandas manipulations. You can replace all of them with elegant SQL (DuckDB offers some additions on top of SQL to make it more friendly, as shown here). Finally, DuckDB has a Python client and shines when mixed with Arrow and Parquet.
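As a taste of that SQL, here is a hedged sketch of a window function computing a 7-row moving average of the closing price, assuming a DuckDB connection conn and a table or view named stocks with Date, Close, and filename columns (we build exactly such a view later in this tutorial):

conn.execute("""
SELECT
    filename,
    Date,
    Close,
    AVG(Close) OVER (
        PARTITION BY filename
        ORDER BY Date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS close_7d_avg
FROM stocks
ORDER BY filename, Date
""").fetchall()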

To sum up, DuckDB is an actively developed open-source, embeddable, in-process, and column-oriented SQL OLAP RDBMS. Perfect for data practitioners that want to perform local analytical workloads easily and quickly.

DuckDB Limitations

DuckDB is designed to run on a single machine. That means your data has to fit on that single machine; otherwise, it doesn’t work. That said, there are very powerful machines out there that will be more than enough for 99% of your workloads. Just by looking at AWS offerings, you can get an instance with up to 1,952 GiB of memory and 128 vCPUs.

Another limitation is that DuckDB is not a multi-tenant database. Having different people from different teams developing models and sharing data on the same database will be very challenging. However, when you integrate the DB with other tools such as Airflow, S3, Parquet, and dbt, you can get a powerful data ecosystem with which teams can easily work.

Not a limitation per se, but DuckDB is not designed for transactional (OLTP) workloads and should not be used that way.

DuckDB vs SQLite

While SQLite is a general-purpose database engine, it is primarily designed for fast online transaction processing (OLTP).

SQLite is:

  • Cross-platform: A SQLite database is stored in a single file.
  • Compact and self-contained: Shipped as a small library. It doesn’t require installation or configuration and has no external dependencies.
  • Reliable: Often used in mission-critical applications such as flight software.
  • Fast: Can support tens of thousands of transactions per second.
  • ACID guarantees: Transactions are atomic, consistent, isolated, and durable.

Those features above are the same for DuckDB. That’s why it claims to be the “SQLite for analytics”. So what are the differences?

DuckDB is built from the ground up for in-process OLAP, employing columnar storage, vectorized query processing, and multi-version concurrency control optimized for ETL operations. On the other hand, SQLite uses a row-oriented storage format, meaning that SQL queries act on individual rows rather than batches of rows as in vectorized query processing, giving poorer performance for OLAP queries.

To sum up, they both share many characteristics. However, DuckDB is optimized for queries that apply aggregate calculations across large numbers of rows, whereas SQLite is optimized for fast scanning and lookup of individual rows of data. If you need to perform analytical queries, go with DuckDB; otherwise, use SQLite.

DuckDB with Python

Time to practice! We are going to perform data analysis on the Stock Market dataset. This dataset contains CSV and JSON files for all NASDAQ, S&P 500, and NYSE-listed companies. Those files describe the date, volume, open, high, low, and closing prices. It’s a 10.22 GB dataset uncompressed, and 1.0 GB zip-compressed. Let’s see what we can find out!

Ready? Let’s go!

The Setup

First, download the data and unzip it.

Dataset structure

Then, create a folder and open Visual Studio Code with a Terminal. Create a Python virtual environment as shown below:

Python Virtual Environment
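A typical way to create it, assuming a Python 3 installation and the venv name used in the rest of this tutorial, is:

python3 -m venv ./venv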

Activate the Python virtual environment with the command source ./venv/bin/activate

Now, create a new file “requirements.txt” with the following Python libraries:

duckdb==0.6.0
polars==0.14.31
SQLAlchemy==1.4.44
duckdb-engine==0.6.5

Those additional libraries will help to simplify SQL query development with the Notebook. In your terminal, run the following command to install them:

pip install -r requirements.txt

Create a new file, stock_analysis.py, with the following content:

# %%
import duckdb
import polars as pl
import sqlalchemy

VSCode is going to generate a Notebook cell automatically, exactly like with Jupyter. Run the cell with SHIFT + ENTER, then click on “Install” when prompted. (Make sure it connects to your virtual environment venv, as shown in the top right corner.)

Connect to your virtual environment

Ok, everything is ready. Let’s load the data!

Loading data with DuckDB

First, create a connection with duckdb.connect(). That function returns a connection to an in-memory database, which means all data will be lost when we exit the Python process where DuckDB runs, unless we export the data to a compressed file format (we will see that later).

conn = duckdb.connect()

If you want to persist the data, you can create a database file with:

conn = duckdb.connect(database='db.duckdb', read_only=False)

Notice the read_only parameter: when set to True, it allows multiple Python processes to read the same database file simultaneously. That only works with a database file.
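For example, a second process could open the same file in read-only mode (a minimal sketch, reusing the db.duckdb file created above):

import duckdb

# In another Python process: open the same database file in read-only mode
conn_ro = duckdb.connect(database='db.duckdb', read_only=True)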

The cool thing with DuckDB is that you can read CSV files directly in the SQL query. DuckDB provides different ways of importing data into the DB. It can load data from CSV, Parquet, JSON files, etc. It integrates with Apache Arrow and has very nice extensions, such as HTTPFS to query files over HTTP(S). For CSV files, you can use read_csv_auto or read_csv. read_csv_auto automatically attempts to infer the correct configuration to read your CSV files: no need to specify the header, the column types, etc. You should always try it first. However, we will use read_csv here, since DuckDB mis-infers the data type of a column in one of the CSV files. Also, specifying the column types makes the query faster to complete.
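For reference, the read_csv_auto version would look something like this (a sketch; on this dataset it trips over one mis-inferred column type, which is why we spell out the schema below):

conn.execute("""
SELECT COUNT(*)
FROM read_csv_auto('stock_market_data/nasdaq/csv/*.csv')
""").fetchall()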

Let’s count how many rows we have 👇

conn.execute("""
SELECT COUNT(*)
FROM read_csv('stock_market_data/nasdaq/csv/*.csv', header=True, dateformat='%d-%m-%Y', columns={'Date': 'DATE', 'Low': 'DOUBLE', 'Open': 'DOUBLE', 'Volume': 'BIGINT', 'High': 'DOUBLE', 'Close': 'DOUBLE', 'AdjustedClose': 'DOUBLE'}, filename=True)
""").fetchall()
# [(8753175,)]

As you can see, the dataset has over 8 million rows. We get the output of that query with fetchall(). That method retrieves the results of a query by transforming the data from DuckDB back into Python objects. Here, we got a list of tuples. That can be inefficient, since an individual Python object must be created for every value in the result set; think of a read on a large table. Instead, you can use other methods, such as df() or arrow(), to get the result as a Pandas DataFrame or an Arrow table.
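For example, here is the same count returned as a DataFrame (a sketch; df() assumes Pandas is available in your environment, and arrow() works the same way for Arrow tables):

df = conn.execute("""
SELECT COUNT(*) AS row_count
FROM read_csv('stock_market_data/nasdaq/csv/*.csv', header=True, dateformat='%d-%m-%Y', columns={'Date': 'DATE', 'Low': 'DOUBLE', 'Open': 'DOUBLE', 'Volume': 'BIGINT', 'High': 'DOUBLE', 'Close': 'DOUBLE', 'AdjustedClose': 'DOUBLE'}, filename=True)
""").df()
print(df)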

Loading data with DuckDB and Parquet

Parquet is a columnar file format that provides efficient storage and fast reading speed. It is super effective at minimizing table scans and compressing datasets to small sizes on disk. By converting CSV files to Parquet files, we can dramatically speed up queries to load and process them with DuckDB. More on Apache Parquet here.

Create a folder parquet under stock_market_data/nasdaq/, then convert all CSV files to Parquet with this simple loop:

import glob

PATH = 'stock_market_data/nasdaq'
for filename in glob.iglob(f'{PATH}/csv/*.csv'):
    # Build the destination path: same file name, .parquet extension
    dest = f'{PATH}/parquet/{filename.split("/")[-1][:-4]}.parquet'
    # Read the CSV with an explicit schema and write it out as a Parquet file
    conn.execute(f"""
    COPY (SELECT * FROM read_csv('{filename}', header=True, dateformat='%d-%m-%Y', columns={{'Date': 'DATE', 'Low': 'DOUBLE', 'Open': 'DOUBLE', 'Volume': 'BIGINT', 'High': 'DOUBLE', 'Close': 'DOUBLE', 'AdjustedClose': 'DOUBLE'}}, filename=True))
    TO '{dest}' (FORMAT 'parquet')""")

You should get the following files:

Parquet files

Let’s count the rows again, but this time using the Parquet files:

conn.execute("""
SELECT COUNT(*)
FROM 'stock_market_data/nasdaq/parquet/*.parquet'
""").fetchall()

That query took 700ms to complete, whereas it took 5 seconds with CSV files! Nice, isn’t it? 😎

What’s even more impressive is the disk space those files take: the CSV files take 800+ MB, whereas the Parquet files take 200 MB.

Performing data analysis with DuckDB

We can now perform analysis with the data compressed in Parquet files. Since DuckDB knows how to read Parquet files efficiently, we don’t need to load all the data into a table. Instead, we can create a View over the Parquet files and query them without copying all the data into memory.

conn.execute("""
CREATE VIEW stocks AS
SELECT * FROM 'stock_market_data/nasdaq/parquet/*.parquet'
""")

With that, you can perform queries on your Parquet files exactly as if the data were loaded in a table.

Let’s take a look at the first 10 rows:

conn.execute("""
SELECT * FROM stocks
LIMIT 10
""").fetchall()
# output:
# [(datetime.date(1980, 3, 17), 2.5, 0.0, 6525, 2.722222089767456, 2.5, 1.0753743648529053, 'stock_market_data/nasdaq/csv/KEQU.csv'),
#  (datetime.date(1980, 3, 18), 2.5, 2.5, 0, 2.5, 2.5, 1.0753743648529053, 'stock_market_data/nasdaq/csv/KEQU.csv'),
#  ...]

Remember that fetchall() returns Python objects. For a better representation, you can use df() to return a Pandas Dataframe.

Next, let’s find out the top 10 penny stocks on the most recent date:

DuckDB in action!
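A rough sketch of such a query (assuming a “penny stock” means a closing price under $5, and using the filename column as a stand-in for the ticker symbol) could look something like this:

conn.execute("""
SELECT filename, Date, Close
FROM stocks
WHERE Date = (SELECT MAX(Date) FROM stocks)
  AND Close < 5
ORDER BY Close
LIMIT 10
""").df()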
