Using Pandas, Polars, and DuckDB for the same task
As a regular Pandas user, I am usually satisfied with its performance. It can handle millions of records in minutes on my humble laptop, and I'm okay with waiting. However, fancy new tools keep coming out, and I am curious to explore them.
I usually use IMDb datasets for testing and demos, so today, I undertook a familiar routine: loading each TSV (tab-separated values) file in a directory and saving it as Parquet. The only difference was that I did it with three different tools: Pandas, Polars, and DuckDB.
IMDb's TSV files have a few quirks:
- While they are mostly like normal CSV files, the separator is a tab instead of a comma.
- Numeric values (years) are mixed with '\N' values, which we can replace with nulls or change the columns' types to string (I go with the second option).
- Double quotes appear in column values, so we need to ignore them explicitly.
Pandas
I'm starting with Pandas as the best-known tool. A simple version of the code looks like this:
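```python
import csv

import pandas as pd

# A sketch of the Pandas approach; the file names are illustrative,
# and pyarrow (or fastparquet) is assumed for Parquet support.
df = pd.read_csv(
    "name.basics.tsv",
    sep="\t",                # IMDb files are tab-separated, not comma-separated
    dtype=str,               # read every column as a string, so '\N' needs no special handling
    quoting=csv.QUOTE_NONE,  # double quotes are literal characters in IMDb data
)
df.to_parquet("name.basics.parquet")
```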
The `name.basics.tsv` file contains 13,774,705 rows, and it takes 15-20 seconds for my four-year-old Apple MacBook M1 to run the script.
Polars
Now, let's take a look at Polars. The Polars API is intuitive after Pandas, and I had no problems using it. The code came out a little longer because `read_csv` has no `dtype` parameter to override all column types at once, but this is not a problem:
This code runs in 14-15 seconds.
DuckDB
With DuckDB, you can use SQL:
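```python
import duckdb

# A sketch of the SQL route; the table and file names are illustrative.
# quote='' disables quote handling (double quotes are literal characters),
# and all_varchar=true reads every column as a string, matching the
# "everything is a string" approach above.
duckdb.sql("""
    CREATE TABLE names AS
    SELECT *
    FROM read_csv('name.basics.tsv', delim='\t', quote='', all_varchar=true)
""")
duckdb.sql("COPY names TO 'name.basics.parquet' (FORMAT PARQUET)")
```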
Alternatively, you can use the SDK methods:
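```python
import duckdb

# A sketch of the Python API route; parameter names have shifted between
# DuckDB versions (sep/delim/delimiter), so check your version's reference.
names = duckdb.read_csv(
    "name.basics.tsv",
    sep="\t",
    quotechar="",      # disable quote handling, as with quote='' in SQL
    all_varchar=True,  # read every column as a string
)

# read_csv returns a lazy relation: the file is only scanned here, when
# to_parquet materializes the result (no table is created along the way).
names.to_parquet("name.basics.parquet")
```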
In the last example, the table is not created: `duckdb.read_csv` returns a lazy object that is only evaluated when we call `.to_parquet`, which makes the execution even faster: around six seconds if we create a table, and four seconds if we don't.
DuckDB's API is sometimes surprising. For example, `read_csv` has both `sep` and `delim` parameters for the same thing, which is confusing when you scroll through the method's parameter list, especially because the Python SDK has no docstrings. Also, when something goes wrong, you see output like this:
```
delimiter = (Set By User)
```
So you have `sep` and `delim`, but it's called `delimiter` in the error message.
But there's nothing serious to complain about: the API is rich, and the tool works great.
Loading all IMDb datasets
In the next step, I prepared scripts to load each IMDb TSV file in a directory and save it as Parquet. I won't overload this article with code, but you can find the scripts here: https://github.com/romaklimenko/imdb
The performance results were surprising. I expected Polars to do the job faster than Pandas, but it was significantly slower. The average running times for each script are:
- Pandas: 196 seconds
- Polars: 257 seconds
- DuckDB: 46 seconds
DuckDB is way ahead of the others, but Pandas and Polars win and lose in different categories.
For example, `title.principals.tsv` has 87,769,634 rows:
- Pandas: load TSV in 49 seconds, write to Parquet in 32 seconds.
- Polars: load TSV in 140 seconds, write to Parquet in 10 seconds.
But take `title.basics.tsv`, which has 11,054,773 rows:
- Pandas: load TSV in 14 seconds, write to Parquet in 6 seconds.
- Polars: load TSV in 6 seconds, write to Parquet in 4 seconds.
Conclusion
Benchmarking tools on a single scenario is a thankless task. Sure, there are cases where Polars is faster than Pandas (see https://duckdblabs.github.io/db-benchmark/), but the difference is not so obvious on "small" datasets of several million records.
I hope these products each find their niche, especially since Pandas' and Polars' APIs are similar enough to switch between easily.
DuckDB shows impressive performance and solves many problems for which I would otherwise have to run a database server. Its API is not very polished yet, but I can tolerate that for its good features.
I'm looking forward to using these new tools for my daily tasks.