GPU Accelerated Polars — Intuitively and Exhaustively Explained

Fast Dataframes for Big Problems

Sep 17, 2024

“Accelerated Polar” by Daniel Warfield using Midjourney. All images by the author unless otherwise specified. Article originally made available on Intuitively and Exhaustively Explained.

I was recently in a secret demo run by the Cuda and Polars team. They passed me through a metal detector, put a bag over my head, and drove me to a shack in the woods of rural France. They took my phone, wallet, and passport to ensure I wouldn’t spill the beans before finally showing off what they’ve been working on.

Or, that’s what it felt like. In reality it was a zoom meeting where they politely asked me not to say anything until a specified time, but as a tech writer the mystery had me feeling a little like James Bond.

In this article we’ll discuss the content of that meeting: a new execution engine in Polars that enables GPU accelerated computation, allowing for interactive data manipulation of 100GB+ of data. We’ll discuss what a data frame is in polars, how GPU acceleration works with polars dataframes, and how much of a boost to performance one can expect with the new CUDA powered execution engine.

Who is this useful for? Anyone who works with data and wants to work faster.

How advanced is this post? This post contains simple but cutting-edge data engineering concepts. It’s relevant to readers of all levels.

Pre-requisites: None

Note: At the time of writing I am not affiliated with or endorsed by Polars or Nvidia in any way.

Polars In a Nutshell

In polars you can create and manipulate data frames (which are like super powerful spreadsheets). Here I’m making a simple dataframe consisting of a few people, their age, and the city they live in.

""" Creating a simple dataframe in polars
"""
import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Jill", "William"],
    "age": [25, 30, 35, 22, 40],
    "city": ["New York", "Los Angeles", "Chicago", "New York", "Chicago"]
})

print(df)

shape: (5, 3)
┌─────────┬─────┬─────────────┐
│ name    ┆ age ┆ city        │
│ ---     ┆ --- ┆ ---         │
│ str     ┆ i64 ┆ str         │
╞═════════╪═════╪═════════════╡
│ Alice   ┆ 25  ┆ New York    │
│ Bob     ┆ 30  ┆ Los Angeles │
│ Charlie ┆ 35  ┆ Chicago     │
│ Jill    ┆ 22  ┆ New York    │
│ William ┆ 40  ┆ Chicago     │
└─────────┴─────┴─────────────┘

Using this dataframe, you can do things like filter by age,

""" Filtering the previously defined dataframe to only show rows that have
an age of over 28
"""
df_filtered = df.filter(pl.col("age") > 28)
print(df_filtered)

shape: (3, 3)
┌─────────┬─────┬─────────────┐
│ name    ┆ age ┆ city        │
│ ---     ┆ --- ┆ ---         │
│ str     ┆ i64 ┆ str         │
╞═════════╪═════╪═════════════╡
│ Bob     ┆ 30  ┆ Los Angeles │
│ Charlie ┆ 35  ┆ Chicago     │
│ William ┆ 40  ┆ Chicago     │
└─────────┴─────┴─────────────┘

you can do math,

""" Creating a new column called "age_doubled" which is double the age
column.
"""
df = df.with_columns([
    (pl.col("age") * 2).alias("age_doubled")
])

print(df)

shape: (5, 4)
┌─────────┬─────┬─────────────┬─────────────┐
│ name    ┆ age ┆ city        ┆ age_doubled │
│ ---     ┆ --- ┆ ---         ┆ ---         │
│ str     ┆ i64 ┆ str         ┆ i64         │
╞═════════╪═════╪═════════════╪═════════════╡
│ Alice   ┆ 25  ┆ New York    ┆ 50          │
│ Bob     ┆ 30  ┆ Los Angeles ┆ 60          │
│ Charlie ┆ 35  ┆ Chicago     ┆ 70          │
│ Jill    ┆ 22  ┆ New York    ┆ 44          │
│ William ┆ 40  ┆ Chicago     ┆ 80          │
└─────────┴─────┴─────────────┴─────────────┘

and you can perform aggregate functions, like calculating the average age in a city.

""" Calculating the average age by city
"""
df_aggregated = df.group_by("city").agg(pl.col("age").mean())
print(df_aggregated)

shape: (3, 2)
┌─────────────┬──────┐
│ city        ┆ age  │
│ ---         ┆ ---  │
│ str         ┆ f64  │
╞═════════════╪══════╡
│ Chicago     ┆ 37.5 │
│ New York    ┆ 23.5 │
│ Los Angeles ┆ 30.0 │
└─────────────┴──────┘

Most people reading this are probably familiar with Pandas, which is Python’s more popular dataframe library. I think, before we get into GPU accelerated polars, it might be useful to explore an important feature which differentiates Polars from Pandas.

Polars LazyFrames

Polars has two fundamental execution modes, “eager” and “lazy”. An eager dataframe does calculations as soon as they are called, exactly as they are called. If you add 2 to every value in a column, then add 3 to every value in that column, each of those operations will execute just like how you would expect when using an eager dataframe. Every value gets 2 added to it, then each of those values gets 3 added to it, right at the exact moment you tell polars those operations should happen.

import polars as pl

# Create a DataFrame with a list of numbers
df = pl.DataFrame({
    "numbers": [1, 2, 3, 4, 5]
})

# Add 2 to every number and overwrite the original 'numbers' column
df = df.with_columns(
    pl.col("numbers") + 2
)

# Add 3 to the updated 'numbers' column
df = df.with_columns(
    pl.col("numbers") + 3
)

print(df)

shape: (5, 1)
┌─────────┐
│ numbers │
│ ---     │
│ i64     │
╞═════════╡
│ 6       │
│ 7       │
│ 8       │
│ 9       │
│ 10      │
└─────────┘

If we instead initialize our dataframe with the .lazy() function, we’ll get a very different output.

import polars as pl

# Create a lazy DataFrame with a list of numbers
df = pl.DataFrame({
    "numbers": [1, 2, 3, 4, 5]
}).lazy() # <-------------------------- Lazy Initialization

# Add 2 to every number and overwrite the original 'numbers' column
df = df.with_columns(
    pl.col("numbers") + 2
)

# Add 3 to the updated 'numbers' column
df = df.with_columns(
    pl.col("numbers") + 3
)

print(df)

naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

 WITH_COLUMNS:
 [[(col("numbers")) + (3)]]
   WITH_COLUMNS:
   [[(col("numbers")) + (2)]]
    DF ["numbers"]; PROJECT */1 COLUMNS; SELECTION: "None"

Instead of getting a dataframe, we get an SQL-like expression which outlines what operations need to happen to get the dataframe that we want. We can call .collect() on this to actually run those calculations and get our dataframe.

print(df.collect())

shape: (5, 1)
┌─────────┐
│ numbers │
│ ---     │
│ i64     │
╞═════════╡
│ 6       │
│ 7       │
│ 8       │
│ 9       │
│ 10      │
└─────────┘

At first blush this might not seem useful: Who cares if we do all our calculations at the end versus throughout the code? Noone, really. The benefit of this system isn’t when calculations happen, it’s what calculations happen.

Before executing a lazy dataframe, Polars looks at the operations that have stacked up and figures out any shortcuts that could result in faster execution. This process, generally, is called “query optimization”. For instance, if we whip up a lazy dataframe then run some operations on this data we’ll get some SQL expression

# Create a DataFrame with a list of numbers
df = pl.DataFrame({
    "col_0": [1, 2, 3, 4, 5],
    "col_1": [8, 7, 6, 5, 4],
    "col_2": [-1, -2, -3, -4, -5]
}).lazy()

#doing some random operations
df = df.filter(pl.col("col_0") > 0)
df = df.with_columns((pl.col("col_1") * 2).alias("col_1_double"))
df = df.group_by("col_2").agg(pl.sum("col_1_double"))

print(df)

naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

AGGREGATE
 [col("col_1_double").sum()] BY [col("col_2")] FROM
   WITH_COLUMNS:
   [[(col("col_1")) * (2)].alias("col_1_double")]
    FILTER [(col("col_0")) > (0)] FROM

    DF ["col_0", "col_1", "col_2"]; PROJECT */3 COLUMNS; SELECTION: "None"

But if we run .explain(optimized=True) on that dataframe we’ll get a different expression, which is what polars has deemed to be a more optimal way of doing the same thing.

print(df.explain(optimized=True))

AGGREGATE
 [col("col_1_double").sum()] BY [col("col_2")] FROM
   WITH_COLUMNS:
   [[(col("col_1")) * (2)].alias("col_1_double")]
    DF ["col_0", "col_1", "col_2"]; PROJECT */3 COLUMNS; SELECTION: "[(col(\"col_0\")) > (0)]"

This optimized expression is what actually gets run when you call .collect() on a lazy dataframe.

This isn’t just fancy and fun, it can result in some pretty serious performance increases. Here I’m running the same operations on two identical dataframes, one which is eager and one which is lazy. I’m averaging the execution time over 10 runs and calculating the average speed difference.

"""Performing the same operations on the same data between two dataframes,
one with eager execution and one with lazy execution, and calculating the
difference in execution speed.
"""

import polars as pl
import numpy as np
import time

# Specifying constants
num_rows = 20_000_000  # 20 million rows
num_cols = 10          # 10 columns
n = 10  # Number of times to repeat the test

# Generate random data
np.random.seed(0)  # Set seed for reproducibility
data = {f"col_{i}": np.random.randn(num_rows) for i in range(num_cols)}

# Define a function that works for both lazy and eager DataFrames
def apply_transformations(df):
    df = df.filter(pl.col("col_0") > 0)  # Filter rows where col_0 is greater than 0
    df = df.with_columns((pl.col("col_1") * 2).alias("col_1_double"))  # Double col_1
    df = df.group_by("col_2").agg(pl.sum("col_1_double"))  # Group by col_2 and aggregate
    return df

# Variables to store total durations for eager and lazy execution
total_eager_duration = 0
total_lazy_duration = 0

# Perform the test n times
for i in range(n):
    print(f"Run {i+1}/{n}")

    # Create fresh DataFrames for each run (polars operations can be in-place, so ensure clean DF)
    df1 = pl.DataFrame(data)
    df2 = pl.DataFrame(data).lazy()

    # Measure eager execution time
    start_time_eager = time.time()
    eager_result = apply_transformations(df1)  # Eager execution
    eager_duration = time.time() - start_time_eager
    total_eager_duration += eager_duration
    print(f"Eager execution time: {eager_duration:.2f} seconds")

    # Measure lazy execution time
    start_time_lazy = time.time()
    lazy_result = apply_transformations(df2).collect()  # Lazy execution
    lazy_duration = time.time() - start_time_lazy
    total_lazy_duration += lazy_duration
    print(f"Lazy execution time: {lazy_duration:.2f} seconds")

# Calculating the average execution time
average_eager_duration = total_eager_duration / n
average_lazy_duration = total_lazy_duration / n

#calculating how much faster lazy execution was
faster = (average_eager_duration-average_lazy_duration)/average_eager_duration*100

print(f"\nAverage eager execution time over {n} runs: {average_eager_duration:.2f} seconds")
print(f"Average lazy execution time over {n} runs: {average_lazy_duration:.2f} seconds")
print(f"Lazy took {faster:.2f}% less time")

Run 1/10
Eager execution time: 3.07 seconds
Lazy execution time: 2.70 seconds
Run 2/10
Eager execution time: 4.17 seconds
Lazy execution time: 2.69 seconds
Run 3/10
Eager execution time: 2.97 seconds
Lazy execution time: 2.76 seconds
Run 4/10
Eager execution time: 4.21 seconds
Lazy execution time: 2.74 seconds
Run 5/10
Eager execution time: 2.97 seconds
Lazy execution time: 2.77 seconds
Run 6/10
Eager execution time: 4.12 seconds
Lazy execution time: 2.80 seconds
Run 7/10
Eager execution time: 3.00 seconds
Lazy execution time: 2.72 seconds
Run 8/10
Eager execution time: 4.53 seconds
Lazy execution time: 2.76 seconds
Run 9/10
Eager execution time: 3.14 seconds
Lazy execution time: 3.08 seconds
Run 10/10
Eager execution time: 4.26 seconds
Lazy execution time: 2.77 seconds

Average eager execution time over 10 runs: 3.64 seconds
Average lazy execution time over 10 runs: 2.78 seconds
Lazy took 23.75% less time

A 23.75% performance increase is nothing to sneeze at, and is enabled by lazy execution (which is not present in Pandas). Under the covers, when you use a polars lazy dataframe you’re essentially defining a high level computational graph, which polars takes and does all sorts of fancy magic to. After it optimizes that query it then executes it, meaning you get the same results you would have gotten with an eager dataframe, but usually faster.

A high level breakdown of the things that kick in after you call your query in polars. It’s worth noting, the eager execution itself has many optimization improvements, like native multicore support, which is present in and improved upon with lazy execution.

Personally, I’m a Pandas guy, and haven’t really seen a big reason to make the switch. I figured “It might be better, but it’s probably not better enough for me to switch one of my most fundamental tools”. If that sounds like you, and 23.75% doesn’t get your eyebrows ticked, boy do I have some results for you.

Introducing Polars with GPU Execution

The functionality for this is hot off the press, so I’m not 100% sure how you will be able to leverage GPU acceleration in your environment. At the time of writing I was given a wheel file (which is like a local library you can install). I’m under the impression that, to get rocking in rolling on your machine after this article has been released, you can use the following command to install polars with GPU acceleration on your machine.

pip install polars[gpu] --extra-index-url=https://pypi.nvidia.com

I also expect there might be some documentation on the polars pypi page if that doesn’t work. Regardless, once your up and running, you can start using the power of polars on your GPU. The only thing you have to do to your code is specify the GPU as the engine when you collect a lazy dataframe.

On top of comparing eager and lazy execution from the previous test, let’s also compare lazy execution with the GPU engine. We can do that with the line results = df.collect(engine=gpu_engine) where gpu_engine is specified based on the following:

gpu_engine = pl.GPUEngine(
        device=0, # This is the default
        raise_on_fail=True, # Fail loudly if we can't run on the GPU.
    )

The GPU execution engine doesn’t support all polars functionality, and will fall back to the CPU by default. By setting raise_on_fail=True we’re specifying that we want the code to raise an exception if GPU execution is not supported. Ok, here’s the actual code.

"""Performing the same operations on the same data between three dataframes,
one with eager execution, one with lazy execution, and one with lazy execution
and GPU acceleration. Calculating the difference in execution speed between the
three.
"""

import polars as pl
import numpy as np
import time

# Creating a large random DataFrame
num_rows = 20_000_000  # 20 million rows
num_cols = 10          # 10 columns
n = 10  # Number of times to repeat the test

# Generate random data
np.random.seed(0)  # Set seed for reproducibility
data = {f"col_{i}": np.random.randn(num_rows) for i in range(num_cols)}

# Defining a function that works for both lazy and eager DataFrames
def apply_transformations(df):
    df = df.filter(pl.col("col_0") > 0)  # Filter rows where col_0 is greater than 0
    df = df.with_columns((pl.col("col_1") * 2).alias("col_1_double"))  # Double col_1
    df = df.group_by("col_2").agg(pl.sum("col_1_double"))  # Group by col_2 and aggregate
    return df

# Variables to store total durations for eager and lazy execution
total_eager_duration = 0
total_lazy_duration = 0
total_lazy_GPU_duration = 0

# Performing the test n times
for i in range(n):
    print(f"Run {i+1}/{n}")

    # Create fresh DataFrames for each run (polars operations can be in-place, so ensure clean DF)
    df1 = pl.DataFrame(data)
    df2 = pl.DataFrame(data).lazy()
    df2 = pl.DataFrame(data).lazy()

    # Measure eager execution time
    start_time_eager = time.time()
    eager_result = apply_transformations(df1)  # Eager execution
    eager_duration = time.time() - start_time_eager
    total_eager_duration += eager_duration
    print(f"Eager execution time: {eager_duration:.2f} seconds")

    # Measure lazy execution time
    start_time_lazy = time.time()
    lazy_result = apply_transformations(df2).collect()  # Lazy execution
    lazy_duration = time.time() - start_time_lazy
    total_lazy_duration += lazy_duration
    print(f"Lazy execution time: {lazy_duration:.2f} seconds")

    # Defining GPU Engine
    gpu_engine = pl.GPUEngine(
        device=0, # This is the default
        raise_on_fail=True, # Fail loudly if we can't run on the GPU.
    )

    # Measure lazy execution time
    start_time_lazy_GPU = time.time()
    lazy_result = apply_transformations(df2).collect(engine=gpu_engine)  # Lazy execution with GPU
    lazy_GPU_duration = time.time() - start_time_lazy_GPU
    total_lazy_GPU_duration += lazy_GPU_duration
    print(f"Lazy execution time: {lazy_GPU_duration:.2f} seconds")

# Calculating the average execution time
average_eager_duration = total_eager_duration / n
average_lazy_duration = total_lazy_duration / n
average_lazy_GPU_duration = total_lazy_GPU_duration / n

#calculating how much faster lazy execution was
faster_1 = (average_eager_duration-average_lazy_duration)/average_eager_duration*100
faster_2 = (average_lazy_duration-average_lazy_GPU_duration)/average_lazy_duration*100
faster_3 = (average_eager_duration-average_lazy_GPU_duration)/average_eager_duration*100

print(f"\nAverage eager execution time over {n} runs: {average_eager_duration:.2f} seconds")
print(f"Average lazy execution time over {n} runs: {average_lazy_duration:.2f} seconds")
print(f"Average lazy execution time over {n} runs: {average_lazy_GPU_duration:.2f} seconds")
print(f"Lazy was {faster_1:.2f}% faster than eager")
print(f"GPU was {faster_2:.2f}% faster than CPU Lazy and {faster_3:.2f}% faster than CPU eager")

Run 1/10
Eager execution time: 0.74 seconds
Lazy execution time: 0.66 seconds
Lazy execution time: 0.17 seconds
Run 2/10
Eager execution time: 0.72 seconds
Lazy execution time: 0.65 seconds
Lazy execution time: 0.17 seconds
Run 3/10
Eager execution time: 0.82 seconds
Lazy execution time: 0.76 seconds
Lazy execution time: 0.17 seconds
Run 4/10
Eager execution time: 0.81 seconds
Lazy execution time: 0.69 seconds
Lazy execution time: 0.18 seconds
Run 5/10
Eager execution time: 0.79 seconds
Lazy execution time: 0.66 seconds
Lazy execution time: 0.18 seconds
Run 6/10
Eager execution time: 0.75 seconds
Lazy execution time: 0.63 seconds
Lazy execution time: 0.18 seconds
Run 7/10
Eager execution time: 0.77 seconds
Lazy execution time: 0.72 seconds
Lazy execution time: 0.18 seconds
Run 8/10
Eager execution time: 0.77 seconds
Lazy execution time: 0.72 seconds
Lazy execution time: 0.17 seconds
Run 9/10
Eager execution time: 0.77 seconds
Lazy execution time: 0.72 seconds
Lazy execution time: 0.17 seconds
Run 10/10
Eager execution time: 0.77 seconds
Lazy execution time: 0.70 seconds
Lazy execution time: 0.17 seconds

Average eager execution time over 10 runs: 0.77 seconds
Average lazy execution time over 10 runs: 0.69 seconds
Average lazy execution time over 10 runs: 0.17 seconds
Lazy was 10.30% faster than eager
GPU was 74.78% faster than CPU Lazy and 77.38% faster than CPU eager

(Note: this is a similar test as the previous test, but on a different, larger machine. Thus, the execution times are different from the previous test)

Yep. 74.78% faster . And this isn’t even a particularly large dataset. One might expect larger performance increases for larger datasets.

Unfortunately, I can’t share the presentation provided by the Nvidia and Polars team, but I can describe what I understand to be happening under the hood. Basically, Polars has a few execution engines it uses for a variety of tasks, and they basically just tacked on another one with GPU support.

After you put in a bunch of queries, the query optimizer optimizes the query and sends operations to one of a variety of execution engines. A new one, now, is a GPU powered execution engine.

As far as I understand, these engines are called on the fly, both based on the hardware available and the query being executed. Some queries are highly parallelizable and do exceedingly well on the GPU, while less parallelizable operations might be done with the In-Memory engine on the CPU. This, in theory, makes CUDA accelerated polars almost always faster, which I have found to be anecdotally true, with especially large speedups on larger datasets.

Abstracted Memory Management

One of the key ideas the Nvidia team presented was that the new query optimizer was clever enough to handle memory management across the CPU and GPU. For those of you that don’t have a lot of GPU programming experience, the CPU and GPU have different memory, with the CPU utilizing RAM to store information and the GPU utilizing vRAM which is housed on the GPU itself.

The CPU and the GPU are kind of like separate computers, each with their own resources, which talk to one another. The CPU does calculations and it’s RAM stores data, while the GPU also does calculations and it’s vRAM on the graphics card also stores data. These separate and somewhat autonomous systems need to work together on complex tasks. This image is from my article on CUDA programming for AI.

One can imagine a scenario where a polars dataframe is created and executed on with the GPU execution engine. Then, one can imagine an operation which requires that dataframe to interact with another dataframe which is still on the CPU. The polars query optimizer is capable of understanding and handling this discrepancy by passing data between the CPU and GPU as needed.

To me this is a tremendous asset with a thorny inconvenience. When you’re doing big and heavy workloads (building AI models for instance) on the GPU, it’s often important to manage memory consumption fairly strictly. I imagine workflows featuring large models which take up a substantial portion of the GPU might have issues accommodating polars essentially doing whatever it wants with your data. I noticed that the GPU execution engine seems to be directed to a single GPU, so maybe machines with more than one GPU can isolate memory more strictly.

Regardless though, I think this is of little practical concern for most people. Generally speaking when doing data science stuff you start with raw data, do a bunch of data manipulation, then save artifacts which are designed to be model-ready. I struggle to think of a use case where you have to do big data manipulations and modeling jobs on the same machine at the same time. If that does seem like a big problem to you, though, the Nvidia and Polars team is currently investigating explicit memory control, which may be present in future releases. For pure data manipulation workloads, I imagine the automatic handling of RAM and vRAM will be a tremendous time saver to many data scientists and engineers out there.

Conclusion

Exciting stuff, and hot off the press at that. Normally I spend weeks reviewing topics that have been established for months if not years, so a “pre-release” is a bit new for me.

Frankly, I have no idea how impactful this will be on my general workflow. I’ll likely still be rocking pandas for a lot of my cowboy coding in google colab, simply because I’m comfortable with it, but when faced with big dataframes and computationally expensive queries I think I’ll find myself reaching for Polars more often, and I expect to eventually integrate it into a core part of my workflow. A big reason for that is the astronomical speed increase of GPU accelerated Polars over virtually any other dataframe tooling.

Join Intuitively and Exhaustively Explained

At IAEE you can find:

Long form content, like the article you just read
Thought pieces, based on my experience as a data scientist, engineering director, and entrepreneur
A discord community focused on learning AI
Regular Lectures and office hours

Join the Discord!

Mohammed Osman

Sep 18·edited Sep 18

I had learned polars before but didn't go too far in using it because I was already running RAPIDS cuDF which used the GPU. now Polars uses the GPU so things are converging. The lazy method should put polars+GPU over RAPIDS cudf now. I think it really matters when you're running massive datasets

Expand full comment

1 reply by Daniel Warfield

1 more comment...

Intuitively and Exhaustively Explained