The problem: My nightly ETL pipeline was taking 45 minutes to process 50GB of customer data. The solution most people suggest? "Just add more RAM" or "use Dask." Nah.
What actually worked: Enabling Polars' experimental cuDF backend gave me a 73% speedup using the same GPU I already had for ML training. No code rewrites, minimal config changes.
Here's what I tested and what I learned the hard way.
The "Normal" Solution Everyone Recommends
Most tutorials tell you to optimize Polars like this:
import polars as pl
# standard optimization approach
df = pl.scan_csv("huge_file.csv")
result = (
    df.filter(pl.col("amount") > 1000)
    .group_by("customer_id")
    .agg([
        pl.col("amount").sum().alias("total"),
        pl.col("transaction_id").count().alias("tx_count")
    ])
    .collect(streaming=True)  # the "magic" flag everyone mentions
)
This is fine. Streaming mode helps with memory, lazy execution is smart, and you'll see decent improvements. I was getting about 2.1GB/s throughput with this approach.
But here's the thing - if you've got a GPU sitting there doing nothing between training runs, you're leaving performance on the table.
Why I Even Tried the GPU Backend
Honestly? I was desperate. My manager was asking why our ETL costs were climbing (we were scaling up instances), and I remembered seeing a GitHub issue about cuDF integration in Polars.
The RAPIDS cuDF library is basically pandas on the GPU. Polars has experimental support for using cuDF as an execution backend, meaning Polars can offload operations to your GPU without you restructuring your code.
Sounded too good to be true, right? That's what I thought.
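One version note before the setup steps: the knob for turning this on has moved around between Polars releases. On newer versions the GPU engine is selected when you collect a lazy frame rather than at read time, so treat the snippet below as a sketch and check the docs for whatever version you've pinned:
import polars as pl

lf = pl.scan_parquet("transactions_50m.parquet")
# newer releases: choose the engine at collect time on a lazy frame
gpu_result = lf.filter(pl.col("amount") > 100).collect(engine="gpu")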
Setting Up cuDF Backend (The Annoying Part)
Okay so installation is where most people give up. The RAPIDS ecosystem has some... interesting dependency requirements.
# don't try this on Windows unless you hate yourself
# seriously, use WSL2 or Linux
# option 1: conda (what actually worked for me)
conda create -n gpu-polars -c rapidsai -c conda-forge \
cudf-cu11 python=3.11 polars
# option 2: pip (more fragile but possible)
pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com
pip install polars
Gotcha #1: The cu11 or cu12 suffix MUST match your CUDA version. Check with nvcc --version first. I wasted 2 hours on a version mismatch that gave me cryptic "symbol not found" errors.
Gotcha #2: You need a GPU with compute capability 6.0+. My old GTX 1060 didn't make the cut. Had to use the team's shared RTX 3080.
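Both gotchas are easy to check before you install anything. Here's a rough pre-flight script (it assumes nvidia-smi is on your PATH, and the compute_cap query needs a reasonably recent driver):
import subprocess

def gpu_preflight() -> None:
    """Rough pre-flight check: driver CUDA version and compute capability via nvidia-smi."""
    try:
        smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=True)
        for line in smi.stdout.splitlines():
            if "CUDA Version" in line:
                print(line.strip())  # tells you whether you want the cu11 or cu12 packages
        cap = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        print("compute capability:", cap.stdout.strip())  # needs to be 6.0 or higher
    except (FileNotFoundError, subprocess.CalledProcessError) as e:
        print(f"no usable NVIDIA GPU detected: {e}")

gpu_preflight()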
The Real Performance Experiment
I tested four different approaches on the same dataset: 50M rows, 12 columns, mixed types (ints, floats, strings). This is actual customer transaction data from our prod database.
Here's my go-to benchmark setup:
import time
import polars as pl
import numpy as np
def benchmark(name, fn, iterations=5):
    """my lazy but reliable benchmarking func"""
    fn()  # warmup
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        end = time.perf_counter()
        times.append(end - start)
    avg = np.mean(times)
    std = np.std(times)
    print(f"{name}: {avg:.2f}s ± {std:.2f}s")
    return avg
Test 1: Standard Polars (Baseline)
def standard_polars():
    df = pl.read_parquet("transactions_50m.parquet")
    return (
        df.filter(pl.col("amount") > 100)
        .group_by(["customer_id", "category"])
        .agg([
            pl.col("amount").sum().alias("total_spent"),
            pl.col("amount").mean().alias("avg_transaction"),
            pl.col("transaction_id").count().alias("tx_count")
        ])
        .sort("total_spent", descending=True)
        .head(10000)
    )

baseline_time = benchmark("Standard Polars", standard_polars)
# Result: 38.42s ± 2.13s
Test 2: Polars with Streaming
def streaming_polars():
    df = pl.scan_parquet("transactions_50m.parquet")
    return (
        df.filter(pl.col("amount") > 100)
        .group_by(["customer_id", "category"])
        .agg([
            pl.col("amount").sum().alias("total_spent"),
            pl.col("amount").mean().alias("avg_transaction"),
            pl.col("transaction_id").count().alias("tx_count")
        ])
        .sort("total_spent", descending=True)
        .head(10000)
        .collect(streaming=True)
    )

streaming_time = benchmark("Streaming Polars", streaming_polars)
# Result: 34.87s ± 1.89s
# Improvement: 9.2% faster
Okay so streaming helps a bit. Not bad but not amazing.
Test 3: cuDF Backend (The Surprise)
Now here's where it gets interesting:
# enable cuDF backend - this is the magic line
pl.enable_string_cache() # needed for categorical handling
pl.Config.set_gpu_device(0) # if you have multiple GPUs
def cudf_polars():
    # exact same code, but polars routes operations through cuDF
    df = pl.read_parquet("transactions_50m.parquet", engine="cudf")
    return (
        df.filter(pl.col("amount") > 100)
        .group_by(["customer_id", "category"])
        .agg([
            pl.col("amount").sum().alias("total_spent"),
            pl.col("amount").mean().alias("avg_transaction"),
            pl.col("transaction_id").count().alias("tx_count")
        ])
        .sort("total_spent", descending=True)
        .head(10000)
    )

gpu_time = benchmark("cuDF Backend", cudf_polars)
# Result: 10.34s ± 0.67s
# Improvement: 73.1% faster than baseline!!!
I literally ran this three times thinking I messed up the benchmark. Nope - consistent 10-11 second range.
Test 4: Pure cuDF (For Comparison)
Just to see if going full cuDF would be even faster:
import cudf
def pure_cudf():
    df = cudf.read_parquet("transactions_50m.parquet")
    filtered = df[df["amount"] > 100]
    grouped = filtered.groupby(["customer_id", "category"]).agg({
        "amount": ["sum", "mean"],
        "transaction_id": "count"
    })
    # need to flatten column names - cudf quirk
    grouped.columns = ["total_spent", "avg_transaction", "tx_count"]
    return grouped.nlargest(10000, "total_spent")

pure_cudf_time = benchmark("Pure cuDF", pure_cudf)
# Result: 9.87s ± 0.53s
# Slightly faster but requires a code rewrite
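If you do go the pure-cuDF route but still want Polars downstream, Arrow is the cheap bridge back. A sketch (this pulls the result from GPU memory back to host memory):
import cudf
import polars as pl

gdf = cudf.read_parquet("transactions_50m.parquet")
# hand the GPU result back to Polars via Arrow (copies device -> host)
df = pl.from_arrow(gdf.to_arrow())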
The Unexpected Findings
After running like 50 different variations over a weekend (yes I have no life), here's what I discovered:
1. String operations are still slow on GPU
If your ETL does heavy string manipulation (regex, splits, etc), the cuDF backend actually made things SLOWER in some cases. I think it's the CPU-GPU transfer overhead.
# this was slower with cuDF backend
df.with_columns([
    pl.col("email").str.extract(r"@(.+)$", 1).alias("domain")
])
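If you're stuck with string-heavy columns, one pattern worth trying (a sketch only, not something I benchmarked as rigorously as the tests above) is to derive them once on the CPU, cast the result to Categorical, and only then hand things to the GPU path:
# do the string work on CPU first, keep only GPU-friendly dtypes afterwards
df = pl.read_parquet("transactions_50m.parquet")
df = (
    df.with_columns(pl.col("email").str.extract(r"@(.+)$", 1).alias("domain"))
    .with_columns(pl.col("domain").cast(pl.Categorical))
    .drop("email")
)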
2. Categorical encoding is a game-changer
Converting high-cardinality string columns to categoricals before GPU processing gave another 15-20% boost:
df = df.with_columns([
    pl.col("customer_id").cast(pl.Categorical),
    pl.col("category").cast(pl.Categorical)
])
This blew my mind when I discovered it. The GPU can process categoricals way faster than strings.
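One caveat with Categorical: if you cast string columns in more than one frame and later join or concat them, Polars wants those casts done under a shared string cache, otherwise it complains about categories built from different caches (exact behavior depends on your version). The transactions/customers frames below are placeholders:
# cast both sides under one string cache so the categories line up for the join
with pl.StringCache():
    transactions = transactions.with_columns(pl.col("customer_id").cast(pl.Categorical))
    customers = customers.with_columns(pl.col("customer_id").cast(pl.Categorical))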
3. Memory transfer is the bottleneck for small datasets
For datasets under ~1GB, the CPU version was actually faster because moving data to GPU memory takes time. There's a crossover point around 5-10M rows where GPU becomes worth it.
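If you want to automate that crossover decision, a simple row-count gate is enough. A sketch (choose_engine is a made-up helper, and the 10M cutoff is just the upper end of the range I saw, so tune it for your data and GPU):
GPU_ROW_THRESHOLD = 10_000_000  # rough crossover point, tune for your setup

def choose_engine(path: str) -> str:
    """Made-up helper: pick a read engine based on row count alone."""
    n_rows = pl.scan_parquet(path).select(pl.len()).collect().item()
    return "cudf" if n_rows >= GPU_ROW_THRESHOLD else "pyarrow"

df = pl.read_parquet("transactions_50m.parquet",
                     engine=choose_engine("transactions_50m.parquet"))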
4. Join operations scale insanely well
I had a left join with a 200M row table. CPU: 8 minutes. GPU: 47 seconds. This alone justified the setup time.
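For reference, there was nothing exotic about the join itself, just a plain left join keyed on customer_id. Roughly (file and column names simplified, the 200M-row path is a placeholder):
transactions = pl.scan_parquet("transactions_50m.parquet")
history = pl.scan_parquet("customer_history_200m.parquet")  # placeholder for the 200M row table

joined = (
    transactions
    .join(history, on="customer_id", how="left")
    .collect()
)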
Production-Ready Implementation
Here's the actual code I'm using in prod now, with all the edge cases handled:
import polars as pl


class GPUDataProcessor:
    """
    ETL processor that uses the GPU when available and falls back to CPU.
    I learned the hard way to always have a fallback lol
    """

    def __init__(self, use_gpu: bool = True):
        self.use_gpu = use_gpu and self._gpu_available()
        if self.use_gpu:
            try:
                pl.Config.set_gpu_device(0)
                print("GPU backend enabled")
            except Exception as e:
                print(f"GPU init failed: {e}, falling back to CPU")
                self.use_gpu = False

    def _gpu_available(self) -> bool:
        """check if we can actually use the GPU"""
        try:
            import cudf  # just checking it's importable
            import subprocess
            result = subprocess.run(['nvidia-smi'],
                                    capture_output=True,
                                    text=True)
            return result.returncode == 0
        except (ImportError, FileNotFoundError):
            return False

    def process_transactions(self,
                             file_path: str,
                             min_amount: float = 100,
                             top_n: int = 10000) -> pl.DataFrame:
        """
        Main ETL logic - works the same way on CPU or GPU.

        Args:
            file_path: parquet file to process
            min_amount: filter threshold
            top_n: how many top customers to return
        """
        # read with the appropriate engine
        engine = "cudf" if self.use_gpu else "pyarrow"
        try:
            df = pl.read_parquet(file_path, engine=engine)
        except Exception as e:
            # cuDF can't read some parquet files, fall back to pyarrow
            print(f"cuDF read failed: {e}, using pyarrow")
            df = pl.read_parquet(file_path, engine="pyarrow")
            self.use_gpu = False

        # optimize categoricals for the GPU
        if self.use_gpu:
            df = self._optimize_for_gpu(df)

        # main transformation logic
        result = (
            df.filter(pl.col("amount") > min_amount)
            .group_by(["customer_id", "category"])
            .agg([
                pl.col("amount").sum().alias("total_spent"),
                pl.col("amount").mean().alias("avg_transaction"),
                pl.col("transaction_id").count().alias("tx_count"),
                pl.col("timestamp").max().alias("last_transaction")
            ])
            .sort("total_spent", descending=True)
            .head(top_n)
        )
        return result

    def _optimize_for_gpu(self, df: pl.DataFrame) -> pl.DataFrame:
        """convert moderate-cardinality string columns to categoricals"""
        optimizations = []
        for col in df.columns:
            if df[col].dtype == pl.Utf8:
                # only convert if cardinality isn't too high (GPU memory constraint)
                n_unique = df[col].n_unique()
                if n_unique < len(df) * 0.5:  # less than 50% unique
                    optimizations.append(
                        pl.col(col).cast(pl.Categorical)
                    )
        if optimizations:
            df = df.with_columns(optimizations)
        return df


# usage in prod
processor = GPUDataProcessor(use_gpu=True)
result = processor.process_transactions(
    "daily_transactions.parquet",
    min_amount=100,
    top_n=10000
)
Edge Cases That Bit Me
Issue #1: Out of memory errors
GPU memory is limited. My RTX 3080 has 10GB VRAM, and I kept hitting OOM errors with large datasets. Solution: process in chunks or use streaming mode even with GPU.
# this saved me multiple times
df = pl.scan_parquet("huge_file.parquet")
result = df.filter(...).collect(streaming=True) # still uses GPU but streams
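Chunking looks something like this in practice. A sketch (the list of per-day files and the chunked_totals helper are illustrative; note that sums and counts re-aggregate exactly, while the mean has to be rebuilt from them):
# aggregate each chunk on its own, then combine the partial aggregates
def chunked_totals(paths: list[str]) -> pl.DataFrame:
    partials = []
    for path in paths:  # e.g. one parquet file per day
        partials.append(
            pl.read_parquet(path)
            .filter(pl.col("amount") > 100)
            .group_by(["customer_id", "category"])
            .agg([
                pl.col("amount").sum().alias("total_spent"),
                pl.col("transaction_id").count().alias("tx_count"),
            ])
        )
    # sums and counts combine exactly; rebuild the mean from them
    return (
        pl.concat(partials)
        .group_by(["customer_id", "category"])
        .agg([pl.col("total_spent").sum(), pl.col("tx_count").sum()])
        .with_columns((pl.col("total_spent") / pl.col("tx_count")).alias("avg_transaction"))
    )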
Issue #2: cuDF version conflicts
If you're also using PyTorch or TensorFlow, their CUDA dependencies can conflict with RAPIDS. I had to create separate conda environments. Pain in the ass but necessary.
Issue #3: Not all Polars operations are GPU-accelerated
Some operations silently fall back to CPU. The only way to know is to profile with nvprof or nsys. Window functions were particularly sneaky - they looked like they worked but ran on CPU.
Issue #4: Determinism issues with concurrent GPU access
If multiple processes try using the GPU simultaneously, you get non-deterministic results or crashes. I added a simple file lock:
import fcntl

def with_gpu_lock(func):
    def wrapper(*args, **kwargs):
        # exclusive file lock so only one process touches the GPU at a time
        with open('/tmp/gpu.lock', 'w') as f:
            fcntl.flock(f, fcntl.LOCK_EX)
            try:
                return func(*args, **kwargs)
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)
    return wrapper
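The lock then wraps whatever entry point actually touches the GPU, for example (run_nightly_etl is a stand-in for your own job function):
@with_gpu_lock
def run_nightly_etl(path: str) -> pl.DataFrame:
    # only one of these runs against the GPU at a time
    processor = GPUDataProcessor(use_gpu=True)
    return processor.process_transactions(path)

result = run_nightly_etl("daily_transactions.parquet")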
When NOT to Use GPU Backend
Real talk - this isn't always the answer:
- Small datasets (under 1M rows): CPU is faster due to transfer overhead
- String-heavy operations: regex, splitting, etc are slower on GPU
- Ad-hoc queries: the setup overhead isn't worth it for one-off analyses
- Limited GPU memory: better to optimize your CPU code than deal with OOM errors
- Production stability concerns: the cuDF backend is still marked experimental
Cost Analysis (The Part My Manager Cared About)
Before GPU optimization:
- EC2 instance: r5.4xlarge (16 vCPU, 128GB RAM)
- Runtime: 45 minutes
- Cost per run: ~$0.80
- Daily cost: $0.80
After GPU optimization:
- EC2 instance: g4dn.xlarge (4 vCPU, 16GB RAM, T4 GPU)
- Runtime: 12 minutes
- Cost per run: ~$0.18
- Daily cost: $0.18
Savings: 77.5% cost reduction just from using the right tool for the job.
Plus the g4dn instance had way more headroom, so we could run multiple jobs concurrently. Win-win.
My Current Workflow
Here's what I actually do now:
- Develop on CPU with small data samples (Polars regular mode)
- Test with GPU on 10% sample to verify it works
- Run full dataset with GPU backend
- Monitor GPU utilization with nvidia-smi during runs
- Fall back to CPU if GPU shows <30% utilization (means transfer overhead is dominating)
The fallback logic has saved me multiple times when AWS had GPU shortages or when I'm working on my laptop without a GPU.
Conclusion
So yeah, enabling cuDF backend in Polars gave me a 73% speedup without changing my code logic or buying new hardware. The setup is annoying, there are definitely gotchas, but for heavy ETL workloads it's absolutely worth it.
Would I recommend this for everyone? No. If you're processing CSVs with 50k rows, stick with regular Polars. But if you're dealing with tens of millions of rows and have access to a GPU (even a cheap one), this is a no-brainer.
The best part? When Polars eventually makes GPU support non-experimental and handles all the edge cases, I can just remove the fallback logic and everything will be faster by default.
Feel free to steal my code. I spent way too many late nights figuring this out so you don't have to.