
This skill speeds large CSV processing and ETL pipelines by leveraging Polars lazy evaluation, parallel execution, and expression-based queries.

npx playbooks add skill eyadsibai/ltk --skill polars

---
name: polars
description: Use when the user mentions "Polars", "fast dataframe", "lazy evaluation", or "Arrow backend", or asks about a "pandas alternative", "parallel dataframe", "large CSV processing", "ETL pipeline", or the "expression API"
version: 1.0.0
---

# Polars Fast DataFrame Library

Lightning-fast DataFrame library with lazy evaluation and parallel execution.

## When to Use

- Pandas is too slow for your dataset
- Working with 1-100GB datasets that fit in RAM
- Need lazy evaluation for query optimization
- Building ETL pipelines
- Want parallel execution without extra config

---

## Lazy vs Eager Evaluation

| Mode | Function | Executes | Use Case |
|------|----------|----------|----------|
| **Eager** | `read_csv()` | Immediately | Small data, exploration |
| **Lazy** | `scan_csv()` | On `.collect()` | Large data, pipelines |

**Key concept**: Lazy mode builds a query plan that gets optimized before execution. The optimizer applies predicate pushdown (filter early) and projection pushdown (select columns early).
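
A minimal sketch of both modes, assuming a hypothetical `sales.csv` with `region` and `amount` columns:

```python
import polars as pl

# Eager: reads the whole file into memory immediately
df = pl.read_csv("sales.csv")

# Lazy: only builds a query plan; nothing is read yet
lazy = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 100)   # predicate pushdown: applied while reading
    .select(["region", "amount"])     # projection pushdown: only these columns are read
)

print(lazy.explain())    # inspect the optimized plan
result = lazy.collect()  # execute the optimized plan
```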

---

## Core Operations

### Data Selection

| Operation | Purpose |
|-----------|---------|
| `select()` | Choose columns |
| `filter()` | Choose rows by condition |
| `with_columns()` | Add/modify columns |
| `drop()` | Remove columns |
| `head(n)` / `tail(n)` | First/last n rows |
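
A small example combining these operations on an in-memory frame (column names are illustrative):

```python
import polars as pl

df = pl.DataFrame({
    "name": ["a", "b", "c"],
    "price": [10.0, 25.5, 7.25],
    "qty": [3, 1, 8],
})

out = (
    df.filter(pl.col("qty") > 1)                                        # choose rows
      .with_columns((pl.col("price") * pl.col("qty")).alias("total"))   # add a column
      .select(["name", "total"])                                        # choose columns
      .head(5)
)
```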

### Aggregation

| Operation | Purpose |
|-----------|---------|
| `group_by().agg()` | Group and aggregate |
| `pivot()` | Reshape wide |
| `melt()` (`unpivot()` in newer versions) | Reshape long |
| `unique()` | Distinct values |
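
A sketch of grouping, distinct values, and pivoting (note that the `pivot` keyword `on` was called `columns` in older Polars releases):

```python
import polars as pl

df = pl.DataFrame({
    "region": ["east", "east", "west", "west"],
    "product": ["a", "b", "a", "b"],
    "sales": [100, 150, 200, 50],
})

# Group and aggregate
summary = df.group_by("region").agg(
    pl.col("sales").sum().alias("total_sales"),
    pl.col("sales").mean().alias("avg_sales"),
)

# Distinct regions
regions = df.select("region").unique()

# Reshape wide: one column per product ("on" is "columns" in older Polars releases)
wide = df.pivot(values="sales", index="region", on="product")
```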

### Joins

| Join Type | Description |
|-----------|-------------|
| **inner** | Matching rows only |
| **left** | All left + matching right |
| **outer** (`full` in newer Polars) | All rows from both |
| **cross** | Cartesian product |
| **semi** | Left rows with match |
| **anti** | Left rows without match |
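
A quick illustration with two hypothetical tables:

```python
import polars as pl

orders = pl.DataFrame({"customer_id": [1, 2, 3], "amount": [50, 75, 20]})
customers = pl.DataFrame({"customer_id": [1, 2, 4], "name": ["Ann", "Bo", "Cy"]})

inner = orders.join(customers, on="customer_id", how="inner")  # matching rows only
left = orders.join(customers, on="customer_id", how="left")    # keep all orders
anti = orders.join(customers, on="customer_id", how="anti")    # orders with no matching customer
```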

---

## Expression API

**Key concept**: Polars uses expressions (`pl.col()`) instead of indexing. Expressions are lazily evaluated and optimized.

### Common Expressions

| Expression | Purpose |
|------------|---------|
| `pl.col("name")` | Reference column |
| `pl.lit(value)` | Literal value |
| `pl.all()` | All columns |
| `pl.exclude(...)` | All except |
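
A short sketch of these building blocks (column names are made up for illustration):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2], "b": [3, 4], "label": ["x", "y"]})

doubled = df.select(pl.exclude("label") * 2)                  # every column except "label"
tagged = df.with_columns(pl.lit("batch-1").alias("source"))   # constant column from a literal
totals = df.select(pl.col("a", "b").sum())                    # explicit column references
```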

### Expression Methods

| Category | Methods |
|----------|---------|
| **Aggregation** | `.sum()`, `.mean()`, `.min()`, `.max()`, `.count()` |
| **String** | `.str.contains()`, `.str.replace()`, `.str.to_lowercase()` |
| **DateTime** | `.dt.year()`, `.dt.month()`, `.dt.day()` |
| **Conditional** | `.when().then().otherwise()` |
| **Window** | `.over()`, `.rolling_mean()`, `.shift()` |
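
A combined sketch of string, datetime, conditional, and window expressions, assuming hypothetical `city`, `temp`, and `day` columns:

```python
import polars as pl
from datetime import date

df = pl.DataFrame({
    "city": ["Oslo", "Lima", "Oslo"],
    "temp": [3.0, 21.5, -2.0],
    "day": [date(2024, 1, 1), date(2024, 1, 1), date(2024, 1, 2)],
})

out = df.with_columns(
    pl.col("city").str.to_lowercase().alias("city_lc"),
    pl.col("day").dt.month().alias("month"),
    pl.when(pl.col("temp") < 0)
      .then(pl.lit("freezing"))
      .otherwise(pl.lit("mild"))
      .alias("label"),
    pl.col("temp").mean().over("city").alias("city_avg"),  # window: mean per city
)
```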

---

## Pandas Migration

| Pandas | Polars |
|--------|--------|
| `df['col']` | `df.select('col')` |
| `df[df['col'] > 5]` | `df.filter(pl.col('col') > 5)` |
| `df['new'] = df['col'] * 2` | `df.with_columns((pl.col('col') * 2).alias('new'))` |
| `df.groupby('col').mean()` | `df.group_by('col').agg(pl.all().mean())` |
| `df.apply(func)` | `df.map_rows(func)` (avoid if possible) |

**Key concept**: Polars prefers explicit operations over implicit indexing. Use `.alias()` to name computed columns.
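
A side-by-side sketch of the translations above, using illustrative column names:

```python
import polars as pl

df = pl.DataFrame({"col": [1, 6, 9], "group": ["a", "b", "b"]})

# pandas: df[df['col'] > 5]
filtered = df.filter(pl.col("col") > 5)

# pandas: df['new'] = df['col'] * 2
df = df.with_columns((pl.col("col") * 2).alias("new"))

# pandas: df.groupby('group').mean()
means = df.group_by("group").agg(pl.all().mean())
```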

---

## File I/O

| Format | Read | Write | Notes |
|--------|------|-------|-------|
| **CSV** | `read_csv()` / `scan_csv()` | `write_csv()` | Human readable |
| **Parquet** | `read_parquet()` / `scan_parquet()` | `write_parquet()` | Fast, compressed |
| **JSON / NDJSON** | `read_json()` / `scan_ndjson()` | `write_json()` / `write_ndjson()` | `scan_ndjson()` requires newline-delimited JSON |
| **IPC/Arrow** | `read_ipc()` / `scan_ipc()` | `write_ipc()` | Zero-copy |

**Key concept**: Use Parquet for performance. Use `scan_*` for large files to enable lazy optimization.
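
A minimal I/O sketch (file names and the `status` column are hypothetical):

```python
import polars as pl

# Lazy scan keeps pushdown optimizations available
lazy = pl.scan_parquet("events.parquet").filter(pl.col("status") == "ok")

# Materialize in memory, then persist as Parquet
df = lazy.collect()
df.write_parquet("events_ok.parquet")

# One-off conversion: CSV in, Parquet out
pl.read_csv("raw.csv").write_parquet("raw.parquet")
```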

---

## Performance Tips

| Tip | Why |
|-----|-----|
| Use lazy mode | Query optimization |
| Use Parquet | Column-oriented, compressed |
| Select columns early | Projection pushdown |
| Filter early | Predicate pushdown |
| Avoid Python UDFs | Breaks parallelism |
| Use expressions | Vectorized operations |
| Set dtypes on read | Avoid inference overhead |
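
A sketch that applies several of these tips at once, assuming a large hypothetical `big.csv`. Note that the dtype-override keyword is `schema_overrides` in recent Polars and was `dtypes` in older releases:

```python
import polars as pl

# Declaring dtypes up front skips schema inference on a large CSV
lazy = pl.scan_csv(
    "big.csv",
    schema_overrides={"user_id": pl.Int64, "amount": pl.Float64},
)

result = (
    lazy.select(["user_id", "amount"])  # projection pushdown
        .filter(pl.col("amount") > 0)   # predicate pushdown
        .collect()
)
```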

---

## vs Alternatives

| Tool | Best For | Limitations |
|------|----------|-------------|
| **Polars** | 1-100GB, speed critical | Must fit in RAM |
| **Pandas** | Small data, ecosystem | Slow, memory hungry |
| **Dask** | Larger than RAM | More complex API |
| **Spark** | Cluster computing | Infrastructure overhead |
| **DuckDB** | SQL interface | Different API style |

## Resources

- Docs: <https://pola.rs/>
- User Guide: <https://docs.pola.rs/user-guide/>
- Cookbook: <https://docs.pola.rs/user-guide/misc/cookbook/>

## Overview

This skill provides guidance for using Polars, a lightning-fast DataFrame library with lazy evaluation and parallel execution. It focuses on when to pick Polars over other tools, how its lazy expression API works, and practical tips for high-performance ETL and data processing. Use it to migrate pandas workflows or to process large CSV/Parquet datasets efficiently.

## How this skill works

Polars builds query plans in lazy mode (`scan_*`) that the optimizer rewrites before execution, applying predicate and projection pushdown to minimize work. Its core operations are expressed with a composable expression API (`pl.col()`, `pl.lit()`, `.when()/.otherwise()`) that keeps computations vectorized and parallel. For small-scale exploration you can use eager APIs (`read_csv()`, `select()`), while for scalable pipelines you use lazy scanning and `.collect()` to materialize results.

## When to use it

- Your pandas workflows are too slow on medium-to-large datasets (1–100 GB that fit in RAM).
- Building ETL pipelines where query optimization and pushdown reduce I/O and compute.
- Processing large CSV, Parquet, or Arrow/IPC files and needing fast parallel reads and writes.
- You want a pandas alternative with a concise expression API and strong performance.
- You need columnar, memory-efficient I/O (Parquet/Arrow) and zero-copy support.

## Best practices

- Prefer lazy APIs (`scan_csv`/`scan_parquet`) for large files and call `.collect()` only once after building the query.
- Select required columns early and filter rows early to benefit from projection and predicate pushdown.
- Use Parquet or Arrow for repeated reads and faster I/O; reserve CSV for ingestion or small files.
- Avoid Python UDFs and `map_rows` when possible; use expression operators to keep operations parallelized.
- Set dtypes on read to skip costly inference and reduce memory spikes.

## Example use cases

- ETL pipeline: scan multiple input files lazily, apply transformations with expressions, aggregate, and write Parquet (see the sketch after this list).
- Large CSV processing: `scan_csv` -> filter and select -> `.collect()` to materialize a reduced dataset.
- Pandas migration: convert index/column logic to `select`/`filter`/`with_columns` using `pl.col()` and `.alias()`.
- Analytics: `group_by().agg()` with window functions for rolling metrics at high speed.
- Joining multiple tables: perform inner/left/anti joins using expression-driven keys with efficient parallel execution.
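
A minimal ETL sketch combining these patterns, assuming hypothetical daily CSV exports with `region` and `amount` columns:

```python
import polars as pl

result = (
    pl.scan_csv("exports/orders_*.csv")   # scan many files lazily via a glob
      .filter(pl.col("amount") > 0)
      .group_by("region")
      .agg(pl.col("amount").sum().alias("revenue"))
      .collect()
)

result.write_parquet("daily_revenue.parquet")
```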

## FAQ

**When should I use lazy vs eager mode?**

Use eager for quick exploration and small files; use lazy (`scan_*`) for large datasets and pipelines so the optimizer can reduce work before execution.

**How do I get the best I/O performance?**

Prefer Parquet or IPC/Arrow for reads and writes, select columns early, and avoid Python UDFs to keep parallelism and zero-copy benefits.