
This skill speeds large CSV processing and ETL pipelines by leveraging Polars lazy evaluation, parallel execution, and expression-based queries.

npx playbooks add skill eyadsibai/ltk --skill polars

---
name: polars
description: Use when the user mentions "Polars", "fast dataframe", "lazy evaluation", or "Arrow backend", or asks about a "pandas alternative", "parallel dataframe", "large CSV processing", "ETL pipeline", or the "expression API"
version: 1.0.0
---

# Polars Fast DataFrame Library

Lightning-fast DataFrame library with lazy evaluation and parallel execution.

## When to Use

- Pandas is too slow for your dataset
- Working with 1-100GB datasets that fit in RAM
- Need lazy evaluation for query optimization
- Building ETL pipelines
- Want parallel execution without extra config

---

## Lazy vs Eager Evaluation

| Mode | Function | Executes | Use Case |
|------|----------|----------|----------|
| **Eager** | `read_csv()` | Immediately | Small data, exploration |
| **Lazy** | `scan_csv()` | On `.collect()` | Large data, pipelines |

**Key concept**: Lazy mode builds a query plan that gets optimized before execution. The optimizer applies predicate pushdown (filter early) and projection pushdown (select columns early).
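
A minimal sketch of both modes, assuming a hypothetical `sales.csv` with `region` and `amount` columns:

```python
import polars as pl

# Eager: reads the whole file into memory immediately
df = pl.read_csv("sales.csv")

# Lazy: only builds a query plan; nothing is read yet
lazy = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 100)   # predicate pushdown: applied while reading
    .select(["region", "amount"])     # projection pushdown: only these columns are read
)

print(lazy.explain())    # inspect the optimized plan
result = lazy.collect()  # execute the optimized plan
```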

---

## Core Operations

### Data Selection

| Operation | Purpose |
|-----------|---------|
| `select()` | Choose columns |
| `filter()` | Choose rows by condition |
| `with_columns()` | Add/modify columns |
| `drop()` | Remove columns |
| `head(n)` / `tail(n)` | First/last n rows |
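
A small example combining these operations on an in-memory frame (column names are illustrative):

```python
import polars as pl

df = pl.DataFrame({
    "name": ["a", "b", "c"],
    "price": [10.0, 25.5, 7.25],
    "qty": [3, 1, 8],
})

out = (
    df.filter(pl.col("qty") > 1)                                        # choose rows
      .with_columns((pl.col("price") * pl.col("qty")).alias("total"))   # add a column
      .select(["name", "total"])                                        # choose columns
      .head(5)
)
```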

### Aggregation

| Operation | Purpose |
|-----------|---------|
| `group_by().agg()` | Group and aggregate |
| `pivot()` | Reshape wide |
| `melt()` (`unpivot()` in newer versions) | Reshape long |
| `unique()` | Distinct values |
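
A sketch of grouping, distinct values, and pivoting (note that the `pivot` keyword `on` was called `columns` in older Polars releases):

```python
import polars as pl

df = pl.DataFrame({
    "region": ["east", "east", "west", "west"],
    "product": ["a", "b", "a", "b"],
    "sales": [100, 150, 200, 50],
})

# Group and aggregate
summary = df.group_by("region").agg(
    pl.col("sales").sum().alias("total_sales"),
    pl.col("sales").mean().alias("avg_sales"),
)

# Distinct regions
regions = df.select("region").unique()

# Reshape wide: one column per product ("on" is "columns" in older Polars releases)
wide = df.pivot(values="sales", index="region", on="product")
```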

### Joins

| Join Type | Description |
|-----------|-------------|
| **inner** | Matching rows only |
| **left** | All left + matching right |
| **outer** (`full` in newer Polars) | All rows from both |
| **cross** | Cartesian product |
| **semi** | Left rows with match |
| **anti** | Left rows without match |
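
A quick illustration with two hypothetical tables:

```python
import polars as pl

orders = pl.DataFrame({"customer_id": [1, 2, 3], "amount": [50, 75, 20]})
customers = pl.DataFrame({"customer_id": [1, 2, 4], "name": ["Ann", "Bo", "Cy"]})

inner = orders.join(customers, on="customer_id", how="inner")  # matching rows only
left = orders.join(customers, on="customer_id", how="left")    # keep all orders
anti = orders.join(customers, on="customer_id", how="anti")    # orders with no matching customer
```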

---

## Expression API

**Key concept**: Polars uses expressions (`pl.col()`) instead of indexing. Expressions are lazily evaluated and optimized.

### Common Expressions

| Expression | Purpose |
|------------|---------|
| `pl.col("name")` | Reference column |
| `pl.lit(value)` | Literal value |
| `pl.all()` | All columns |
| `pl.exclude(...)` | All except |
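
A short sketch of these building blocks (column names are made up for illustration):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2], "b": [3, 4], "label": ["x", "y"]})

doubled = df.select(pl.exclude("label") * 2)                  # every column except "label"
tagged = df.with_columns(pl.lit("batch-1").alias("source"))   # constant column from a literal
totals = df.select(pl.col("a", "b").sum())                    # explicit column references
```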

### Expression Methods

| Category | Methods |
|----------|---------|
| **Aggregation** | `.sum()`, `.mean()`, `.min()`, `.max()`, `.count()` |
| **String** | `.str.contains()`, `.str.replace()`, `.str.to_lowercase()` |
| **DateTime** | `.dt.year()`, `.dt.month()`, `.dt.day()` |
| **Conditional** | `.when().then().otherwise()` |
| **Window** | `.over()`, `.rolling_mean()`, `.shift()` |
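
A combined sketch of string, datetime, conditional, and window expressions, assuming hypothetical `city`, `temp`, and `day` columns:

```python
import polars as pl
from datetime import date

df = pl.DataFrame({
    "city": ["Oslo", "Lima", "Oslo"],
    "temp": [3.0, 21.5, -2.0],
    "day": [date(2024, 1, 1), date(2024, 1, 1), date(2024, 1, 2)],
})

out = df.with_columns(
    pl.col("city").str.to_lowercase().alias("city_lc"),
    pl.col("day").dt.month().alias("month"),
    pl.when(pl.col("temp") < 0)
      .then(pl.lit("freezing"))
      .otherwise(pl.lit("mild"))
      .alias("label"),
    pl.col("temp").mean().over("city").alias("city_avg"),  # window: mean per city
)
```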

---

## Pandas Migration

| Pandas | Polars |
|--------|--------|
| `df['col']` | `df.select('col')` |
| `df[df['col'] > 5]` | `df.filter(pl.col('col') > 5)` |
| `df['new'] = df['col'] * 2` | `df.with_columns((pl.col('col') * 2).alias('new'))` |
| `df.groupby('col').mean()` | `df.group_by('col').agg(pl.all().mean())` |
| `df.apply(func)` | `df.map_rows(func)` (avoid if possible) |

**Key concept**: Polars prefers explicit operations over implicit indexing. Use `.alias()` to name computed columns.
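
A side-by-side sketch of the translations above, using illustrative column names:

```python
import polars as pl

df = pl.DataFrame({"col": [1, 6, 9], "group": ["a", "b", "b"]})

# pandas: df[df['col'] > 5]
filtered = df.filter(pl.col("col") > 5)

# pandas: df['new'] = df['col'] * 2
df = df.with_columns((pl.col("col") * 2).alias("new"))

# pandas: df.groupby('group').mean()
means = df.group_by("group").agg(pl.all().mean())
```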

---

## File I/O

| Format | Read | Write | Notes |
|--------|------|-------|-------|
| **CSV** | `read_csv()` / `scan_csv()` | `write_csv()` | Human readable |
| **Parquet** | `read_parquet()` / `scan_parquet()` | `write_parquet()` | Fast, compressed |
| **JSON / NDJSON** | `read_json()` / `scan_ndjson()` | `write_json()` / `write_ndjson()` | `scan_ndjson()` requires newline-delimited JSON |
| **IPC/Arrow** | `read_ipc()` / `scan_ipc()` | `write_ipc()` | Zero-copy |

**Key concept**: Use Parquet for performance. Use `scan_*` for large files to enable lazy optimization.
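
A minimal I/O sketch (file names and the `status` column are hypothetical):

```python
import polars as pl

# Lazy scan keeps pushdown optimizations available
lazy = pl.scan_parquet("events.parquet").filter(pl.col("status") == "ok")

# Materialize in memory, then persist as Parquet
df = lazy.collect()
df.write_parquet("events_ok.parquet")

# One-off conversion: CSV in, Parquet out
pl.read_csv("raw.csv").write_parquet("raw.parquet")
```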

---

## Performance Tips

| Tip | Why |
|-----|-----|
| Use lazy mode | Query optimization |
| Use Parquet | Column-oriented, compressed |
| Select columns early | Projection pushdown |
| Filter early | Predicate pushdown |
| Avoid Python UDFs | Breaks parallelism |
| Use expressions | Vectorized operations |
| Set dtypes on read | Avoid inference overhead |
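
A sketch that applies several of these tips at once, assuming a large hypothetical `big.csv`. Note that the dtype-override keyword is `schema_overrides` in recent Polars and was `dtypes` in older releases:

```python
import polars as pl

# Declaring dtypes up front skips schema inference on a large CSV
lazy = pl.scan_csv(
    "big.csv",
    schema_overrides={"user_id": pl.Int64, "amount": pl.Float64},
)

result = (
    lazy.select(["user_id", "amount"])  # projection pushdown
        .filter(pl.col("amount") > 0)   # predicate pushdown
        .collect()
)
```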

---

## vs Alternatives

| Tool | Best For | Limitations |
|------|----------|-------------|
| **Polars** | 1-100GB, speed critical | Must fit in RAM |
| **Pandas** | Small data, ecosystem | Slow, memory hungry |
| **Dask** | Larger than RAM | More complex API |
| **Spark** | Cluster computing | Infrastructure overhead |
| **DuckDB** | SQL interface | Different API style |

## Resources

- Docs: <https://pola.rs/>
- User Guide: <https://docs.pola.rs/user-guide/>
- Cookbook: <https://docs.pola.rs/user-guide/misc/cookbook/>

## Overview

This skill provides guidance for using Polars, a lightning-fast DataFrame library with lazy evaluation and parallel execution. It focuses on when to pick Polars over other tools, how its lazy expression API works, and practical tips for high-performance ETL and data processing. Use it to migrate pandas workflows or to process large CSV/Parquet datasets efficiently.

## How this skill works

Polars builds query plans in lazy mode (`scan_*`) that the optimizer rewrites before execution, applying predicate and projection pushdown to minimize work. Its core operations are expressed with a composable expression API (`pl.col()`, `pl.lit()`, `.when()/.otherwise()`) that keeps computations vectorized and parallel. For small-scale exploration you can use eager APIs (`read_csv()`, `select()`), while for scalable pipelines you use lazy scanning and `.collect()` to materialize results.

## When to use it

- Your pandas workflows are too slow on medium-to-large datasets (1–100 GB that fit in RAM).
- Building ETL pipelines where query optimization and pushdown reduce I/O and compute.
- Processing large CSV, Parquet, or Arrow/IPC files and needing fast parallel reads and writes.
- You want a pandas alternative with a concise expression API and strong performance.
- You need columnar, memory-efficient I/O (Parquet/Arrow) and zero-copy support.

## Best practices

- Prefer lazy APIs (`scan_csv`/`scan_parquet`) for large files and call `.collect()` only once after building the query.
- Select required columns early and filter rows early to benefit from projection and predicate pushdown.
- Use Parquet or Arrow for repeated reads and faster I/O; reserve CSV for ingestion or small files.
- Avoid Python UDFs and `map_rows` when possible; use expression operators to keep operations parallelized.
- Set dtypes on read to skip costly inference and reduce memory spikes.

## Example use cases

- ETL pipeline: scan multiple input files lazily, apply transformations with expressions, aggregate, and write Parquet (see the sketch after this list).
- Large CSV processing: `scan_csv` -> filter and select -> `.collect()` to materialize a reduced dataset.
- Pandas migration: convert index/column logic to `select`/`filter`/`with_columns` using `pl.col()` and `.alias()`.
- Analytics: `group_by().agg()` with window functions for rolling metrics at high speed.
- Joining multiple tables: perform inner/left/anti joins using expression-driven keys with efficient parallel execution.
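
A minimal ETL sketch combining these patterns, assuming hypothetical daily CSV exports with `region` and `amount` columns:

```python
import polars as pl

result = (
    pl.scan_csv("exports/orders_*.csv")   # scan many files lazily via a glob
      .filter(pl.col("amount") > 0)
      .group_by("region")
      .agg(pl.col("amount").sum().alias("revenue"))
      .collect()
)

result.write_parquet("daily_revenue.parquet")
```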

## FAQ

**When should I use lazy vs eager mode?**

Use eager for quick exploration and small files; use lazy (`scan_*`) for large datasets and pipelines so the optimizer can reduce work before execution.

**How do I get the best I/O performance?**

Prefer Parquet or IPC/Arrow for reads and writes, select columns early, and avoid Python UDFs to keep parallelism and zero-copy benefits.