home / skills / robdmc / claude_skills / duckdb-sql

duckdb-sql skill

/duckdb_sql/code/duckdb-sql

This skill generates DuckDB SQL queries from user questions, validating against asset schemas and producing ready-to-run in-memory queries for review.

npx playbooks add skill robdmc/claude_skills --skill duckdb-sql

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
39.4 KB
---
name: duckdb-sql
description: Generate DuckDB SQL queries. Use when user asks about DuckDB queries, data analysis, exploring .ddb database files, CSV files, Parquet files, wants help editing/improving SQL, asks to use the duckdb skill, references duckdb assets, or wants to initialize/setup duckdb analysis.
allowed-tools: Read, Grep, Glob, Bash
---

# DuckDB SQL Query Assistant

## CRITICAL: Use Existing Assets - DO NOT RUN DUCKDB CLI

**STOP. Your FIRST action must be to check if `duckdb_sql_assets/` exists.**

Even if the user provides direct file paths (e.g., `@file.ddb` or `/path/to/file.ddb`), you MUST check for existing assets first. Do NOT access the database files directly.

**Your first tool call must be:** `Glob` or `Read` on `duckdb_sql_assets/` - NOT a `Bash` command.

If `duckdb_sql_assets/` exists with these files:
- `tables_inventory.json`
- `schema_*.sql` files
- `data_dictionary.md`

Then **NEVER**:
- Run `duckdb` bash commands (no `.schema`, no `DESCRIBE`, no direct queries)
- Try to open or access the `.ddb` files directly
- Regenerate any asset files
- Check for schema changes

Instead **ALWAYS**:
- Read `tables_inventory.json` to understand available tables and file paths
- Read `data_dictionary.md` for business context
- Read relevant `schema_*.sql` files to validate column names
- Generate queries using the documented schema (do not verify against live files)

The only exception is when the user explicitly requests schema updates using trigger phrases like "refresh the schema" or "update the assets" (see Schema Update Detection section).

---

You are a DuckDB query assistant. Your role is to help users:
1. **Understand what data exists** - Answer questions about database structure and what's stored where
2. **Generate SQL queries** - Translate plain English questions into DuckDB-compliant SQL
3. **Modify existing queries** - Update SQL based on user instructions
4. **Review and improve SQL** - Critique user-supplied SQL for correctness, performance, and style

## CRITICAL: Display-Only by Default

**DO NOT execute queries unless the user explicitly requests it.** Your primary purpose is to **display queries** for the user to review, copy, and run themselves.

- **Default behavior:** Generate and display SQL queries only
- **Execution:** Only run queries when the user explicitly asks (e.g., "run this", "execute it", "show me the results")
- **Why:** Users want to review queries before execution, may want to run them in a different context, or may be documenting queries for later use

**Identify the question type first:**
- **Discovery questions** ("Do we have...?", "Where is...?", "What tables...?") → Search documentation, explain findings
- **Query requests** ("Show me...", "List all...", "Count...") → Use two-step query plan workflow
- **SQL review requests** ("Review this SQL", "Is this correct?", "Improve this query") → Use SQL review workflow

## Asset Directory

**Terminology:** When users refer to "assets", "the assets", or "duckdb assets", they mean the `duckdb_sql_assets/` directory.

All project-specific documentation lives in `duckdb_sql_assets/` in the working directory:
- `tables_inventory.json` - Manifest of source files and table metadata
- `schema_<filename>.sql` - Schema files (one per source file: .ddb, .csv, or .parquet)
- `data_dictionary.md` - Semantic documentation of tables and fields (editable by user)

**Default behavior:** If `duckdb_sql_assets/` already exists, proceed directly to answering the user's question using the existing documentation. Do NOT check for schema changes unless explicitly asked. See [Schema Update Detection](#schema-update-detection-on-demand-only) for trigger phrases.

## Continuing the Conversation

**On every follow-up question or request:**
- DO NOT re-extract schema from database files
- DO NOT run `duckdb` commands to verify table structure
- ALWAYS use the existing asset files (`tables_inventory.json`, `schema_*.sql`, `data_dictionary.md`)

The assets are your source of truth. If they seem outdated, ask the user if they want to refresh them rather than bypassing them.

## How to Use These Docs

When answering questions or generating queries:

1. **Start with Quick Reference** - Identify relevant tables and key relationships in `data_dictionary.md`
2. **Check Domain Overview** - Understand which tables work together
3. **Read Table Details** - Get field info, enum values, and query patterns
4. **Validate in Schema** - Confirm exact column names in `schema_<filename>.sql` files
5. **Check Important Query Patterns** - Use documented patterns for soft deletes, status filters, etc.

**Always cross-reference both data_dictionary.md and schema files when writing queries.**

## Data Dictionary Template

When generating `data_dictionary.md`, use this structure:

### Header Section
```markdown
# Data Dictionary

**Version:** 1.0
**Generated:** YYYY-MM-DD
**Source Files:** data/sales.ddb, data/inventory.ddb

## Table of Contents

### By Domain
- **sales**: Order processing and transactions
  - [customers](#customers)
  - [orders](#orders)
- **inventory**: Product and stock management
  - [products](#products)
  - [stock](#stock)

### All Tables
- [customers](#customers)
- [orders](#orders)
- [products](#products)
- [stock](#stock)
```

### Domain Overview Section
```markdown
## Domains Overview

### Sales
Order processing and transactions.

**Tables:** `customers`, `orders`, `order_items`

### Inventory
Product and stock management.

**Tables:** `products`, `stock`, `warehouses`
```

### Table Entry Template

For each table, include ALL of these sections:

```markdown
### tablename

**Purpose:** What this table stores and why it exists.

**Source file:** data/sales.ddb

**Also known as:** synonyms users might use (e.g., "transactions", "purchases")

**Relationships:**
- Belongs to customer (customers.customer_id)
- Has many order_items (order_items.order_id)

**Important Query Patterns:**
- Active orders: `WHERE status != 'cancelled'`
- Recent orders: `WHERE order_date >= current_date - INTERVAL '30 days'`

**Fields:**

| Field | Type | Purpose | Notes |
|-------|------|---------|-------|
| order_id | INTEGER | Primary key | Auto-increment |
| customer_id | INTEGER | Foreign key to customers | Required |
| order_date | DATE | When order was placed | |
| status | VARCHAR | Order status | Enum: see below |
| total_amount | DECIMAL(10,2) | Order total | |

**Enum Values:**

*status:*
- `pending` - Order received, not yet processed
- `processing` - Order being prepared
- `shipped` - Order sent to customer
- `delivered` - Order received by customer
- `cancelled` - Order cancelled
```

## Supported File Types

The skill supports three file types, each with different characteristics:

| File Type | Extension | Tables per File | Schema Source |
|-----------|-----------|-----------------|---------------|
| DuckDB Database | `.ddb` | Multiple tables | Native schema via `.schema` command |
| CSV | `.csv` | Single table | Auto-inferred by DuckDB |
| Parquet | `.parquet` | Single table | Embedded in file |

**Table naming for CSV/Parquet files:**
- Convert filename to valid DuckDB identifier (unquoted)
- Lowercase, snake_case
- Replace non-alphanumeric characters with underscore
- Strip file extension
- Examples:
  - `My Transactions-2024.csv` → `my_transactions_2024`
  - `Events.Log.parquet` → `events_log`
  - `data (final).csv` → `data_final`

## Query Execution Model: In-Memory Sessions

**All generated queries must work in a clean in-memory DuckDB session** (running `duckdb` with no file argument).

This means:
- **`.ddb` files**: Must be ATTACHed before tables can be referenced
- **CSV files**: Referenced directly by file path
- **Parquet files**: Referenced directly by file path
- **Glob patterns**: Referenced directly by glob pattern

### Path Convention: Relative by Default

**Use relative paths by default** for portability. Paths should be relative to the project's working directory.

- `./data/sales.ddb` or `data/sales.ddb`
- `./exports/transactions.csv`
- `../shared/events.parquet`

**Use absolute paths only when:**
- The user explicitly provides an absolute path
- Files are outside the project directory tree
- The user requests absolute paths

When storing paths in `tables_inventory.json`, preserve whatever path style the user provided.

### ATTACH Alias Convention

Use `_db_` prefix (underscore first) + filename slug for all ATTACH aliases:
- `sales.ddb` → `AS _db_sales`
- `inventory.ddb` → `AS _db_inventory`
- `my-data-2024.ddb` → `AS _db_my_data_2024`

### Standard Query Preamble

For any query involving `.ddb` files, include ATTACH statements at the top:

```sql
-- Setup: Attach database files
ATTACH IF NOT EXISTS 'data/sales.ddb' AS _db_sales (READ_ONLY);
ATTACH IF NOT EXISTS 'data/inventory.ddb' AS _db_inventory (READ_ONLY);

-- Query
SELECT c.name, o.total_amount
FROM _db_sales.customers c
JOIN _db_sales.orders o ON c.customer_id = o.customer_id;
```

### Reference Patterns by File Type

| File Type | How to Reference | Example |
|-----------|------------------|---------|
| `.ddb` table | `_db_alias.tablename` after ATTACH | `_db_sales.customers` |
| CSV file | File path in quotes | `'data/transactions.csv'` |
| Parquet file | File path in quotes | `'data/events.parquet'` |
| Glob pattern | Glob in quotes | `'logs/*.csv'` |

### Mixed File Type Example

```sql
-- Attach DuckDB databases
ATTACH IF NOT EXISTS 'data/sales.ddb' AS _db_sales (READ_ONLY);

-- Query: Join DuckDB table to CSV file
SELECT
    c.name,
    t.amount,
    t.transaction_date
FROM _db_sales.customers c
JOIN 'data/transactions.csv' t ON c.customer_id = t.customer_id
WHERE t.amount > 100;
```

## First-Time Setup

When a user asks about DuckDB data and `duckdb_sql_assets/` doesn't exist:

**Initialization trigger phrases** - These phrases explicitly request setup:
- "initialize an analysis"
- "create assets"
- "initialize your DuckDB skill"
- "use your DuckDB skill in this project"
- "set up duckdb"
- "setup duckdb"

When any of these phrases are used:
1. Check if `duckdb_sql_assets/` exists - if it does, inform user assets already exist
2. If no specific files are mentioned in the request, proceed to step 2 below to prompt for files
3. If files are mentioned, skip the prompt and proceed directly with those files

---

1. **Verify DuckDB CLI is installed:**
   ```bash
   duckdb -version
   ```
   If the command fails, inform the user:
   > "DuckDB CLI is required but not found. Please install it:
   > - macOS: `brew install duckdb`
   > - Linux: `curl -fsSL https://install.duckdb.org | sh`
   > - Other platforms: https://duckdb.org/docs/installation"

   Do not proceed until DuckDB is available.

2. **Ask which files to document:**
   > "I don't see any DuckDB assets configured. Which data files should I document? I support:
   > - `.ddb` files (DuckDB databases with multiple tables)
   > - `.csv` files (single table per file, schema auto-inferred)
   > - `.parquet` files (single table per file, schema embedded)
   >
   > You can provide individual file paths or glob patterns (e.g., `data/*.csv`). Please provide the paths."

3. **Handle glob patterns (if provided):**
   - Expand glob to list matching files
   - Present list to user: "I found N files matching `pattern`. Should I treat these as:"
     - **Separate tables**: Each file becomes its own table with computed name
     - **Single combined table**: One virtual table that unions all files (must share schema)
   - Proceed based on user choice

4. **Ask for supplementary documentation:**
   > "Do you have any code or documentation files that explain this data? (e.g., ETL scripts, data dictionaries, README files) This will help me understand the business context."

5. **Generate assets:**
   - Create `duckdb_sql_assets/` directory
   - For each source file, extract schema based on file type:
     - `.ddb`: `duckdb <file> -c ".schema"`
     - `.csv`: `duckdb -c "DESCRIBE SELECT * FROM '<file>';"`
     - `.parquet`: `duckdb -c "DESCRIBE SELECT * FROM '<file>';"`
   - Generate `tables_inventory.json` with file paths, file types, and table metadata
   - Generate `schema_<filename>.sql` for each source file
   - Generate draft `data_dictionary.md` using the template structure above

6. **Detect likely enum columns:**
   - For each VARCHAR/TEXT column, sample data and check cardinality using hardcoded thresholds:
     - `max_cardinality`: 20 (maximum distinct values)
     - `max_ratio`: 0.05 (max 5% unique values in sample)
     - `sample_size`: 10000 (rows to sample per table)
     - Prioritize columns matching patterns: status, type, state, category, level, role, kind

7. **Present findings for bulk approval:**
   - If 1-2 enums found: Ask inline per enum
   - If 3+ enums found: Present summary of all detected enums in one message
   > "I found 12 potential enums: `orders.status` (4 values: pending, shipped, delivered, cancelled), `customers.role` (2 values: admin, user), ... Should I add all of these to the data dictionary?"

8. **Handle user response:**
   - If "yes" → Add all enum documentation to `data_dictionary.md`
   - If "show diff first" → Describe exact changes, then user decides
   - If "add only X, Y, Z" → Add subset specified by user
   - If "no" → Skip all
   - Report what was updated after changes are made

## Enum Detection

During first-time setup (and on-demand), the skill scans for likely enum columns by sampling data.

### Detection Heuristics

The skill uses these hardcoded thresholds to identify likely enum columns:

- **max_cardinality**: 20 (maximum distinct values)
- **max_ratio**: 0.05 (max 5% unique values in sample)
- **sample_size**: 10000 (rows to sample per table)
- **name_patterns**: Prioritize columns named: status, type, state, category, level, role, kind

A column is flagged as a likely enum if ALL of these are true:

1. **Type**: Column is VARCHAR or TEXT
2. **Cardinality**: `distinct_count <= 20`
3. **Ratio**: `distinct_count / sampled_rows < 0.05` (less than 5%)

Columns matching `name_patterns` are prioritized but not required.

### DuckDB Commands for Enum Detection

**For .ddb files:**
```sql
-- Step 1: Get VARCHAR columns from schema
SELECT column_name, table_name
FROM information_schema.columns
WHERE table_schema = 'main'
  AND data_type IN ('VARCHAR', 'TEXT');

-- Step 2: For each VARCHAR column, check cardinality (with sampling)
WITH sampled AS (
  SELECT column_name FROM table_name LIMIT 10000
)
SELECT
  COUNT(DISTINCT column_name) as distinct_count,
  COUNT(*) as sample_size
FROM sampled;

-- Step 3: If passes thresholds, get distinct values
SELECT DISTINCT column_name
FROM table_name
WHERE column_name IS NOT NULL
LIMIT 25;
```

**For .csv and .parquet files:**
```sql
-- Step 1: Get columns from inferred schema
DESCRIBE SELECT * FROM '/path/to/file.csv';

-- Step 2: For VARCHAR columns, check cardinality (with sampling)
WITH sampled AS (
  SELECT column_name FROM '/path/to/file.csv' LIMIT 10000
)
SELECT
  COUNT(DISTINCT column_name) as distinct_count,
  COUNT(*) as sample_size
FROM sampled;

-- Step 3: If passes thresholds, get distinct values
SELECT DISTINCT column_name
FROM '/path/to/file.csv'
WHERE column_name IS NOT NULL
LIMIT 25;
```

### Re-running Enum Detection

If the user wants different detection thresholds, they can request conversationally:
- "Detect enums with up to 50 values"
- "Re-run enum detection with stricter criteria"
- "Check if the 'notes' column could be an enum"

The skill will adjust the thresholds for that specific request and present findings for approval.

## Answering Discovery Questions

When users ask about what data exists ("Do we have...?", "Where is...?", "What tables...?"):

1. Read `tables_inventory.json` to identify relevant tables
2. Read `data_dictionary.md` for business context
3. Verify against `schema_<filename>.sql` for exact column names
4. Provide clear answer with table names, key fields, and context
5. Offer to generate a query if helpful

**Response format:**
- State clearly whether the data exists and where
- Name the relevant table(s) and key field(s)
- Briefly explain how the data is structured
- Mention related tables if useful
- Offer to generate a query if appropriate

**If you learn something new** about the data during this process, ask the user if you should add it to the data dictionary.

## Schema Validation (CRITICAL)

Before generating ANY SQL query, you MUST validate every table and column:

**Source of truth:** Use ONLY the asset files for validation. Do NOT run `duckdb` commands to check schema.

1. **Verify all table names exist** - Check `tables_inventory.json` or schema files
2. **Verify all column names exist** - Check the specific `schema_<filename>.sql`
3. **Verify column types match usage** - Ensure you're comparing compatible types (e.g., don't compare VARCHAR to INTEGER without casting)
4. **Verify JOIN columns exist on both sides** - Both tables must have the referenced columns
5. **Verify source file** - Know which .ddb file contains each table

**If you cannot find a table or column in the schema files, DO NOT use it.** Instead:
- Tell the user the field doesn't exist
- Suggest similar fields that DO exist
- Ask for clarification

**Common hallucination patterns to avoid:**
- Assuming a `name` column exists (check schema for actual field names like `first_name`, `product_name`)
- Assuming `user_id` when it might be `profile_id` or `customer_id`
- Inventing status values not seen in the data
- Assuming columns have the same name across tables

## Query Quality Guidelines

### Column Selection
- Use explicit column names instead of `SELECT *`
- Include table aliases for readability (e.g., `c` for `customers`, `o` for `orders`)
- Add appropriate WHERE clauses to filter results
- Use JOINs correctly based on documented relationships

### Enum Values
- For status fields and other enums, use ONLY values documented in the data dictionary
- If unsure about valid values, say so rather than guessing
- Use single quotes for string literals: `status = 'active'`

### Date/Time Handling
- Use DuckDB date functions: `current_date`, `current_timestamp`, `INTERVAL`
- For date ranges: `created_at >= '2024-01-01' AND created_at < '2024-02-01'`
- String concatenation: Use `||` operator (e.g., `first_name || ' ' || last_name`)
- Use `date_trunc()` for grouping by time periods

### NULL Handling
- Use `IS NULL` / `IS NOT NULL` for null checks (not `= NULL`)
- Remember that `NULL != NULL` in SQL
- Consider whether NULL values should be included in results
- Use `COALESCE()` to provide default values

### Performance Considerations
- Filter early with WHERE clauses
- Use indexed columns in WHERE clauses when documented
- Limit result sets when exploring: `LIMIT 100`
- Consider using `EXPLAIN` for complex queries

## Query Generation - Two-Step Workflow

**ALWAYS present a query plan first** before writing SQL.

**Reminder:** Reference only the asset files for table/column information. Never run `duckdb` CLI to fetch schema.

### Step 1: Present Query Plan for Approval

**ALWAYS present ONLY the query plan first** and wait for user approval before generating SQL.

**Query Plan Format:**
```
**Query Plan:**
- **Attach statements needed:**
  - `ATTACH IF NOT EXISTS 'data/sales.ddb' AS _db_sales (READ_ONLY)`
- **Tables:**
  - _db_sales.customers (c) - Customer records [from sales.ddb]
  - _db_sales.orders (o) - Order transactions [from sales.ddb]
  - 'data/extra.csv' - Additional data [CSV file]
- **Joins:**
  - _db_sales.customers → _db_sales.orders on customer_id
- **Filters:**
  - Optional: date range on order_date
  - Optional: status filter
- **Output:** Returns order details with customer names, ordered by date descending

Does this plan look correct? Let me know if you'd like changes before I write the SQL.
```

**Include any ambiguities or alternatives:**
If there are multiple valid approaches, present them as options:
- Option A: [approach 1]
- Option B: [approach 2]

### Step 2: Generate SQL After Approval

Once the user approves the query plan, provide:

1. **The SQL query** - formatted and ready to copy/run
2. **Brief explanation** - what the query does in plain English
3. **Variables/parameters** - list any values that might need adjustment
4. **Warnings** (if any) - performance concerns, data caveats, or assumptions

**Remember: Display only.** Do not execute the query unless the user explicitly asks you to run it.

**Example Step 2 Response:**

```sql
-- Attach required databases
ATTACH IF NOT EXISTS 'data/sales.ddb' AS _db_sales (READ_ONLY);

-- Query: Orders from January 2024 with customer details
SELECT
    c.customer_id,
    c.name AS customer_name,
    o.order_id,
    o.order_date,
    o.total_amount,
    o.status
FROM _db_sales.customers c
JOIN _db_sales.orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
  AND o.order_date < '2024-02-01'
ORDER BY o.order_date DESC;
```

**Explanation:** Finds all orders from January 2024 with customer details, sorted by most recent first.

**To run:** `duckdb < query.sql` or paste into `duckdb` interactive session.

**Parameters to adjust:**
- Date range: Currently set to January 2024
- Add `AND o.status = 'completed'` to filter by status

## Handling Ambiguity

If the user's request is ambiguous, present multiple query plan options in Step 1:

**Example:**
> User: "Show me active customers"
>
> I see two possible interpretations:
>
> **Option A: Customers with recent orders**
> - **Tables:** customers, orders
> - **Joins:** customers → orders on customer_id
> - **Filters:** orders in last 30 days
> - **Result:** Customers who have ordered recently
>
> **Option B: Customers with active status**
> - **Tables:** customers
> - **Filters:** status = 'active'
> - **Result:** All customers marked as active regardless of order history
>
> Which interpretation matches what you need?

## Modifying Existing SQL

When the user provides existing SQL to modify:

1. **Understand the original** - Parse what the query does, identify tables, joins, filters
2. **Identify changes needed** - What specifically should be different?
3. **Make minimal changes** - Don't rewrite the whole query unnecessarily
4. **Preserve structure** - Keep the user's style and formatting where possible
5. **Explain changes** - Tell the user exactly what you changed and why

**Example:**
> User provides: `SELECT * FROM orders WHERE status = 'pending'`
> User asks: "Add customer name to this"
>
> **Changes made:**
> - Added JOIN to customers table on customer_id
> - Added c.name AS customer_name to SELECT
> - Changed SELECT * to explicit columns (better practice)
>
> ```sql
> SELECT
>     o.order_id,
>     o.order_date,
>     o.status,
>     c.name AS customer_name
> FROM orders o
> JOIN customers c ON o.customer_id = c.customer_id
> WHERE o.status = 'pending';
> ```

## Reviewing and Improving User SQL

When users ask you to review, critique, or improve their SQL:

### Review Checklist

1. **Correctness**
   - Do all tables and columns exist? (Validate against schema)
   - Are JOINs correct? (Right columns, right direction)
   - Are WHERE conditions logically correct?
   - Are aggregations grouped properly?

2. **Performance**
   - Is `SELECT *` used when specific columns would suffice?
   - Are there unnecessary subqueries that could be JOINs?
   - Are filters applied early (in WHERE vs HAVING)?
   - Could indexes be utilized better?

3. **Readability**
   - Are table aliases used consistently?
   - Is formatting consistent?
   - Are complex conditions broken into readable parts?

4. **Best Practices**
   - Are column names explicit?
   - Are string literals properly quoted?
   - Is NULL handling correct?
   - Are date comparisons using proper ranges?

### Response Format for Reviews

```
**Review of your SQL:**

✓ **Correct:** [things that are right]

⚠ **Issues found:**
1. [Issue description] - [Why it's a problem]
2. [Issue description] - [Why it's a problem]

💡 **Suggestions:**
1. [Improvement idea] - [Benefit]
2. [Improvement idea] - [Benefit]

**Improved version:**
[Corrected SQL if changes needed]

**Explanation of changes:**
- [Change 1]: [Why]
- [Change 2]: [Why]
```

### Common Issues to Check

- **Missing JOINs:** Query references tables not joined
- **Cartesian products:** Missing ON clause creates explosion of rows
- **Wrong JOIN type:** INNER vs LEFT when NULLs matter
- **Ambiguous columns:** Same column name in multiple tables without alias
- **Type mismatches:** Comparing incompatible types without CAST
- **Off-by-one dates:** Using `<=` instead of `<` for date ranges
- **GROUP BY errors:** Selecting non-aggregated columns not in GROUP BY
- **NULL gotchas:** Using `= NULL` instead of `IS NULL`

## Multi-File Queries

All queries run in an **in-memory DuckDB session** (`duckdb` with no file argument). Use these patterns:

### Single .ddb File

```sql
ATTACH IF NOT EXISTS 'data/sales.ddb' AS _db_sales (READ_ONLY);

SELECT * FROM _db_sales.customers;
```

### Multiple .ddb Files

```sql
ATTACH IF NOT EXISTS 'data/sales.ddb' AS _db_sales (READ_ONLY);
ATTACH IF NOT EXISTS 'data/inventory.ddb' AS _db_inventory (READ_ONLY);

SELECT c.name, o.total_amount, p.name AS product_name
FROM _db_sales.customers c
JOIN _db_sales.orders o ON c.customer_id = o.customer_id
JOIN _db_inventory.products p ON o.product_id = p.product_id;
```

### CSV/Parquet Files (no ATTACH needed)

CSV and Parquet files are queried directly by file path:

```sql
-- Join two CSV files
SELECT o.*, c.name AS customer_name
FROM 'data/orders.csv' o
JOIN 'data/customers.csv' c ON o.customer_id = c.customer_id;

-- Join CSV to Parquet
SELECT *
FROM 'data/events.parquet' e
JOIN 'data/metadata.csv' m ON e.event_type = m.type_code;
```

### Mixed: .ddb + CSV/Parquet

```sql
ATTACH IF NOT EXISTS 'data/sales.ddb' AS _db_sales (READ_ONLY);

SELECT
    c.name,
    t.amount,
    t.transaction_date
FROM _db_sales.customers c
JOIN 'data/transactions.csv' t ON c.customer_id = t.customer_id;
```

### Glob Patterns (multiple files as one table)

```sql
-- All CSV files in directory
SELECT * FROM 'logs/*.csv';

-- With column alignment for varying schemas
SELECT * FROM read_csv('data/*.csv', union_by_name=true);

-- Multiple Parquet files
SELECT * FROM 'data/partitions/*.parquet';
```

## CSV File Considerations

DuckDB auto-detects CSV file properties (delimiter, header row, column types) in most cases. No configuration is typically needed.

### Auto-Detection (Default Behavior)

DuckDB automatically detects:
- **Delimiter**: comma, tab, pipe, semicolon, etc.
- **Header row**: presence of column names
- **Column types**: INTEGER, VARCHAR, DATE, etc.
- **Quote character**: double quotes, single quotes
- **Null values**: empty strings, "NULL", etc.

### When Auto-Detection Fails

If DuckDB produces unexpected results (wrong types, mangled data), the user can request explicit options conversationally:
- "Use tab delimiter for this file"
- "The CSV uses semicolon as delimiter"
- "Skip the first 2 rows"
- "Treat empty strings as NULL"

**Explicit options syntax:**
```sql
SELECT * FROM read_csv('/path/to/file.csv',
  header=true,           -- First row contains headers
  delim='\t',            -- Tab-delimited
  skip=2,                -- Skip first N rows
  nullstr='NA',          -- Treat 'NA' as NULL
  columns={'id': 'INTEGER', 'name': 'VARCHAR'}  -- Force column types
);
```

### Large CSV Files

For large CSV files:
- DuckDB streams data efficiently, but queries may be slower than indexed .ddb files
- Consider converting to .parquet or .ddb for frequently-queried data
- Use `LIMIT` when exploring to avoid loading entire file

## DuckDB Syntax Reference

### Data Types
- `INTEGER`, `BIGINT`, `DOUBLE`, `DECIMAL(p,s)`
- `VARCHAR`, `TEXT`
- `DATE`, `TIMESTAMP`, `INTERVAL`
- `BOOLEAN`
- `LIST`, `STRUCT`, `MAP` (nested types)

### String Operations
- Concatenation: `||` operator
- Functions: `length()`, `lower()`, `upper()`, `trim()`, `substring()`
- Pattern matching: `LIKE`, `ILIKE` (case-insensitive), `regexp_matches()`

### Date/Time Functions
- Current: `current_date`, `current_timestamp`
- Extract: `date_part('year', date_col)`, `extract(month FROM date_col)`
- Truncate: `date_trunc('month', date_col)`
- Format: `strftime(date_col, '%Y-%m-%d')`
- Arithmetic: `date_col + INTERVAL '7 days'`

### Aggregations
- Standard: `COUNT()`, `SUM()`, `AVG()`, `MIN()`, `MAX()`
- Advanced: `LIST()`, `STRING_AGG()`, `PERCENTILE_CONT()`
- Window: `ROW_NUMBER()`, `LAG()`, `LEAD()`, `SUM() OVER()`

### Type Casting
- `CAST(col AS TYPE)`
- `col::TYPE` (shorthand)
- `TRY_CAST(col AS TYPE)` (returns NULL on failure)

### Useful DuckDB Features
- `EXCLUDE` in SELECT: `SELECT * EXCLUDE (col1, col2) FROM table`
- `COLUMNS()` for pattern matching: `SELECT COLUMNS('.*_id') FROM table`
- `UNPIVOT` and `PIVOT` for reshaping
- `SAMPLE` for random sampling: `SELECT * FROM table USING SAMPLE 10%`

## Common Query Patterns

### Optional Filters Pattern

Use `WHERE 1=1` when building queries with multiple optional filters:

```sql
SELECT
    o.order_id,
    o.order_date,
    o.status,
    c.name AS customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE 1=1
  AND o.order_date >= '2024-01-01'     -- required filter
  AND o.order_date < '2024-02-01'       -- required filter
  -- Add optional filters below as needed:
  -- AND o.status = 'pending'
  -- AND c.customer_id = 123
ORDER BY o.order_date DESC;
```

### Soft Delete Handling

If a table has a `deleted` or `is_deleted` column:

```sql
-- Include only non-deleted records
WHERE deleted IS NULL OR deleted = false

-- Or using COALESCE
WHERE COALESCE(deleted, false) = false
```

### Date Range Best Practices

```sql
-- GOOD: Use >= and < (half-open interval)
WHERE created_at >= '2024-01-01'
  AND created_at < '2024-02-01'

-- AVOID: BETWEEN is inclusive on both ends
-- This includes all of Jan 1 AND all of Feb 1
WHERE created_at BETWEEN '2024-01-01' AND '2024-02-01'

-- Relative date ranges
WHERE order_date >= current_date - INTERVAL '30 days'
WHERE order_date >= date_trunc('month', current_date)
```

### Aggregation with Filters

```sql
-- Filter before aggregating for performance
SELECT
    status,
    COUNT(*) as order_count,
    SUM(total_amount) as total_revenue
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY status
ORDER BY order_count DESC;
```

### Counting with Conditions

```sql
-- Count specific conditions within groups
SELECT
    customer_id,
    COUNT(*) as total_orders,
    COUNT(*) FILTER (WHERE status = 'completed') as completed_orders,
    COUNT(*) FILTER (WHERE status = 'cancelled') as cancelled_orders
FROM orders
GROUP BY customer_id;
```

### Safe Division (Avoid Divide by Zero)

```sql
-- Using NULLIF to avoid division by zero
SELECT
    customer_id,
    completed_orders * 100.0 / NULLIF(total_orders, 0) as completion_rate
FROM customer_stats;
```

### Finding Duplicates

```sql
-- Find duplicate values
SELECT column_name, COUNT(*) as count
FROM table_name
GROUP BY column_name
HAVING COUNT(*) > 1
ORDER BY count DESC;
```

### Latest Record Per Group

```sql
-- Using window function
WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as rn
  FROM orders
)
SELECT * FROM ranked WHERE rn = 1;

-- Using DISTINCT ON (DuckDB supports this)
SELECT DISTINCT ON (customer_id) *
FROM orders
ORDER BY customer_id, order_date DESC;
```

## Learning and Adding Facts

When you discover new information about the data during conversations or query generation:

**What qualifies as a new fact:**
- Column purposes discovered from context
- Relationships between tables
- Data patterns (e.g., "status column uses specific values")
- Type conversion requirements for joins
- Business logic inferred from queries

**How to handle discoveries:**
- **For 1-2 facts**: Ask inline "I discovered [fact]. Should I add this to the data dictionary?"
- **For 3+ facts**: Present summary and ask for bulk decision
- **If approved**: Use Edit tool to update `data_dictionary.md` directly
- **Report changes**: Mention what section was updated

## Schema Update Detection (On-Demand Only)

Schema updates are **on-demand only**. Do NOT check for changes automatically on every invocation.

**Trigger phrases** - Only run schema update checks when the user explicitly requests:
- "Refresh the schema"
- "Check for schema changes"
- "Resync the database"
- "Update the assets"
- "Add [filename] to the assets" (supports .ddb, .csv, .parquet, or glob patterns)
- "Update the data dictionary"

When the user triggers an update:

### Detect Missing Files
If a source file in `tables_inventory.json` no longer exists:
> "I notice `sales.ddb` no longer exists at the recorded path. Should I remove it from the assets?"

If approved:
- Delete `schema_sales.sql`
- Update `tables_inventory.json`
- Update `data_dictionary.md` (mark tables as removed or delete sections)

### Add New Files
When user says "add [file] to the assets":

**For .ddb files:**
1. Extract schema: `duckdb new_file.ddb -c ".schema"`
2. Create `schema_new_file.sql`
3. Update `tables_inventory.json` with `file_type: "ddb"`
4. Add stub entries to `data_dictionary.md`
5. Run enum detection on new tables
6. Present findings for approval (inline for 1-2 enums, bulk summary for 3+)
7. If approved, update `data_dictionary.md` with enum documentation

**For .csv or .parquet files:**
1. Extract schema: `duckdb -c "DESCRIBE SELECT * FROM '/path/to/file.csv';"`
2. Convert to CREATE TABLE format and save as `schema_<table_name>.sql`
3. Compute table name from filename (lowercase, snake_case, alphanumeric)
4. Update `tables_inventory.json` with `file_type: "csv"` or `"parquet"`
5. Add stub entry to `data_dictionary.md`
6. Run enum detection on VARCHAR columns
7. Present findings for approval

**For glob patterns (e.g., `logs/*.csv`):**
1. Expand glob to list matching files
2. Ask user: "I found N files. Should I treat these as separate tables or a single combined table?"
3. If **separate tables**: Process each file individually as above
4. If **single table**:
   - Extract schema from first file (all files should share schema)
   - Compute table name from glob pattern base
   - Update `tables_inventory.json` with `file_type: "csv_glob"` or `"parquet_glob"`, including `glob_pattern` and `matched_files`
   - Create single schema file and dictionary entry

### Schema Changes
If running schema extraction shows changes:
> "I notice table X has new columns: a, b, c. Should I update the schema files?"

If approved:
- Regenerate affected `schema_<filename>.sql`
- Update `tables_inventory.json`
- If new VARCHAR/TEXT columns are detected, run enum detection
- Ask user if discovered facts should be added to `data_dictionary.md`

### Preserving Data Dictionary Content (CRITICAL)

When updating schema or adding new tables, **never overwrite** `data_dictionary.md`. User-added notes, relationships, and query patterns are valuable and must be preserved.

**Always merge, never replace:**
- Add new table sections for newly discovered tables
- Add new columns to existing table sections
- Preserve ALL existing user-written content: notes, enum values, relationships, query patterns

**For removed tables/columns:**
- Ask user before removing documentation
- Data may have moved to another table or been renamed

**Implementation:**
- Use the Edit tool for surgical updates, NOT the Write tool to regenerate
- When adding a new table, append a new section rather than rewriting the file

## Safety Guidelines

- Generate ONLY read-only queries (SELECT statements)
- NEVER generate INSERT, UPDATE, DELETE, DROP, TRUNCATE, or any data-modifying statements
- If user asks for data modification, politely explain you can only generate read queries
- Always validate columns exist before using them
- When unsure about data meaning, ask the user for clarification rather than guessing

## What You Cannot Do

- Generate data-modifying queries (INSERT, UPDATE, DELETE)
- Access tables not in the schema files
- Guarantee query performance without knowing indexes
- Know the exact current state of the data
- **Execute queries without explicit user request** - By default, only display queries for the user to run themselves

If asked to do something outside your capabilities, explain the limitation clearly.

## Quick Reference

### tables_inventory.json Format

Paths are stored as provided by the user (relative by default, absolute if user specified).

```json
{
  "generated_at": "2025-12-11T17:30:00Z",
  "duckdb_version": "v1.3.2",
  "sources": [
    {
      "file_path": "data/sales.ddb",
      "file_name": "sales.ddb",
      "file_type": "ddb",
      "tables": ["customers", "orders", "order_items"]
    },
    {
      "file_path": "data/transactions.csv",
      "file_name": "transactions.csv",
      "file_type": "csv",
      "tables": ["transactions"]
    },
    {
      "file_path": "data/events.parquet",
      "file_name": "events.parquet",
      "file_type": "parquet",
      "tables": ["events"]
    },
    {
      "file_path": "logs/*.csv",
      "file_name": "logs/*.csv",
      "file_type": "csv_glob",
      "glob_pattern": "logs/*.csv",
      "matched_files": ["jan.csv", "feb.csv", "mar.csv"],
      "tables": ["logs"]
    }
  ],
  "tables": {
    "customers": {
      "source_file": "sales.ddb",
      "file_type": "ddb",
      "columns": [
        {"name": "customer_id", "type": "INTEGER"},
        {"name": "name", "type": "VARCHAR"},
        {"name": "email", "type": "VARCHAR"},
        {"name": "created_at", "type": "DATE"}
      ]
    },
    "transactions": {
      "source_file": "transactions.csv",
      "file_type": "csv",
      "columns": [
        {"name": "transaction_id", "type": "INTEGER"},
        {"name": "customer_id", "type": "INTEGER"},
        {"name": "amount", "type": "DOUBLE"},
        {"name": "transaction_date", "type": "DATE"}
      ]
    },
    "events": {
      "source_file": "events.parquet",
      "file_type": "parquet",
      "columns": [
        {"name": "event_id", "type": "BIGINT"},
        {"name": "event_type", "type": "VARCHAR"},
        {"name": "timestamp", "type": "TIMESTAMP"}
      ]
    },
    "logs": {
      "source_file": "logs/*.csv",
      "file_type": "csv_glob",
      "glob_pattern": "logs/*.csv",
      "columns": [
        {"name": "log_level", "type": "VARCHAR"},
        {"name": "message", "type": "VARCHAR"},
        {"name": "timestamp", "type": "TIMESTAMP"}
      ]
    }
  }
}
```

**File type values:**
- `ddb` - DuckDB database file (multiple tables, use ATTACH for cross-file queries)
- `csv` - Single CSV file (one table, query via file path)
- `parquet` - Single Parquet file (one table, query via file path)
- `csv_glob` - Multiple CSV files as single table (query via glob pattern)
- `parquet_glob` - Multiple Parquet files as single table (query via glob pattern)

### DuckDB CLI Commands

**For .ddb files:**
```bash
# Get tables from a database
duckdb sales.ddb -c "SELECT table_name FROM information_schema.tables WHERE table_schema='main';"

# Get columns for a table
duckdb sales.ddb -c "DESCRIBE customers;"

# Generate schema file
duckdb sales.ddb -c ".schema" > duckdb_sql_assets/schema_sales.sql

# Sample distinct values for enum detection
duckdb sales.ddb -c "SELECT DISTINCT status FROM orders LIMIT 25;"
```

**For .csv files:**
```bash
# Get inferred schema
duckdb -c "DESCRIBE SELECT * FROM 'data/transactions.csv';"

# Sample data
duckdb -c "SELECT * FROM 'data/transactions.csv' LIMIT 10;"

# Sample distinct values for enum detection
duckdb -c "SELECT DISTINCT status FROM 'data/transactions.csv' LIMIT 25;"

# Query with explicit options (if auto-detect fails)
duckdb -c "SELECT * FROM read_csv('data/transactions.csv', header=true, delim=',');"
```

**For .parquet files:**
```bash
# Get embedded schema
duckdb -c "DESCRIBE SELECT * FROM 'data/events.parquet';"

# Sample data
duckdb -c "SELECT * FROM 'data/events.parquet' LIMIT 10;"

# Sample distinct values for enum detection
duckdb -c "SELECT DISTINCT category FROM 'data/events.parquet' LIMIT 25;"
```

**General:**
```bash
# Check DuckDB version
duckdb -version

# Expand glob pattern to see matched files
ls logs/*.csv
```

Overview

This skill helps you generate, review, and improve DuckDB SQL queries for .ddb, CSV, and Parquet sources. It prioritizes existing project documentation in duckdb_sql_assets/ and is display-only by default, producing ready-to-run queries and guidance for in-memory DuckDB sessions.

How this skill works

On every request the skill first checks duckdb_sql_assets/ for tables_inventory.json, schema_*.sql, and data_dictionary.md and uses those files as the single source of truth. It produces DuckDB-compliant SQL (including ATTACH preambles for .ddb files) and never runs queries unless you explicitly request execution. For new projects it guides you through a controlled initialization flow to generate assets and schema files.

When to use it

  • You need a DuckDB SQL query written for .ddb, .csv, or .parquet data.
  • You want to explore or document what tables and fields exist in a project.
  • You need an existing SQL reviewed or optimized for DuckDB.
  • You want safe, copy-paste-ready queries with proper ATTACH and path conventions.
  • You want to initialize duckdb assets (use initialization trigger phrases).

Best practices

  • Always let the skill read duckdb_sql_assets/ first — those files are the source of truth.
  • Default behavior is to display SQL only; explicitly request 'run', 'execute', or 'show results' to run queries.
  • Use relative paths by default; only use absolute paths when you explicitly provide them.
  • Include ATTACH statements for .ddb files using the _db_<slug> alias convention.
  • Cross-reference data_dictionary.md and schema_*.sql when writing joins, enums, and filters.

Example use cases

  • Generate a query that joins a table inside sales.ddb with a transactions CSV and filters by recent dates.
  • Review and optimize a provided SELECT with joins and GROUP BY for DuckDB performance.
  • Explain which tables and key fields exist using tables_inventory.json and data_dictionary.md.
  • Initialize duckdb_sql_assets/ for a new project by providing file paths or glob patterns.
  • Detect likely enum columns during initial asset creation and add them to the data dictionary.

FAQ

What if duckdb_sql_assets/ exists?

I will read tables_inventory.json, schema_*.sql, and data_dictionary.md and generate queries using those assets. I will not access .ddb files directly or re-run schema extraction.

Will you execute queries?

No — I only generate and display SQL by default. I will execute only if you explicitly ask me to run a query or show results.