analyzing-data skill

This skill queries the data warehouse to answer business questions with actionable insights.

npx playbooks add skill astronomer/agents --skill analyzing-data

Review the files below or copy the command above to add this skill to your agents.

Files (25)
SKILL.md
---
name: analyzing-data
description: Queries data warehouse and answers business questions about data. Handles questions requiring database/warehouse queries including "who uses X", "how many Y", "show me Z", "find customers", "what is the count", data lookups, metrics, trends, or SQL analysis.
---

# Data Analysis

Answer business questions by querying the data warehouse. The kernel auto-starts on first `exec` call.

**All CLI commands below are relative to this skill's directory.** Before running any `scripts/cli.py` command, `cd` to the directory containing this file.

## Workflow

1. **Pattern lookup** — Check for a cached query strategy:
   ```bash
   uv run scripts/cli.py pattern lookup "<user's question>"
   ```
   If a pattern exists, follow its strategy. Record the outcome after executing:
   ```bash
   uv run scripts/cli.py pattern record <name> --success  # or --failure
   ```

2. **Concept lookup** — Find known table mappings:
   ```bash
   uv run scripts/cli.py concept lookup <concept>
   ```

3. **Table discovery** — If the cache misses, search the codebase (`Grep pattern="<concept>" glob="**/*.sql"`) or query `INFORMATION_SCHEMA`. See [reference/discovery-warehouse.md](reference/discovery-warehouse.md).
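   For example, a minimal `INFORMATION_SCHEMA` probe (a sketch; `LOWER ... LIKE` is used here because `ILIKE` is not available in every warehouse dialect):
   ```bash
   uv run scripts/cli.py exec "df = run_sql(\"SELECT table_schema, table_name FROM INFORMATION_SCHEMA.TABLES WHERE LOWER(table_name) LIKE '%subscription%'\")"
   uv run scripts/cli.py exec "print(df)"
   ```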

4. **Execute query**:
   ```bash
   uv run scripts/cli.py exec "df = run_sql('SELECT ...')"
   uv run scripts/cli.py exec "print(df)"
   ```

5. **Cache learnings** — Always cache before presenting results:
   ```bash
   # Cache concept → table mapping
   uv run scripts/cli.py concept learn <concept> <TABLE> -k <KEY_COL>
   # Cache query strategy (if discovery was needed)
   uv run scripts/cli.py pattern learn <name> -q "question" -s "step" -t "TABLE" -g "gotcha"
   ```

6. **Present findings** to the user.

## Kernel Functions

| Function | Returns |
|----------|---------|
| `run_sql(query, limit=100)` | Polars DataFrame |
| `run_sql_pandas(query, limit=100)` | Pandas DataFrame |

`pl` (Polars) and `pd` (Pandas) are pre-imported.
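
For example, a typical query-then-inspect sequence (a sketch; `orders` is a hypothetical table):

```bash
uv run scripts/cli.py exec "df = run_sql('SELECT status, COUNT(*) AS n FROM orders GROUP BY status')"
uv run scripts/cli.py exec "print(df.sort('n', descending=True))"
```

Variables such as `df` persist in the kernel between `exec` calls, so you can query once and post-process with Polars or Pandas in later calls.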

## CLI Reference

### Kernel

```bash
uv run scripts/cli.py warehouse list      # List warehouses
uv run scripts/cli.py start [-w name]     # Start kernel (with optional warehouse)
uv run scripts/cli.py exec "..."          # Execute Python code
uv run scripts/cli.py status              # Kernel status
uv run scripts/cli.py restart             # Restart kernel
uv run scripts/cli.py stop                # Stop kernel
uv run scripts/cli.py install <pkg>       # Install package
```

### Concept Cache

```bash
uv run scripts/cli.py concept lookup <name>                     # Look up
uv run scripts/cli.py concept learn <name> <TABLE> -k <KEY_COL> # Learn
uv run scripts/cli.py concept list                               # List all
uv run scripts/cli.py concept import -p /path/to/warehouse.md   # Bulk import
```

### Pattern Cache

```bash
uv run scripts/cli.py pattern lookup "question"                                      # Look up
uv run scripts/cli.py pattern learn <name> -q "..." -s "..." -t "TABLE" -g "gotcha"  # Learn
uv run scripts/cli.py pattern record <name> --success                                # Record outcome
uv run scripts/cli.py pattern list                                                   # List all
uv run scripts/cli.py pattern delete <name>                                          # Delete
```

### Table Schema Cache

```bash
uv run scripts/cli.py table lookup <TABLE>            # Look up schema
uv run scripts/cli.py table cache <TABLE> -c '[...]'  # Cache schema
uv run scripts/cli.py table list                       # List cached
uv run scripts/cli.py table delete <TABLE>             # Delete
```

### Cache Management

```bash
uv run scripts/cli.py cache status                # Stats
uv run scripts/cli.py cache clear [--stale-only]  # Clear
```

## References

- [reference/discovery-warehouse.md](reference/discovery-warehouse.md) — Large table handling, warehouse exploration, INFORMATION_SCHEMA queries
- [reference/common-patterns.md](reference/common-patterns.md) — SQL templates for trends, comparisons, top-N, distributions, cohorts

Overview

This skill queries a data warehouse and answers business questions by running SQL and returning dataframes. It supports lookups, table discovery, cached query strategies, and repeatable analysis workflows to produce metrics, trends, and record-level results. The skill is designed for analysts and agents that need reliable, explainable data lookups and aggregated answers.

How this skill works

The skill accepts a natural-language question, then checks cached query patterns and concept-to-table mappings. If cache misses occur, it discovers tables via code search or INFORMATION_SCHEMA, builds and executes SQL with run_sql or run_sql_pandas, and returns a dataframe. Successful strategies and discovered mappings are cached for faster future answers.
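
A cache-first session might look like the following sketch (the question, the active_users concept, and the FCT_LOGINS table are illustrative, not part of any real schema):

  uv run scripts/cli.py pattern lookup "how many active users do we have?"
  uv run scripts/cli.py concept lookup active_users
  uv run scripts/cli.py exec "df = run_sql('SELECT COUNT(DISTINCT user_id) AS n FROM FCT_LOGINS')"
  uv run scripts/cli.py concept learn active_users FCT_LOGINS -k user_id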

When to use it

  • Answering product or marketing questions like "who uses feature X" or "find customers who did Y"
  • Calculating metrics and counts such as "how many active users" or "what is the conversion rate"
  • Running ad-hoc lookups or record-level queries for support and investigations
  • Exploring trends and distributions over time or cohorts
  • Automating recurring data requests where caching improves speed and consistency

Best practices

  • Start by looking up existing patterns and concepts to reuse validated queries
  • Always cache newly discovered concept → table mappings and successful patterns
  • Limit result sizes during exploration and use aggregated queries for production answers (see the sketch after this list)
  • Document gotchas and key columns when saving pattern strategies
  • Use INFORMATION_SCHEMA and codebase search for robust table discovery before guessing schemas
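
As a sketch of the exploration-then-aggregate practice (the EVENTS table is illustrative; limit is the documented run_sql parameter):

  uv run scripts/cli.py exec "df = run_sql('SELECT * FROM EVENTS', limit=20)"
  uv run scripts/cli.py exec "df = run_sql('SELECT event_type, COUNT(*) AS n FROM EVENTS GROUP BY event_type')"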

Example use cases

  • Find all customers who churned in the last 30 days and return contact identifiers
  • Report weekly active users and compare month-over-month trends
  • Identify top 10 products by revenue for the last quarter (sketched below)
  • Look up a user by email and return associated events and properties
  • Investigate sudden metric drops by extracting event-level rows and aggregating by hour
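
For instance, the top-products case might run as follows (FCT_ORDERS and its columns are hypothetical; the quarter filter is omitted because date arithmetic varies by warehouse dialect):

  uv run scripts/cli.py exec "df = run_sql('SELECT product_id, SUM(revenue) AS total FROM FCT_ORDERS GROUP BY product_id ORDER BY total DESC LIMIT 10')"
  uv run scripts/cli.py exec "print(df)"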

FAQ

What functions run queries and return results?

Use run_sql(query, limit=100) for a Polars dataframe or run_sql_pandas(query, limit=100) for a Pandas dataframe.
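
A minimal illustration of the Pandas variant (the USERS table is hypothetical):

  uv run scripts/cli.py exec "pdf = run_sql_pandas('SELECT COUNT(*) AS n FROM USERS')"
  uv run scripts/cli.py exec "print(pdf)"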

How do I speed up repeated questions?

Save concept mappings and query patterns after discovery using the concept learn and pattern learn commands so future requests can reuse them.
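
For example (the concept, pattern, and table names are illustrative):

  uv run scripts/cli.py concept learn churned_customers FCT_SUBSCRIPTIONS -k customer_id
  uv run scripts/cli.py pattern learn churn-30d -q "who churned in the last 30 days?" -s "filter FCT_SUBSCRIPTIONS by cancellation date" -t "FCT_SUBSCRIPTIONS" -g "cancel_date is NULL for active subscriptions"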