
dbt-projects skill

/agents/claude/skills/dbt-projects

This skill documents dbt project conventions, covering YAML structure, sources, macros, testing, and CI patterns, to guide you when designing, validating, and auditing models.

npx playbooks add skill gwenwindflower/.charmschool --skill dbt-projects


Files (1): SKILL.md (7.0 KB)
---
name: dbt-projects
description: "Documents dbt project conventions: YAML structure for models and sources, reusable macros, testing patterns (data, expression, and unit tests), and CI/CD flow. Use when designing, validating, or auditing dbt models; adding sources, seeds, or tests; or configuring CI for a dbt repository."
---
# dbt Project Conventions

## YAML Conventions

### Model Documentation (`*.yml`)

Each model YAML file can contain the following top-level keys:

- **models**: Documentation and column tests
- **data_tests**: Model-level tests
- **unit_tests**: Unit test definitions
- **semantic_models**: MetricFlow semantic layer
- **metrics**: Metric definitions
- **saved_queries**: Pre-defined metric queries

Example structure:

```yaml
models:
  - name: orders
    description: One row per order.
    data_tests:
      - dbt_utils.expression_is_true:
          expression: "order_total = subtotal + tax_paid"
    columns:
      - name: order_id
        description: Primary key.
        data_tests:
          - not_null
          - unique
```

### Source Definitions (`__sources.yml`)

Sources are defined in `models/staging/__sources.yml`:

```yaml
sources:
  - name: ecom
    schema: raw
    freshness:
      warn_after: {count: 24, period: hour}
    tables:
      - name: raw_orders
        loaded_at_field: ordered_at
```
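Downstream, staging models reference these tables with the `source()` macro rather than hard-coded relation names:

```sql
select * from {{ source('ecom', 'raw_orders') }}
```

This lets dbt track lineage from raw tables and enforce the freshness policy above.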

## Macros

### `cents_to_dollars(column_name)`

Converts cents to dollars with proper decimal handling. Uses adapter dispatch for cross-database compatibility.

```sql
{{ cents_to_dollars('amount') }}  -- Returns: (amount / 100)::numeric(16, 2)
```
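The macro itself might be defined along these lines; this is a sketch of the standard adapter-dispatch pattern, not the project's exact implementation:

```sql
{% macro cents_to_dollars(column_name) %}
    {{ return(adapter.dispatch('cents_to_dollars')(column_name)) }}
{% endmacro %}

{# Fallback used by most warehouses #}
{% macro default__cents_to_dollars(column_name) %}
    ({{ column_name }} / 100)::numeric(16, 2)
{% endmacro %}

{# BigQuery lacks the :: cast syntax, so it gets its own implementation #}
{% macro bigquery__cents_to_dollars(column_name) %}
    round(cast(({{ column_name }} / 100) as numeric), 2)
{% endmacro %}
```

At compile time dbt picks the `<adapter>__`-prefixed variant matching the active warehouse, falling back to `default__`.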

### `generate_schema_name(custom_schema_name, node)`

Custom schema routing:

- Seeds → `raw` schema (always)
- Prod target → `<default>_<custom>` schema naming
- Non-prod targets → default target schema
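A sketch of what that routing logic looks like in `macros/generate_schema_name.sql` (assuming the prod target is named `prod`):

```sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if node.resource_type == 'seed' -%}
        {# Seeds always land in the raw schema #}
        raw
    {%- elif target.name == 'prod' and custom_schema_name is not none -%}
        {# Prod: <default>_<custom>, e.g. prod_marketing #}
        {{ target.schema }}_{{ custom_schema_name | trim }}
    {%- else -%}
        {# Non-prod: everything builds into the developer's target schema #}
        {{ target.schema }}
    {%- endif -%}
{%- endmacro %}
```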

## Testing Patterns

### Data Tests

Column-level tests in YAML:

```yaml
columns:
  - name: order_id
    data_tests:
      - not_null
      - unique
      - relationships:
          to: ref('stg_orders')
          field: order_id
```

### Expression Tests

Model-level validation using `dbt_utils.expression_is_true`:

```yaml
data_tests:
  - dbt_utils.expression_is_true:
      expression: "order_total - tax_paid = subtotal"
```

### Unit Tests

Test business logic with mocked inputs:

```yaml
unit_tests:
  - name: test_order_items_compute_to_bools_correctly
    model: orders
    given:
      - input: ref('order_items')
        rows:
          - {order_id: 1, is_drink_item: true, is_food_item: false}
      - input: ref('stg_orders')
        rows:
          - {order_id: 1}
    expect:
      rows:
        - {order_id: 1, is_drink_order: true, is_food_order: false}
```
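For context, the model logic such a unit test exercises might look like the following hypothetical sketch, which aggregates item-level flags up to the order grain:

```sql
-- Boolean-flag portion of models/marts/orders.sql (hypothetical sketch)
select
    stg_orders.order_id,
    -- bool_or is Postgres syntax; other warehouses use max() or boolor_agg()
    bool_or(order_items.is_drink_item) as is_drink_order,
    bool_or(order_items.is_food_item) as is_food_order
from {{ ref('stg_orders') }} as stg_orders
left join {{ ref('order_items') }} as order_items
    on stg_orders.order_id = order_items.order_id
group by 1
```

The unit test swaps both `ref()` inputs for the mocked rows, so the assertion runs without touching warehouse data.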

### dbt Packages Used

| Package | Version | Purpose |
| --------- | --------- | --------- |
| `dbt-labs/dbt_utils` | 1.1.1 | Common macros and tests |
| `calogica/dbt_date` | 0.10.0 | Date utilities |
| `dbt-labs/dbt-audit-helper` | main | Audit/comparison helpers |
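These would be declared in `packages.yml` roughly as follows; the git URL for the audit helper is an assumption based on the `main` revision pin:

```yaml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_date
    version: 0.10.0
  - git: "https://github.com/dbt-labs/dbt-audit-helper.git"
    revision: main
```

Remember to run `dbt deps` after editing this file (see gotcha 5 below).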

## CI/CD

### GitHub Actions (`.github/workflows/ci.yml`)

PRs to `main` or `staging` trigger dbt Cloud CI jobs against:

- Snowflake
- BigQuery
- Postgres

Jobs create isolated schemas: `dbt_jsdx__pr_<branch_name>`
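The trigger portion of such a workflow might look like this sketch; the job steps themselves depend on dbt Cloud account and job IDs, which are omitted here:

```yaml
# .github/workflows/ci.yml (trigger sketch)
name: dbt Cloud CI
on:
  pull_request:
    branches:
      - main
      - staging
```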

### Branching Strategy (WAP Flow)

- `main`: Production (read-only, deploys to `prod` schema)
- `staging`: Pre-production testing
- Feature branches: Branch from `staging`, merge back to `staging`

## Important Gotchas

1. **Seed paths**: The sample data lives in `jaffle-data/`. To use it, add `"jaffle-data"` to `seed-paths` in `dbt_project.yml`, run `dbt seed`, then remove it.

2. **dbt Cloud required**: This project is designed for dbt Cloud. Local dbt Core works, but SQLFluff is configured to use the `dbt-cloud` templater.

3. **Raw data schema**: All source tables expect a `raw` schema in your warehouse.

4. **Schema generation**: The custom `generate_schema_name` macro routes schemas differently in prod vs non-prod environments.

5. **Packages first**: Always run `dbt deps` before `dbt build` on a fresh clone.

6. **Semantic layer**: The project includes MetricFlow semantic models and metrics, which require dbt Cloud to query interactively.
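The temporary seed-path change from gotcha 1 is a one-line edit to `dbt_project.yml` (assuming the project otherwise uses the default `seeds` directory):

```yaml
# dbt_project.yml
seed-paths: ["seeds", "jaffle-data"]  # remove "jaffle-data" after `dbt seed`
```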

## Adding New Models

dbt models fit into three broad categories: staging, intermediate, and marts.

Staging models sit closest to the raw source data and should apply only simple transformations: renaming, type casting, and basic cleaning. They maintain a 1-to-1 relationship with a single raw source table, except in extremely rare circumstances (e.g., if your company's order data landed in 5 identical tables split by global region, and you never interacted with those tables separately, you might unify them in the staging layer into a single `stg_orders` model).

Intermediate models usually have verb names, labeled by the transformational action they apply. For instance, `int_aggregate_orders_by_customer` would be a table that joins `stg_orders` to `stg_customers` and aggregates a variety of order facts at the customer grain. Models in the intermediate layer select from _staging models or other intermediate models **only**_, never from raw source data. The only models that should contain a `{{ source() }}` macro are staging models.
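A hypothetical sketch of such an intermediate model (column names are illustrative):

```sql
-- models/intermediate/int_aggregate_orders_by_customer.sql (sketch)
select
    stg_customers.customer_id,
    count(stg_orders.order_id) as count_orders,
    sum(stg_orders.order_total) as lifetime_spend
from {{ ref('stg_customers') }} as stg_customers
left join {{ ref('stg_orders') }} as stg_orders
    on stg_customers.customer_id = stg_orders.customer_id
group by 1
```

Note that it references only staging models via `ref()`, never `source()`.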

Mart models are the most refined, business-facing models: they combine staging and/or intermediate models into datasets ready for analysis, BI, AI engineering, or data science work. Marts are the source of truth for the business and the user-facing end of the dbt transformation pipeline.

There are different philosophies on marts, which have upstream effects on how you design and construct your DAG. Many people build toward a star schema, which allows flexible combinations of facts and dimensions; at the other end of the spectrum, the "OBT" (One Big Table) approach is also extremely popular. OBT modeling aims to provide a convenient, pre-computed table that services the most common query patterns in a simple and performant way, at the cost of flexibility and storage efficiency. Modern data platforms, with cheap storage separated from compute and highly optimized query engines running on powerful, scalable machines, have significantly changed how people weigh these options.

In reality, there are many shades in between these approaches; purists are rare, and a mixture of both is common. Building a rigorous star schema of marts, then constructing a secondary layer of OBT-style marts on top of it for quick and easy access to key metrics, is probably the most common approach in real, everyday work. Ultimately, the right approach depends on the specifics of the business and stakeholder needs.

### Staging Model Checklist

1. Create `models/staging/stg_<entity>.sql` with source → renamed CTE pattern
2. Add source table to `models/staging/__sources.yml` if new
3. Create `models/staging/stg_<entity>.yml` with description and tests
4. Follow lowercase, explicit alias, 4-space indent conventions
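The source → renamed CTE pattern from step 1, as a sketch; the entity and column names here are hypothetical:

```sql
-- models/staging/stg_customers.sql (sketch)
with

source as (
    select * from {{ source('ecom', 'raw_customers') }}
),

renamed as (
    select
        id as customer_id,
        name as customer_name,
        cast(created_at as timestamp) as created_at
    from source
)

select * from renamed
```

Staging models stay this simple: rename, cast, and basic cleanup only.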

### Mart Model Checklist

1. Create `models/marts/<entity>.sql` joining staging models
2. Create `models/marts/<entity>.yml` with:
   - Model description
   - Column descriptions and tests
   - Semantic model definition (optional)
   - Metrics (optional)
3. Materialization defaults to `table` (configured in `dbt_project.yml`)
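A minimal YAML sketch covering step 2, with the optional semantic model and metrics omitted:

```yaml
models:
  - name: customers
    description: One row per customer, with lifetime order facts.
    columns:
      - name: customer_id
        description: Primary key.
        data_tests:
          - not_null
          - unique
```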

Overview

This skill codifies dbt project conventions, tests, macros, and CI patterns so teams can scaffold and maintain consistent analytics engineering repositories. It documents YAML structures for models and sources, reusable macros (like cents_to_dollars and generate_schema_name), testing patterns (data, expression, and unit tests), and branch/CI expectations. Use it to align new models, seeds, and deployments with established best practices and to avoid common gotchas.

How this skill works

The skill inspects dbt YAML model files, source definition files, macro usage, and CI workflow rules to verify conventions and suggest actions. It highlights where to place staging, intermediate, and mart models, enforces test patterns and seed handling, and explains schema routing via the generate_schema_name macro. It also summarizes package dependencies and the expected dbt Cloud-centric CI flow.

When to use it

  • Onboarding a new data engineer to a dbt repository to learn project conventions and file locations
  • Adding a new model, source, or seed and needing explicit checklist steps (staging → intermediate → mart)
  • Writing model YAML for documentation, column tests, semantic models, or metrics
  • Configuring CI/CD or preparing a PR to run dbt Cloud jobs across Snowflake, BigQuery, or Postgres
  • Debugging failing tests or schema-routing issues caused by environment differences

Best practices

  • Place source definitions in models/staging/__sources.yml and add loaded_at_field when available
  • Run dbt deps before dbt build on fresh clones and include jaffle-data in seed-paths only when seeding
  • Keep staging models simple (rename/type-cast only) and reference source() only in staging
  • Name intermediate models with verb-style names and restrict them to upstream staging/intermediate models
  • Materialize marts as tables by default and include explicit descriptions, column tests, and optional metrics

Example use cases

  • Create a new stg_customers model: add stg SQL, source entry, and stg_customers.yml with tests
  • Write a model-level expression test validating business math with dbt_utils.expression_is_true
  • Add a cents_to_dollars macro call to display monetary values consistently across warehouses
  • Implement a unit test that mocks inputs for a complex transform to verify boolean logic
  • Troubleshoot schema naming differences by reviewing generate_schema_name behavior for prod vs non-prod

FAQ

Do I need dbt Cloud to use everything in this project?

The project is designed for dbt Cloud; many CI and interactive semantic features expect dbt Cloud, though core models and tests run with dbt Core locally.

Where should I put sample seed data?

Place sample CSVs in jaffle-data and add it temporarily to seed-paths, run dbt seed, then remove jaffle-data from seed-paths when finished.