---
name: dbt-modeling
description: >
  Writing dbt models with proper CTE patterns, SQL structure, and layer-specific templates. Use this
  skill when writing or refactoring dbt models, implementing CTE patterns, creating
  staging/intermediate/mart models, or ensuring proper SQL structure and dependencies.
---
# dbt Modeling
## Purpose
Transform AI agents into experts on writing production-quality dbt models, providing guidance on CTE
patterns, SQL structure, and best practices for creating maintainable and performant data models
across all medallion architecture layers.
## When to Use This Skill
Activate this skill when users ask about:
- Writing or refactoring dbt models
- Implementing CTE (Common Table Expression) patterns
- Creating staging, intermediate, or mart models
- Structuring SQL for readability and maintainability
- Implementing proper model dependencies with ref() and source()
- Converting existing SQL to dbt model format
- Debugging model SQL structure issues
---
## CTE Pattern Structure
All dbt models should follow this consistent CTE pattern for readability and maintainability:
```sql
-- Import CTEs (staging and intermediate models)
with customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

-- Logical CTEs (business logic)
customer_metrics as (
    select
        customer_id,
        count(*) as order_count,
        sum(order_amount) as lifetime_value
    from orders
    group by customer_id
),

-- Final CTE (column selection and standardization)
final as (
    select
        -- Primary key
        customers.customer_id,

        -- Attributes
        customers.customer_name,
        customers.customer_email,

        -- Metrics
        coalesce(customer_metrics.order_count, 0) as lifetime_orders,
        coalesce(customer_metrics.lifetime_value, 0) as lifetime_value,

        -- Metadata
        current_timestamp() as dbt_updated_at
    from customers
    left join customer_metrics
        on customers.customer_id = customer_metrics.customer_id
)

select * from final
```
**Official dbt Documentation**:
[How we structure our dbt projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview)
---
## Bronze Layer: Staging Model Template
**Purpose**: One-to-one with source tables. Clean and standardize only.
**Materialization**: `ephemeral` (set at folder level in dbt_project.yml)
### Basic Staging Model
```sql
-- models/bronze/stg_salesforce__accounts.sql
select
    -- Primary key
    id as account_id,

    -- Attributes
    name as account_name,
    type as account_type,
    industry,

    -- Standardized fields
    upper(trim(email)) as email_clean,
    cast(annual_revenue as number) as annual_revenue,

    -- Metadata
    current_timestamp() as dbt_loaded_at
from {{ source('salesforce', 'accounts') }}
```
### Staging Model with Light Cleaning
```sql
-- models/bronze/stg_ecommerce__customers.sql
select
    -- Primary key
    customer_id,

    -- Attributes (cleaned)
    trim(first_name) as first_name,
    trim(last_name) as last_name,
    lower(trim(email)) as email,

    -- Phone standardization
    regexp_replace(phone, '[^0-9]', '') as phone_clean,

    -- Boolean conversions
    case
        when status = 'active' then true
        else false
    end as is_active,

    -- Date standardization
    cast(created_at as timestamp) as created_at,

    -- Metadata
    current_timestamp() as dbt_loaded_at
from {{ source('ecommerce', 'customers') }}
```
### Staging Rules
✅ **DO**:
- Use `{{ source() }}` for source references (see the sources.yml sketch after this list)
- Rename columns to standard naming conventions
- Cast data types explicitly
- Clean data (trim, upper/lower, standardize formats)
- Add metadata columns (dbt_loaded_at, dbt_updated_at)
❌ **DON'T**:
- Join multiple sources
- Add business logic or calculations
- Aggregate or group data
- Filter rows (except for obvious bad data)
- Hard-code table names
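
The `{{ source() }}` references above resolve through a sources file. A minimal sketch, assuming a simple raw-landing layout (the file path and the `database`/`schema` values are illustrative, not from this project):

```yaml
# models/bronze/_sources.yml -- path and raw-layer names are assumptions
version: 2

sources:
  - name: salesforce
    database: raw          # assumed landing database
    schema: salesforce
    tables:
      - name: accounts
  - name: ecommerce
    database: raw
    schema: ecommerce
    tables:
      - name: customers
```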
---
## Silver Layer: Intermediate Model Template
**Purpose**: Reusable business logic, enrichment, and complex transformations.
**Materialization**: `ephemeral` or `table` (set at folder level, override if needed)
### Basic Intermediate Model
```sql
-- models/silver/int_customers__with_orders.sql
with customers as (
    select * from {{ ref('stg_salesforce__customers') }}
),

orders as (
    select * from {{ ref('stg_ecommerce__orders') }}
),

customer_order_metrics as (
    select
        customer_id,
        count(distinct order_id) as total_orders,
        sum(order_amount) as lifetime_value,
        min(order_date) as first_order_date,
        max(order_date) as last_order_date
    from orders
    group by customer_id
),

final as (
    select
        c.customer_id,
        c.customer_name,
        c.customer_email,
        coalesce(m.total_orders, 0) as total_orders,
        coalesce(m.lifetime_value, 0) as lifetime_value,
        m.first_order_date,
        m.last_order_date,
        datediff(day, m.first_order_date, m.last_order_date) as customer_tenure_days
    from customers c
    left join customer_order_metrics m
        on c.customer_id = m.customer_id
)

select * from final
```
### Complex Intermediate Model with Business Logic
```sql
-- models/silver/int_customers__segmented.sql
with customer_metrics as (
    select * from {{ ref('int_customers__with_orders') }}
),

rfm_scores as (
    select
        customer_id,

        -- Recency (days since last order)
        datediff(day, last_order_date, current_date()) as recency_days,

        -- Frequency
        total_orders as frequency,

        -- Monetary
        lifetime_value as monetary,

        -- Quartile scoring
        ntile(4) over (order by datediff(day, last_order_date, current_date()) desc) as recency_score,
        ntile(4) over (order by total_orders) as frequency_score,
        ntile(4) over (order by lifetime_value) as monetary_score
    from customer_metrics
    where last_order_date is not null
),

final as (
    select
        customer_id,
        recency_days,
        frequency,
        monetary,
        recency_score,
        frequency_score,
        monetary_score,

        -- RFM segment
        case
            when recency_score >= 3 and frequency_score >= 3 and monetary_score >= 3 then 'Champions'
            when recency_score >= 3 and frequency_score >= 2 then 'Loyal Customers'
            when recency_score >= 3 and monetary_score >= 3 then 'Big Spenders'
            when recency_score <= 2 and frequency_score >= 3 then 'At Risk'
            when recency_score <= 1 then 'Lost'
            else 'Regular'
        end as customer_segment
    from rfm_scores
)

select * from final
```
### Intermediate Rules
✅ **DO**:
- Reference staging and other intermediate models with `{{ ref() }}`
- Add business logic and calculations
- Create reusable components
- Use descriptive CTE names
- Group related logic into CTEs
❌ **DON'T**:
- Reference `{{ source() }}` directly
- Add presentation-layer logic (save for marts)
- Create one-off transformations (ensure reusability)
---
## Gold Layer: Dimension Model Template
**Purpose**: Business entities ready for BI tools.
**Materialization**: `table` (set at folder level)
### Basic Dimension
```sql
-- models/gold/dim_customers.sql
with customers as (
    select * from {{ ref('int_customers__with_orders') }}
),

-- int_customers__segmented only outputs RFM columns, so join it back
-- to the base customer attributes instead of selecting from it alone
segments as (
    select
        customer_id,
        customer_segment
    from {{ ref('int_customers__segmented') }}
)

select
    -- Primary key
    customers.customer_id,

    -- Attributes
    customers.customer_name,
    customers.customer_email,

    -- Metrics
    customers.total_orders,
    customers.lifetime_value,
    customers.first_order_date,
    customers.last_order_date,
    customers.customer_tenure_days,

    -- Segmentation
    segments.customer_segment,

    -- Business classification
    case
        when segments.customer_segment = 'Champions' then 'High Value'
        when segments.customer_segment in ('Loyal Customers', 'Big Spenders') then 'Medium Value'
        else 'Low Value'
    end as customer_value_tier,

    -- Flags
    case when customers.last_order_date >= dateadd(day, -90, current_date()) then true else false end as is_active_90d,
    case when customers.total_orders = 1 then true else false end as is_one_time_buyer,

    -- Metadata
    current_timestamp() as dbt_updated_at
from customers
left join segments
    on customers.customer_id = segments.customer_id
```
### Dimension with Ghost Records (Unknown Members)
```sql
-- models/gold/dim_products.sql
-- Includes ghost key for unknown/missing products
with products as (
    select * from {{ ref('int_products__enriched') }}
),

ghost_key as (
    select
        -1 as product_id,
        'Unknown' as product_name,
        'Unknown' as product_category,
        0.00 as product_price,
        false as is_active,
        current_timestamp() as dbt_updated_at
),

final as (
    select * from products
    union all
    select * from ghost_key
)

select * from final
```
### Dimension Rules
✅ **DO**:
- Include primary key as first column
- Add business-friendly attributes
- Include calculated flags and classifications
- Add metadata columns
- Document all columns in schema.yml
- Test with `dbt_constraints.primary_key` (see the schema.yml sketch after this list)
❌ **DON'T**:
- Include transaction-level data (that's for facts)
- Create overly wide tables (be selective)
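
A minimal schema.yml sketch for the primary key test, assuming the `dbt_constraints` package is installed (the file path and description are illustrative):

```yaml
# models/gold/_gold__models.yml -- path is illustrative
version: 2

models:
  - name: dim_customers
    columns:
      - name: customer_id
        description: Primary key for the customer dimension
        tests:
          - dbt_constraints.primary_key
```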
---
## Gold Layer: Fact Model Template
**Purpose**: Business processes and transactions.
**Materialization**: `table` or `incremental` (override at model level for incremental)
### Basic Fact Table
```sql
-- models/gold/fct_orders.sql
with orders as (
    select * from {{ ref('stg_ecommerce__orders') }}
),

customers as (
    select customer_id from {{ ref('dim_customers') }}
),

products as (
    select product_id from {{ ref('dim_products') }}
)

select
    -- Primary key
    orders.order_id,

    -- Foreign keys
    coalesce(customers.customer_id, -1) as customer_id,  -- Ghost key for unknown
    coalesce(products.product_id, -1) as product_id,

    -- Attributes
    orders.order_date,
    orders.order_status,

    -- Metrics
    orders.order_quantity,
    orders.order_amount,
    orders.discount_amount,
    orders.tax_amount,
    orders.total_amount,

    -- Metadata
    current_timestamp() as dbt_updated_at
from orders
left join customers on orders.customer_id = customers.customer_id
left join products on orders.product_id = products.product_id
```
### Incremental Fact Table
```sql
-- models/gold/fct_order_lines.sql
{{ config(
    materialized='incremental',
    unique_key='order_line_id',
    incremental_strategy='merge',
    merge_exclude_columns=['dbt_inserted_at'],
    cluster_by=['order_date']
) }}

with order_lines as (
    select * from {{ ref('stg_ecommerce__order_lines') }}
)

select
    -- Primary key
    order_line_id,

    -- Foreign keys
    order_id,
    product_id,
    customer_id,

    -- Attributes
    order_date,
    line_number,

    -- Metrics
    quantity,
    unit_price,
    discount_percent,
    line_total,

    -- Metadata
    -- merge_exclude_columns skips this column when existing rows are
    -- updated, so each row keeps its original insert timestamp and this
    -- value only lands on newly inserted rows
    current_timestamp() as dbt_inserted_at,
    current_timestamp() as dbt_updated_at
from order_lines

{% if is_incremental() %}
where order_date > (select max(order_date) from {{ this }})
{% endif %}
```
### Fact Rules
✅ **DO**:
- Include all foreign keys to dimensions
- Use ghost keys (-1) for unknown/missing references
- Include metrics and measures
- Use incremental for large tables (millions+ rows)
- Add clustering keys for large tables
- Test with `dbt_constraints.foreign_key` (see the schema.yml sketch after this list)
❌ **DON'T**:
- Denormalize dimension attributes into facts (use foreign keys)
- Skip foreign key tests
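
A matching schema.yml sketch for the foreign key test, again assuming the `dbt_constraints` package (model and column names follow the fact template above):

```yaml
models:
  - name: fct_orders
    columns:
      - name: customer_id
        tests:
          - dbt_constraints.foreign_key:
              pk_table_name: ref('dim_customers')
              pk_column_name: customer_id
      - name: product_id
        tests:
          - dbt_constraints.foreign_key:
              pk_table_name: ref('dim_products')
              pk_column_name: product_id
```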
---
## Model Configuration Strategy
### Folder-Level First (in dbt_project.yml)
Most configuration should be at the folder level:
```yaml
models:
  your_project:
    bronze:
      +materialized: ephemeral
      +tags: ["bronze", "staging"]
    silver:
      +materialized: ephemeral
      +tags: ["silver"]
    gold:
      +materialized: table
      +tags: ["gold", "marts"]
```
### Model-Level Only for Unique Requirements
Add `{{ config() }}` ONLY when overriding folder defaults:
```sql
-- Only for incremental-specific settings
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    cluster_by=['order_date']
) }}
```
---
## Common Modeling Patterns
### Handling NULL Values
```sql
-- Use COALESCE for safety
select
    customer_id,
    coalesce(total_orders, 0) as total_orders,
    coalesce(lifetime_value, 0.00) as lifetime_value
from {{ ref('customer_metrics') }}
```
### Window Functions for Ranking
```sql
-- Use QUALIFY in Snowflake for cleaner code
select
    customer_id,
    order_date,
    order_amount,
    row_number() over (partition by customer_id order by order_date desc) as order_rank
from {{ ref('stg_orders') }}
qualify order_rank <= 5  -- Top 5 orders per customer
```
### Conditional Aggregation
```sql
select
    customer_id,
    count(*) as total_orders,
    sum(case when order_status = 'completed' then 1 else 0 end) as completed_orders,
    sum(case when order_status = 'cancelled' then 1 else 0 end) as cancelled_orders
from {{ ref('stg_orders') }}
group by customer_id
```
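
In Snowflake, `count_if()` expresses the same pattern more compactly; a sketch using the same assumed columns:

```sql
select
    customer_id,
    count(*) as total_orders,
    count_if(order_status = 'completed') as completed_orders,
    count_if(order_status = 'cancelled') as cancelled_orders
from {{ ref('stg_orders') }}
group by customer_id
```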
---
## Helping Users with Modeling
### Strategy for Assisting Users
When users ask for modeling help:
1. **Understand the goal**: What business question does this answer?
2. **Identify the layer**: Bronze/silver/gold based on purpose
3. **Recommend structure**: CTEs, column organization, logic flow
4. **Apply naming conventions**: Proper prefixes and column names
5. **Provide complete example**: Working code they can adapt
6. **Suggest tests**: Appropriate constraints and validations
### Common User Questions
**"How do I write this model?"**
- Identify source data (staging models)
- Break logic into CTEs (import → logic → final)
- Apply column naming standards
- Add appropriate tests
**"How do I join these tables?"**
- Use CTE pattern (import CTEs at top)
- Perform joins in logical CTEs
- Select final columns in final CTE
- Use `ref()` for all dbt model references
**"Should this be ephemeral or table?"**
- Ephemeral: Staging, reusable intermediate logic
- Table: Dimensions, complex silver, production marts
- Incremental: Large facts (millions+ rows)
---
## Related Official Documentation
- [dbt Best Practices: How We Structure Our dbt Projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview)
- [dbt Docs: Building Models](https://docs.getdbt.com/docs/build/models)
- [dbt Docs: Jinja & Macros](https://docs.getdbt.com/docs/build/jinja-macros)
---
**Goal**: Transform AI agents into expert dbt modelers who write clean, maintainable,
production-quality SQL that follows industry best practices and is easy for teams to understand and
extend.