
dbt-migration-validation skill

/.claude/skills/dbt-migration-validation

This skill validates dbt migrations by enforcing YAML and SQL rules, detecting anti-patterns, and offering auto-fix suggestions to improve model quality.

npx playbooks add skill sfc-gh-dflippo/snowflake-dbt-demo --skill dbt-migration-validation

---
name: dbt-migration-validation
description:
  Comprehensive validation skill for dbt models and schema YAML files. Defines validation rules,
  common anti-patterns to detect, and auto-fix suggestions. Integrates with Claude Code hooks to
  enforce quality standards during migration.
---

# dbt Migration Validation Skill

## Purpose

Define and enforce validation rules for dbt models during migration. This skill provides
comprehensive validation rules, common anti-patterns to detect, and auto-fix suggestions that are
implemented by the validation hooks.

## When to Use This Skill

Activate this skill when:

- Reviewing dbt models for quality issues
- Diagnosing validation hook failures
- Understanding validation rules and their rationale
- Looking up auto-fix suggestions for common issues
- Configuring validation thresholds

---

## Validation Rules Reference

### Schema YAML Rules

#### YAML001: Model Description Required

**Severity:** Error

**Description:** Every model must have a description in the schema YAML file.

**Rationale:** Descriptions are essential for documentation, lineage understanding, and team
collaboration. They appear in dbt docs and help stakeholders understand data assets.

**Detection:** Model entry exists but `description` field is missing or empty.

**Fix:**

```yaml
## Before
models:
  - name: dim_customers
    columns:
      - name: customer_id

## After
models:
  - name: dim_customers
    description: |
      Customer dimension containing customer attributes, contact info,
      and segmentation. Updated daily from CRM system.

      Source: CRM Database
      Owner: Analytics Team
    columns:
      - name: customer_id
```
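
As an illustration, here is a minimal sketch of how this check could be implemented against a parsed schema file. The function name and CLI wrapper are hypothetical, not the actual hook code:

```python
import sys
import yaml  # PyYAML

def models_missing_description(schema_path: str) -> list[str]:
    """Return names of models whose description is missing or empty."""
    with open(schema_path) as f:
        schema = yaml.safe_load(f) or {}
    return [
        model.get("name", "<unnamed>")
        for model in schema.get("models", [])
        if not (model.get("description") or "").strip()
    ]

if __name__ == "__main__":
    offenders = models_missing_description(sys.argv[1])
    for name in offenders:
        print(f"YAML001: model '{name}' has no description")
    sys.exit(1 if offenders else 0)
```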

---

#### YAML002: Primary Key Test Required

**Severity:** Error

**Description:** Columns that appear to be primary keys (ending in `_id`, `_key`, `_sk`) must have a
`dbt_constraints.primary_key` test.

**Rationale:** Primary key validation ensures data integrity. The `dbt_constraints` package creates
actual database constraints for validation.

**Detection:** Column name matches primary key pattern but lacks required test.

**Primary Key Patterns:**

- `*_id` (e.g., `customer_id`, `order_id`)
- `*_key` (e.g., `surrogate_key`, `natural_key`)
- `*_sk` (e.g., `customer_sk`)
- `id` (exact match)

**Fix:**

```yaml
## Before
columns:
  - name: customer_id
    description: "Unique customer identifier"

## After
columns:
  - name: customer_id
    description: "Unique customer identifier"
    data_type: integer
    tests:
      - dbt_constraints.primary_key
```
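
A rough sketch of the pattern matching this rule describes, assuming the schema YAML has already been parsed into a list of column dictionaries. The names below are illustrative, not the validator's actual API:

```python
import re

# Key-like column names: *_id, *_key, *_sk, or exactly "id".
PK_PATTERN = re.compile(r"(?:^id$|_id$|_key$|_sk$)", re.IGNORECASE)

def pk_columns_missing_test(columns: list[dict]) -> list[str]:
    """Return key-like columns that lack a dbt_constraints.primary_key test."""
    missing = []
    for col in columns:
        name = col.get("name", "")
        if not PK_PATTERN.search(name):
            continue
        tests = col.get("tests") or []
        has_pk_test = any(
            test == "dbt_constraints.primary_key"
            or (isinstance(test, dict) and "dbt_constraints.primary_key" in test)
            for test in tests
        )
        if not has_pk_test:
            missing.append(name)
    return missing
```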

---

#### YAML003: Foreign Key Relationship Test

**Severity:** Warning

**Description:** Columns that appear to be foreign keys should have a `relationships` or
`dbt_constraints.foreign_key` test.

**Rationale:** Relationship tests validate referential integrity between tables.

**Detection:** Column name matches foreign key pattern but lacks relationship test.

**Foreign Key Patterns:**

- `fk_*` (e.g., `fk_customer`)
- `*_fk` (e.g., `customer_fk`)

**Fix:**

```yaml
## Before
columns:
  - name: fk_customer_id
    description: "Reference to customer"

## After
columns:
  - name: fk_customer_id
    description: "Reference to customer"
    tests:
      - relationships:
          to: ref('dim_customers')
          field: customer_id
```

---

#### YAML004: Column Description

**Severity:** Warning

**Description:** All columns should have descriptions.

**Rationale:** Column descriptions improve documentation and make data assets more discoverable.

**Fix:**

```yaml
## Before
columns:
  - name: signup_date

## After
columns:
  - name: signup_date
    description: "Date the customer signed up for service"
```

---

#### YAML005: Model Naming Convention

**Severity:** Error

**Description:** Model names must follow layer-specific naming conventions.

**Conventions by Layer:**

| Layer               | Prefix                          | Pattern                       | Example                      |
| ------------------- | ------------------------------- | ----------------------------- | ---------------------------- |
| Bronze/Staging      | `stg_`                          | `stg_{source}__{table}`       | `stg_sqlserver__customers`   |
| Silver/Intermediate | `int_`, `lookup_`               | `int_{entity}__{description}` | `int_customers__with_orders` |
| Gold/Mart           | `dim_`, `fct_`, `mart_`, `agg_` | `dim_{entity}`                | `dim_customers`              |

**Fix:**

```yaml
## Before (in gold layer)
models:
  - name: customers

## After
models:
  - name: dim_customers
```
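
One way the layer check might be expressed, assuming the layer is inferred from the model's directory. The mapping below simply restates the table above and is not the hook's actual configuration:

```python
# Allowed prefixes per layer, mirroring the table above.
LAYER_PREFIXES = {
    "bronze": ("stg_",),
    "silver": ("int_", "lookup_"),
    "gold": ("dim_", "fct_", "mart_", "agg_"),
}

def violates_naming(layer: str, model_name: str) -> bool:
    """True when a model name lacks an allowed prefix for its layer."""
    allowed = LAYER_PREFIXES.get(layer.lower(), ())
    return bool(allowed) and not model_name.startswith(allowed)

# Example: violates_naming("gold", "customers")     -> True
# Example: violates_naming("gold", "dim_customers") -> False
```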

---

#### YAML006: Column Data Type

**Severity:** Warning

**Description:** All columns should have `data_type` specified.

**Rationale:** Explicit data types improve documentation and enable contract testing.

**Fix:**

```yaml
## Before
columns:
  - name: amount
    description: "Order amount"

## After
columns:
  - name: amount
    description: "Order amount in USD"
    data_type: number(18,2)
```

---

### SQL Model Rules

#### SQL001: Config Block

**Severity:** Warning

**Description:** Models should have a config block specifying materialization.

**Rationale:** Explicit configuration makes materialization strategy clear and enables per-model
customization.

**Fix:**

```sql
-- Before
select * from {{ ref('stg_customers') }}

-- After
{{ config(
    materialized='table',
    tags=['gold', 'customer']
) }}

select * from {{ ref('stg_customers') }}
```

---

#### SQL002: CTE Pattern

**Severity:** Warning

**Description:** Models should use the standard CTE pattern for readability.

**Pattern:**

1. Import CTEs - Reference source data
2. Logical CTEs - Apply transformations
3. Final CTE - Prepare output
4. Final SELECT from final CTE

**Fix:**

```sql
-- Before
select
    c.customer_id,
    c.customer_name,
    count(o.order_id) as order_count
from {{ ref('stg_customers') }} c
left join {{ ref('stg_orders') }} o on c.customer_id = o.customer_id
group by 1, 2

-- After
with customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

customer_orders as (
    select
        c.customer_id,
        c.customer_name,
        count(o.order_id) as order_count
    from customers c
    left join orders o on c.customer_id = o.customer_id
    group by c.customer_id, c.customer_name
),

final as (
    select
        customer_id,
        customer_name,
        order_count
    from customer_orders
)

select * from final
```

---

#### SQL003: No SELECT \* in Final Output

**Severity:** Error

**Description:** The final query should explicitly list columns, not use `SELECT *`.

**Rationale:** Explicit columns make the contract clear, prevent accidental exposure of new columns,
and improve query performance.

**Exception:** `SELECT * FROM final` is acceptable when the `final` CTE explicitly lists its columns.

**Fix:**

```sql
-- Before (problematic)
select * from {{ ref('stg_customers') }}

-- After
select
    customer_id,
    customer_name,
    email,
    signup_date
from {{ ref('stg_customers') }}
```
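
A simplified sketch of how the final-SELECT check could work, including the `final` CTE exception. Treating the last `SELECT *` in the file as the model's output query is an assumption made for illustration:

```python
import re

SELECT_STAR = re.compile(r"select\s+\*\s+from\s+(\S+)", re.IGNORECASE)

def flags_select_star(sql: str) -> bool:
    """Flag SELECT * in the final output unless it reads from the `final` CTE."""
    matches = list(SELECT_STAR.finditer(sql))
    if not matches:
        return False
    # Assumption: the last SELECT * in the file is the model's output query.
    return matches[-1].group(1).lower() != "final"
```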

---

#### SQL004: Use ref() and source()

**Severity:** Error

**Description:** All table references must use `{{ ref() }}` or `{{ source() }}`, not hardcoded
table names.

**Rationale:** Using ref() and source() enables:

- Automatic dependency tracking
- Environment-aware table resolution
- Proper lineage documentation

**Detection:** Pattern `FROM schema.table` or `JOIN database.schema.table` without Jinja braces.

**Fix:**

```sql
-- Before
select * from raw_data.customers

-- After (for source tables)
select * from {{ source('raw_data', 'customers') }}

-- After (for dbt models)
select * from {{ ref('stg_customers') }}
```
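
A sketch of the kind of heuristic this detection implies; the real validator's regex may be more sophisticated (handling quoted identifiers, comments, and so on):

```python
import re

# FROM/JOIN followed by a dotted identifier that is not a Jinja expression.
HARDCODED_REF = re.compile(
    r"\b(?:from|join)\s+(?!\{\{)(?:[a-zA-Z_]\w*\.){1,2}[a-zA-Z_]\w*",
    re.IGNORECASE,
)

def hardcoded_references(sql: str) -> list[str]:
    """Return raw table references that bypass ref() and source()."""
    return [match.group(0) for match in HARDCODED_REF.finditer(sql)]

# Example: hardcoded_references("select * from raw_data.customers")
# -> ["from raw_data.customers"]
```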

---

#### SQL005: Migration Header Comment

**Severity:** Warning

**Description:** Migrated models should include a header comment documenting the original source and
conversion notes.

**Detection:** Model contains migration indicators but lacks proper header.

**Required Header Elements:**

- Original object name and schema
- Source platform
- Migration date
- Conversion notes
- Breaking changes (if any)

**Template:**

```sql
/* Original Object: {schema}.{object_name}
   Source Platform: {SQL Server|Oracle|Teradata|etc.}
   Original Type: {Stored Procedure|View|Function}
   Migration Date: YYYY-MM-DD

   Conversion Notes:
   - Replaced ISNULL() with COALESCE()
   - Converted TOP to LIMIT
   - Removed NOLOCK hints

   Breaking Changes:
   - Output column 'old_name' renamed to 'new_name'
*/

{{ config(materialized='table') }}

-- Model implementation
```

---

#### SQL006: Snowflake-Incompatible Syntax

**Severity:** Error

**Description:** Model contains SQL syntax that is not compatible with Snowflake.

**Common Issues:**

| Pattern              | Platform   | Snowflake Equivalent     |
| -------------------- | ---------- | ------------------------ |
| `TOP N`              | SQL Server | `LIMIT N`                |
| `ISNULL(a, b)`       | SQL Server | `COALESCE(a, b)`         |
| `GETDATE()`          | SQL Server | `CURRENT_TIMESTAMP()`    |
| `LEN(s)`             | SQL Server | `LENGTH(s)`              |
| `CHARINDEX(a, b)`    | SQL Server | `POSITION(a IN b)`       |
| `CONVERT(type, val)` | SQL Server | `CAST(val AS type)`      |
| `WITH (NOLOCK)`      | SQL Server | (remove)                 |
| `@@ROWCOUNT`         | SQL Server | (use different approach) |
| `ROWNUM`             | Oracle     | `ROW_NUMBER() OVER()`    |
| `DECODE(...)`        | Oracle     | `CASE WHEN...`           |
| `CONNECT BY`         | Oracle     | Recursive CTE            |
| `SYSDATE`            | Oracle     | `CURRENT_TIMESTAMP()`    |
| `SEL`                | Teradata   | `SELECT`                 |
| Backticks            | MySQL      | Double quotes            |

**Fix Examples:**

```sql
-- SQL Server Before
SELECT TOP 100 *
FROM customers WITH (NOLOCK)
WHERE ISNULL(email, '') = ''
  AND GETDATE() > signup_date

-- Snowflake After
SELECT *
FROM customers
WHERE COALESCE(email, '') = ''
  AND CURRENT_TIMESTAMP() > signup_date
LIMIT 100
```
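
A trimmed-down sketch of how such checks can be expressed as pattern-to-suggestion pairs. Only a few of the table's rows are shown, and the patterns and messages are illustrative rather than the hook's actual rule set:

```python
import re

# Representative subset of the table above.
INCOMPATIBLE_SYNTAX = {
    r"\bselect\s+top\s+\d+": "SQL Server TOP N: use LIMIT N at the end of the query",
    r"\bisnull\s*\(": "SQL Server ISNULL(a, b): use COALESCE(a, b)",
    r"\bgetdate\s*\(\s*\)": "SQL Server GETDATE(): use CURRENT_TIMESTAMP()",
    r"\bwith\s*\(\s*nolock\s*\)": "SQL Server NOLOCK hint: remove it",
    r"\bconnect\s+by\b": "Oracle CONNECT BY: rewrite as a recursive CTE",
}

def incompatibility_findings(sql: str) -> list[str]:
    """Return suggestions for Snowflake-incompatible syntax found in the model."""
    return [
        suggestion
        for pattern, suggestion in INCOMPATIBLE_SYNTAX.items()
        if re.search(pattern, sql, re.IGNORECASE)
    ]
```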

---

## Anti-Patterns to Detect

### Common Anti-Patterns

#### 1. Direct Source References

**Problem:** Using hardcoded table names instead of `source()` or `ref()`.

**Impact:** Breaks dependency tracking, environment portability, and lineage.

#### 2. Generic Column Names

**Problem:** Columns named `col1`, `field1`, `temp`, etc.

**Impact:** Poor documentation, confusing for consumers.
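
For example, a simple name check along these lines could surface generic column names (the pattern list is illustrative):

```python
import re

# Names that say nothing about the data they hold.
GENERIC_NAME = re.compile(r"^(?:col|column|field|var|value|temp|tmp)\d*$", re.IGNORECASE)

def generic_columns(column_names: list[str]) -> list[str]:
    """Return column names that look like placeholders rather than real attributes."""
    return [name for name in column_names if GENERIC_NAME.match(name)]
```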

#### 3. Mixed Naming Conventions

**Problem:** Inconsistent casing or naming patterns within a model.

**Impact:** Confusion, maintenance difficulty.

#### 4. Missing Tests on Key Columns

**Problem:** Primary/foreign keys without uniqueness or relationship tests.

**Impact:** Data quality issues may go undetected.

#### 5. Overly Complex Models

**Problem:** Models with excessive CTEs, complex logic, or doing too much.

**Impact:** Hard to maintain, test, and understand.

#### 6. Platform-Specific Syntax

**Problem:** SQL syntax from source database that won't work in Snowflake.

**Impact:** Runtime errors, compilation failures.

---

## Hook Integration

### Validation Hook Configuration

Hooks are configured in `.claude/settings.local.json`:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write",
        "hooks": [
          {
            "type": "command",
            "command": "python3 .claude/hooks/dbt-validation/validate_file.py \"$FILE_PATH\"",
            "timeout": 30000
          }
        ]
      },
      {
        "matcher": "Edit",
        "hooks": [
          {
            "type": "command",
            "command": "python3 .claude/hooks/dbt-validation/validate_file.py \"$FILE_PATH\"",
            "timeout": 30000
          }
        ]
      }
    ]
  }
}
```

### Exit Codes

- **Exit 0**: Validation passed (or file not in scope)
- **Exit 1**: Validation failed with errors

Warnings are reported but don't cause exit code 1.

### File Scope

Validation runs only on files matching:

- `models/**/_models.yml` - Schema YAML validation
- `models/**/_sources.yml` - Source YAML validation
- `models/**/*.sql` - SQL model validation

Other files are skipped (exit 0).
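
A simplified sketch of how an entry point like `validate_file.py` might apply this scope and map findings to exit codes. The placeholder validator and the path logic are assumptions for illustration, not the real hook implementation:

```python
import sys
from pathlib import Path

def scope_for(path: Path) -> str | None:
    """Classify a file as 'yaml', 'sql', or out of scope (None)."""
    if "models" not in path.parts:
        return None
    if path.name in ("_models.yml", "_sources.yml"):
        return "yaml"
    if path.suffix == ".sql":
        return "sql"
    return None

def run_validators(path: Path, scope: str) -> tuple[list[str], list[str]]:
    """Placeholder standing in for the YAML/SQL rule modules; returns (errors, warnings)."""
    return [], []

def main(file_path: str) -> int:
    path = Path(file_path)
    scope = scope_for(path)
    if scope is None:
        return 0  # file not in scope: skip silently
    errors, warnings = run_validators(path, scope)
    for message in warnings:
        print(f"WARNING: {message}")
    for message in errors:
        print(f"ERROR: {message}", file=sys.stderr)
    return 1 if errors else 0  # warnings never fail the hook

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```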

---

## Validation Scripts

Validation is implemented in `.claude/hooks/dbt-validation/`:

| Script                        | Purpose                                      |
| ----------------------------- | -------------------------------------------- |
| `validate_file.py`            | Entry point, routes to appropriate validator |
| `validate_schema_yaml.py`     | YAML rule validation                         |
| `validate_dbt_model.py`       | SQL rule validation                          |
| `check_migration_status.py`   | Cross-validation and reporting               |
| `rules/naming_conventions.py` | Naming convention checks                     |
| `rules/cte_patterns.py`       | CTE structure validation                     |
| `rules/snowflake_syntax.py`   | Platform syntax detection                    |

---

## Related Skills

- $dbt-migration - Migration workflow
- $dbt-testing - Test strategies
- $dbt-architecture - Naming conventions
- $dbt-modeling - CTE patterns

## Overview

This skill provides comprehensive validation for dbt models and schema YAML files during migrations. It enforces validation rules, detects common anti-patterns, and offers auto-fix suggestions to keep models Snowflake-compatible and production-ready. It integrates with validation hooks to run checks automatically on edits and writes.

## How this skill works

The skill inspects schema YAML and SQL model files to apply a ruleset for documentation, naming, tests, CTE structure, and Snowflake compatibility. It detects missing descriptions, absent primary/foreign key tests, SELECT * usage, hardcoded table references, and platform-specific syntax, then recommends concrete fixes. Validation runs via configured hooks and returns non-zero exit codes for errors while reporting warnings.

## When to use it

- During model migration from other platforms to Snowflake, to catch incompatible syntax and breaking changes.
- When reviewing pull requests to ensure models follow naming, testing, and documentation standards.
- While diagnosing validation hook failures triggered on file edits or writes.
- When configuring or enforcing team validation thresholds and automated checks.
- To generate targeted auto-fix suggestions for common dbt anti-patterns.

## Best practices

- Always include model and column descriptions in schema YAML to improve docs and ownership.
- Declare `dbt_constraints.primary_key` and relationship tests for key-like column names.
- Use `ref()` and `source()` for all table references to preserve lineage and environment portability.
- List explicit columns in final SELECTs; avoid `SELECT *` except from a `final` CTE that lists columns.
- Follow layer-specific naming prefixes (`stg_`, `int_`, `dim_`, `fct_`) to make model intent clear.
- Prefer standard CTE patterns and a config block with materialization for readability and consistency.

## Example use cases

- Validate a migrated SQL model to replace `TOP`/`ISNULL`/`GETDATE()` with Snowflake equivalents and add a migration header.
- Scan schema YAML to add missing model descriptions and `data_type` declarations before releasing documentation.
- Fail CI on critical errors like missing primary key tests or hardcoded table references.
- Automatically run validators on file save to get immediate auto-fix suggestions for CTE structure or naming issues.
- Generate a remediation checklist for large migration batches highlighting platform-specific fixes and breaking changes.

## FAQ

**What causes validation to return a non-zero exit code?**

Errors such as missing model descriptions, missing primary key tests, `SELECT *` in final output, hardcoded table references, or Snowflake-incompatible syntax will cause a non-zero exit code.

**Are warnings treated the same as errors?**

No. Warnings (for example, missing column descriptions or absent config blocks) are reported but do not cause the process to fail.