
troubleshooting-dbt-job-errors skill

This skill systematically diagnoses and resolves dbt Cloud job failures by analyzing run data, errors, and logs to identify root causes.

npx playbooks add skill dbt-labs/dbt-agent-skills --skill troubleshooting-dbt-job-errors

---
name: troubleshooting-dbt-job-errors
description: Use when a dbt Cloud/platform job fails and you need to diagnose the root cause, especially when error messages are unclear or when intermittent failures occur. Do not use for local dbt development errors.
user-invocable: false
metadata:
  author: dbt-labs
---

# Troubleshooting dbt Job Errors

Systematically diagnose and resolve dbt Cloud job failures using available MCP tools, CLI commands, and data investigation.

## When to Use

- dbt Cloud / dbt platform job failed and you need to find the root cause
- Intermittent job failures that are hard to reproduce
- Error messages that don't clearly indicate the problem
- Post-merge failures where a recent change may have caused the issue

**Not for:** Local dbt development errors - use the skill `using-dbt-for-analytics-engineering` instead

## The Iron Rule

**Never modify a test to make it pass without understanding why it's failing.**

A failing test is evidence of a problem. Changing the test to pass hides the problem. Investigate the root cause first.

## Rationalizations That Mean STOP

| You're Thinking... | Reality |
|-------------------|---------|
| "Just make the test pass" | The test is telling you something is wrong. Investigate first. |
| "There's a board meeting in 2 hours" | Rushing to a fix without diagnosis creates bigger problems. |
| "We've already spent 2 days on this" | Sunk cost doesn't justify skipping proper diagnosis. |
| "I'll just update the accepted values" | Are the new values valid business data or bugs? Verify first. |
| "It's probably just a flaky test" | "Flaky" means there's an underlying issue. Find it. We don't allow flaky tests to stay. |

## Workflow

```mermaid
flowchart TD
    A[Job failure reported] --> B{MCP Admin API available?}
    B -->|yes| C[Use list_jobs_runs to get history]
    B -->|no| D[Ask user for logs and run_results.json]
    C --> E[Use get_job_run_error for details]
    D --> F[Classify error type]
    E --> F
    F --> G{Error type?}
    G -->|Infrastructure| H[Check warehouse, connections, timeouts]
    G -->|Code/Compilation| I[Check git history for recent changes]
    G -->|Data/Test Failure| J[Use discovering-data skill to investigate]
    H --> K{Root cause found?}
    I --> K
    J --> K
    K -->|yes| L[Create branch, implement fix]
    K -->|no| M[Create findings document]
    L --> N[Add test - prefer unit test]
    N --> O[Create PR with explanation]
    M --> P[Document what was checked and next steps]
```

## Step 1: Gather Job Run Information

### If dbt MCP Server Admin API Available

Use these tools first - they provide the most comprehensive data:

| Tool | Purpose |
|------|---------|
| `list_jobs_runs` | Get recent run history, identify patterns |
| `get_job_run_error` | Get detailed error message and context |

```
# Example: Get recent runs for job 12345
list_jobs_runs(job_id=12345, limit=10)

# Example: Get error details for specific run
get_job_run_error(run_id=67890)
```

### Without MCP Admin API

**Ask the user to provide these artifacts:**

1. **Job run logs** from dbt Cloud UI (Debug logs preferred)
2. **`run_results.json`** - contains execution status for each node

To get the `run_results.json`, generate the artifact URL for the user:
```
https://<DBT_ENDPOINT>/api/v2/accounts/<ACCOUNT_ID>/runs/<RUN_ID>/artifacts/run_results.json?step=<STEP_NUMBER>
```

Where:
- `<DBT_ENDPOINT>` - The dbt Cloud endpoint, e.g.:
  - `cloud.getdbt.com` for the US multi-tenant platform (there are other endpoints for other regions)
  - `ACCOUNT_PREFIX.us1.dbt.com` for the cell-based platforms (there are different cell endpoints for different regions and cloud providers)
- `<ACCOUNT_ID>` - The dbt Cloud account ID
- `<RUN_ID>` - The failed job run ID
- `<STEP_NUMBER>` - The step that failed (e.g., if step 4 failed, use `?step=4`)

Example request:
> "I don't have access to the dbt MCP server. Could you provide:
> 1. The debug logs from dbt Cloud (Job Run → Logs → Download)
> 2. The run_results.json - open this URL and copy/paste or upload the contents:
>    `https://cloud.getdbt.com/api/v2/accounts/12345/runs/67890/artifacts/run_results.json?step=4`"
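
Once the user provides `run_results.json`, the failed nodes can be pulled out quickly. A minimal sketch using `jq` (the sample file below is illustrative; in practice, operate on the downloaded artifact):

```shell
# Illustrative run_results.json -- in practice, use the artifact the user downloaded
cat > run_results.json <<'EOF'
{"results": [
  {"status": "success", "unique_id": "model.jaffle_shop.orders", "message": null},
  {"status": "fail", "unique_id": "test.jaffle_shop.unique_orders_order_id", "message": "Got 3 results, configured to fail if != 0"}
]}
EOF

# List only the failed/errored nodes with their messages (requires jq)
jq -r '.results[]
       | select(.status == "error" or .status == "fail")
       | "\(.status)\t\(.unique_id)\t\(.message // "")"' run_results.json
```

This narrows the investigation to the specific nodes that failed before reading the full debug log.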

## Step 2: Classify the Error

| Error Type | Indicators | Primary Investigation |
|------------|-----------|----------------------|
| **Infrastructure** | Connection timeout, warehouse error, permissions | Check warehouse status, connection settings |
| **Code/Compilation** | Undefined macro, syntax error, parsing error | Check git history for recent changes, use LSP tools |
| **Data/Test Failure** | Test failed with N results, schema mismatch | Use `discovering-data` skill to query actual data |
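
When triaging an error message by hand, the classification above can be approximated with keyword heuristics. A rough sketch (the patterns are assumptions covering common cases, not an exhaustive list; always read the full log before acting):

```shell
# Sketch: classify a dbt error message into the three categories above.
# The keyword lists are heuristics, not an official taxonomy.
classify_dbt_error() {
  case "$(echo "$1" | tr '[:upper:]' '[:lower:]')" in
    *timeout*|*connection*|*permission*)        echo "infrastructure" ;;
    *"compilation error"*|*undefined*|*syntax*) echo "code/compilation" ;;
    *"got "*" results"*|*"schema mismatch"*)    echo "data/test" ;;
    *)                                          echo "unknown" ;;
  esac
}

classify_dbt_error "Database Error: connection timeout after 300s"  # prints "infrastructure"
classify_dbt_error "Compilation Error: 'my_macro' is undefined"     # prints "code/compilation"
```

An "unknown" result is a signal to gather more context, not to guess a category.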

## Step 3: Investigate Root Cause

### For Infrastructure Errors

1. Check job configuration (timeout settings, execution steps, etc.)
2. Look for concurrent jobs competing for resources
3. Check if failures correlate with time of day or data volume

### For Code/Compilation Errors

1. **Check git history for recent changes:**

   If you're not in the dbt project directory, use the dbt MCP server to find the repository:
   ```
   # Get project details including repository URL and project subdirectory
   get_project_details(project_id=<project_id>)
   ```

   The response includes:
   - `repository` - The git repository URL
   - `dbt_project_subdirectory` - Optional subfolder where the dbt project lives (e.g., `dbt/`, `transform/analytics/`)

   Then either:
   - Query the repository directly using `gh` CLI if it's on GitHub
   - Clone to a temporary folder: `git clone <repo_url> /tmp/dbt-investigation`

   **Important:** If the project is in a subfolder, navigate to it after cloning:
   ```bash
   cd /tmp/dbt-investigation/<project_subdirectory>
   ```

   Once in the project directory:
   ```bash
   git log --oneline -20
   git diff HEAD~5..HEAD -- models/ macros/
   ```

2. **Check for errors with the dbt MCP server's CLI and LSP tools, or with the dbt CLI directly:**

   If the dbt MCP server is available, use its tools:
   ```
   # CLI tools
   mcp__dbt_parse()                              # Check for parsing errors
   mcp__dbt_list_models()                        # With selectors and `+` to find model dependencies
   mcp__dbt_compile(models="failing_model")      # Check compilation
   
   # LSP tools
   mcp__dbt_get_column_lineage()                 # Check column lineage
   ```

   Otherwise, use the dbt CLI directly:
   ```bash
   dbt parse                            # Check for parsing errors
   dbt list --select +failing_model     # List the failing model and its upstream models
   dbt compile --select failing_model   # Check compilation
   ```

3. **Search for the error pattern:**
   - Find where the undefined macro/model should be defined
   - Check if a file was deleted or renamed
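
For example, when a macro is reported as undefined, search for its definition and then check whether the defining file was deleted or renamed. A sketch with a hypothetical macro name (`my_macro`) and an illustrative layout:

```shell
# Illustrative layout -- in a real investigation you'd run this inside the cloned repo
mkdir -p macros
printf '{%% macro my_macro() %%} select 1 {%% endmacro %%}\n' > macros/my_macro.sql

# Find where the macro is defined; no hits means it was deleted or renamed
grep -rn "macro my_macro" macros/

# In a git repo, list commits that deleted (D) or renamed (R) macro files:
#   git log --oneline --diff-filter=DR -- macros/
```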

### For Data/Test Failures

**Use the `discovering-data` skill to investigate the actual data.**

1. **Get the test SQL**
   ```bash
   dbt compile --select project_name.folder1.folder2.test_unique_name --output json
   ```
   The full path for the test can be found with `dbt ls --resource-type test`.


2. **Query the failing test's underlying data:**
   ```bash
   dbt show --inline "<query_from_the_test_SQL>" --output json
   ```


3. **Compare to recent git changes:**
   - Did a transformation change introduce new values?
   - Did upstream source data change?

## Step 4: Resolution

### If Root Cause Is Found

1. **Create a new branch:**
   ```bash
   git checkout -b fix/job-failure-<description>
   ```

2. **Implement the fix** addressing the actual root cause

3. **Add a test to prevent recurrence:**
   - **Prefer unit tests** for logic issues
   - Use data tests for data quality issues
   - Example unit test for transformation logic:
   ```yaml
   unit_tests:
     - name: test_status_mapping
       model: orders
       given:
         - input: ref('stg_orders')
           rows:
             - {status_code: 1}
             - {status_code: 2}
       expect:
         rows:
           - {status: 'pending'}
           - {status: 'shipped'}
   ```

4. **Create a PR** with:
   - Description of the issue
   - Root cause analysis
   - How the fix resolves it
   - Test coverage added

### If Root Cause Is NOT Found

**Do not guess. Create a findings document.**

Create `docs/investigations/job-failure-<date>.md`:

```markdown
# Job Failure Investigation: <Job Name>

**Date:** YYYY-MM-DD
**Job ID:** <id>
**Status:** Unresolved

## Summary
Brief description of the failure and symptoms.

## What Was Checked

### Tools Used
- [ ] list_jobs_runs - findings
- [ ] get_job_run_error - findings
- [ ] git history - findings
- [ ] Data investigation - findings

### Hypotheses Tested
| Hypothesis | Evidence | Result |
|------------|----------|--------|
| Recent code change | No changes to affected models in 7 days | Ruled out |

## Patterns Observed
- Failures occur between 2-4 AM (peak load time?)
- Always fails on model X

## Suggested Next Steps
1. [ ] Check the data ingestion process to see if new data was added
2. [ ] Check if a new version of dbt or of the dbt adapter was released

## Related Resources
- Link to job run logs
- Link to relevant documentation
```

Commit this document to the repository so findings aren't lost.
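
Scaffolding the findings document can be scripted so the naming convention stays consistent. A small sketch (the job name below is a placeholder):

```shell
# Scaffold a date-stamped findings document (job name is a placeholder)
mkdir -p docs/investigations
doc="docs/investigations/job-failure-$(date +%F).md"
{
  printf '# Job Failure Investigation: %s\n\n' "daily_build"
  printf '**Date:** %s\n' "$(date +%F)"
  printf '**Job ID:** <id>\n'
  printf '**Status:** Unresolved\n'
} > "$doc"
echo "Created $doc"
```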

## Quick Reference

| Task | Tool/Command |
|------|--------------|
| Get job run history | `list_jobs_runs` (MCP) |
| Get detailed error | `get_job_run_error` (MCP) |
| Check recent git changes | `git log --oneline -20` |
| Parse project | `dbt parse` |
| Compile specific model | `dbt compile --select model_name` |
| Query data | `dbt show --inline "SELECT ..." --output json` |
| Run specific test | `dbt test --select test_name` |

## Common Mistakes

**Modifying tests to pass without investigation**
- A failing test is a signal, not an obstacle. Understand WHY before changing anything.

**Skipping git history review**
- Most failures correlate with recent changes. Always check what changed.

**Not documenting when unresolved**
- "I couldn't figure it out" leaves no trail. Document what was checked and what remains.

**Making best-guess fixes under pressure**
- A wrong fix creates more problems. Take time to diagnose properly.

**Ignoring data investigation for test failures**
- Test failures often reveal data issues. Query the actual data before assuming code is wrong.
