
tracing-downstream-lineage skill


This skill analyzes downstream data impact and identifies the blast radius before a change, listing the affected tables, DAGs, dashboards, and owners.

npx playbooks add skill astronomer/agents --skill tracing-downstream-lineage

SKILL.md
---
name: tracing-downstream-lineage
description: Trace downstream data lineage and perform impact analysis. Use when the user asks what depends on this data, what breaks if something changes, downstream dependencies, or needs to assess change risk before modifying a table or DAG.
---

# Downstream Lineage: Impacts

Answer the critical question: "What breaks if I change this?"

Use this BEFORE making changes to understand the blast radius.

## Impact Analysis

### Step 1: Identify Direct Consumers

Find everything that reads from this target:

**For Tables:**

1. **Search DAG source code**: Look for DAGs that SELECT from this table
   - Use `af dags list` to get all DAGs
   - Use `af dags source <dag_id>` to search for table references
   - Look for: `FROM target_table`, `JOIN target_table`

2. **Check for dependent views**:
   ```sql
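   -- Engines that expose information_schema.view_table_usage (e.g. PostgreSQL, SQL Server)
   SELECT * FROM information_schema.view_table_usage
   WHERE table_name = '<target_table>';

   -- Snowflake: query SNOWFLAKE.ACCOUNT_USAGE.OBJECT_DEPENDENCIES,
   -- or check SHOW VIEWS and search definitions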
   ```

3. **Look for BI tool connections**:
   - Dashboards often query tables directly
   - Check for common BI patterns in table naming (`rpt_`, `dashboard_`)
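
The source-code search in item 1 can be sketched in Python. This is a minimal sketch, assuming the DAG sources have already been fetched (for example with `af dags source <dag_id>`) into a dict keyed by DAG ID. `find_table_references` and the sample DAG IDs are hypothetical names used for illustration only.

```python
import re


def find_table_references(dag_sources: dict[str, str], table: str) -> dict[str, list[str]]:
    """Return {dag_id: [matching lines]} for DAGs whose source reads from `table`.

    Matches `FROM <table>` and `JOIN <table>` case-insensitively, line by line.
    This is a text heuristic: it misses dynamically built SQL, templated table
    names, and statements where the keyword and table sit on different lines.
    """
    pattern = re.compile(rf"\b(?:FROM|JOIN)\s+{re.escape(table)}\b", re.IGNORECASE)
    hits: dict[str, list[str]] = {}
    for dag_id, source in dag_sources.items():
        lines = [line.strip() for line in source.splitlines() if pattern.search(line)]
        if lines:
            hits[dag_id] = lines
    return hits


# Hypothetical DAG sources keyed by DAG ID.
sources = {
    "transform_daily_sales": "sql = 'SELECT * FROM fct.orders o JOIN dim.customers c ON ...'",
    "load_customers": "sql = 'INSERT INTO dim.customers SELECT * FROM raw.customers'",
}
references = find_table_references(sources, "fct.orders")
print(references)
```

The trailing word boundary keeps `fct.orders` from matching `fct.orders_archive`, which would otherwise inflate the consumer list.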

**For DAGs:**

1. **Check what the DAG produces**: Use `af dags source <dag_id>` to find output tables
2. **Then trace those tables' consumers** (recursive)

### Step 2: Build Dependency Tree

Map the full downstream impact:

```
SOURCE: fct.orders
    |
    +-- TABLE: agg.daily_sales --> Dashboard: Executive KPIs
    |       |
    |       +-- TABLE: rpt.monthly_summary --> Email: Monthly Report
    |
    +-- TABLE: ml.order_features --> Model: Demand Forecasting
    |
    +-- DIRECT: Looker Dashboard "Sales Overview"
```
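
The recursive trace behind a tree like this can be sketched as a breadth-first walk. This is an illustrative sketch, assuming the direct consumers found in Step 1 have been collected into a dict. `trace_downstream` and the `consumers` mapping are hypothetical names, and the edges mirror the example tree above.

```python
from collections import deque


def trace_downstream(consumers: dict[str, list[str]], source: str) -> list[tuple[str, int]]:
    """Breadth-first walk returning (asset, depth) for everything downstream of `source`.

    A visited set guards against cycles and avoids listing an asset twice.
    """
    seen = {source}
    order: list[tuple[str, int]] = []
    queue = deque([(source, 0)])
    while queue:
        asset, depth = queue.popleft()
        for child in consumers.get(asset, []):
            if child not in seen:
                seen.add(child)
                order.append((child, depth + 1))
                queue.append((child, depth + 1))
    return order


# Edges taken from the example tree above: asset -> direct consumers.
consumers = {
    "fct.orders": ["agg.daily_sales", "ml.order_features", "Looker: Sales Overview"],
    "agg.daily_sales": ["Dashboard: Executive KPIs", "rpt.monthly_summary"],
    "rpt.monthly_summary": ["Email: Monthly Report"],
    "ml.order_features": ["Model: Demand Forecasting"],
}
downstream = trace_downstream(consumers, "fct.orders")
for asset, depth in downstream:
    print(f"depth {depth}: {asset}")
```

Recording the depth makes it easy to separate direct consumers (depth 1) from transitive ones when prioritizing review.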

### Step 3: Categorize by Criticality

**Critical** (breaks production):
- Production dashboards
- Customer-facing applications
- Automated reports to executives
- ML models in production
- Regulatory/compliance reports

**High** (causes significant issues):
- Internal operational dashboards
- Analyst workflows
- Data science experiments
- Downstream ETL jobs

**Medium** (inconvenient):
- Ad-hoc analysis tables
- Development/staging copies
- Historical archives

**Low** (minimal impact):
- Deprecated tables
- Unused datasets
- Test data
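
One way to apply these tiers consistently is a lookup from asset kind to tier, followed by a sort. This is a sketch only: the kind labels in `KIND_TO_TIER` are hypothetical, and defaulting unknown kinds to High is an assumption made here so that unclassified assets get reviewed rather than ignored.

```python
# Tier order, most severe first.
TIERS = ["Critical", "High", "Medium", "Low"]

# Hypothetical asset-kind labels mapped to the tiers listed above.
KIND_TO_TIER = {
    "production_dashboard": "Critical",
    "customer_facing_app": "Critical",
    "production_ml_model": "Critical",
    "compliance_report": "Critical",
    "operational_dashboard": "High",
    "downstream_etl": "High",
    "adhoc_table": "Medium",
    "staging_copy": "Medium",
    "deprecated_table": "Low",
    "test_data": "Low",
}


def rank_assets(assets: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (name, tier) pairs sorted most-critical first.

    Unknown kinds default to "High" (an assumption of this sketch).
    """
    tiered = [(name, KIND_TO_TIER.get(kind, "High")) for name, kind in assets]
    return sorted(tiered, key=lambda pair: TIERS.index(pair[1]))


# Hypothetical downstream assets as (name, kind) pairs.
assets = [
    ("tmp.orders_scratch", "adhoc_table"),
    ("Executive KPIs", "production_dashboard"),
    ("transform_daily_sales", "downstream_etl"),
]
ranked = rank_assets(assets)
for name, tier in ranked:
    print(f"{tier}: {name}")
```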

### Step 4: Assess Change Risk

For the proposed change, evaluate:

**Schema Changes** (adding/removing/renaming columns):
- Which downstream queries will break?
- Are there SELECT * patterns that will pick up new columns?
- Which transformations reference the changing columns?

**Data Changes** (values, volumes, timing):
- Will downstream aggregations still be valid?
- Are there NULL handling assumptions that will break?
- Will timing changes affect SLAs?

**Deletion/Deprecation**:
- Full dependency tree must be migrated first
- Communication needed for all stakeholders
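
The schema-change questions above can be approximated with a text scan over downstream queries. This is a rough sketch, assuming the SQL of each downstream asset is available as a string. `assess_column_change` and the sample queries are hypothetical, and plain text matching cannot tell which table a column belongs to.

```python
import re


def assess_column_change(queries: dict[str, str], column: str) -> dict[str, list[str]]:
    """Flag downstream queries likely affected by renaming or removing `column`.

    Returns query names under "references_column" and "select_star".
    `SELECT *` and `SELECT alias.*` are detected; `COUNT(*)` is not flagged.
    """
    column_re = re.compile(rf"\b{re.escape(column)}\b", re.IGNORECASE)
    star_re = re.compile(r"\bSELECT\s+(?:\w+\.)?\*", re.IGNORECASE)
    report: dict[str, list[str]] = {"references_column": [], "select_star": []}
    for name, sql in queries.items():
        if column_re.search(sql):
            report["references_column"].append(name)
        if star_re.search(sql):
            report["select_star"].append(name)
    return report


# Hypothetical downstream queries keyed by the asset they build.
queries = {
    "agg.daily_sales": "SELECT order_date, SUM(order_total) FROM fct.orders GROUP BY order_date",
    "ml.order_features": "SELECT * FROM fct.orders",
    "rpt.order_counts": "SELECT COUNT(*) FROM fct.orders",
}
report = assess_column_change(queries, "order_total")
print(report)
```

Queries under `references_column` need an explicit update for a rename or removal, while those under `select_star` silently change shape when columns are added or dropped.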

### Step 5: Find Stakeholders

Identify who owns downstream assets:

1. **DAG owners**: Check the `owner` set in each DAG's `default_args` (shown as `owners` in DAG metadata)
2. **Dashboard owners**: Usually in BI tool metadata
3. **Team ownership**: Look for team naming patterns or documentation

## Output: Impact Report

### Summary
"Changing `fct.orders` will impact X tables, Y DAGs, and Z dashboards"

### Impact Diagram
```
                    +--> [agg.daily_sales] --> [Executive Dashboard]
                    |
[fct.orders] -------+--> [rpt.order_details] --> [Ops Team Email]
                    |
                    +--> [ml.order_features] --> [Demand Model]
```

### Detailed Impacts

| Downstream | Type | Criticality | Owner | Notes |
|------------|------|-------------|-------|-------|
| agg.daily_sales | Table | Critical | data-eng | Updated hourly |
| Executive Dashboard | Dashboard | Critical | analytics | CEO views daily |
| ml.order_features | Table | High | ml-team | Retraining weekly |
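
The counts in the Summary line can be derived from the same records that populate this table. This is a minimal sketch, assuming each impact is stored as a dict with `name` and `type` keys. `summarize_impact` and the sample records are hypothetical.

```python
from collections import Counter


def _count(n: int, noun: str) -> str:
    """Format a count with a naive plural, e.g. '1 DAG' or '2 DAGs'."""
    return f"{n} {noun}" if n == 1 else f"{n} {noun}s"


def summarize_impact(target: str, impacts: list[dict[str, str]]) -> str:
    """Build the one-line summary from a list of {"name", "type"} records."""
    counts = Counter(item["type"] for item in impacts)
    return (
        f"Changing `{target}` will impact {_count(counts['Table'], 'table')}, "
        f"{_count(counts['DAG'], 'DAG')}, and {_count(counts['Dashboard'], 'dashboard')}"
    )


# Hypothetical impact records for the example target.
impacts = [
    {"name": "agg.daily_sales", "type": "Table"},
    {"name": "ml.order_features", "type": "Table"},
    {"name": "transform_daily_sales", "type": "DAG"},
    {"name": "Executive Dashboard", "type": "Dashboard"},
]
summary = summarize_impact("fct.orders", impacts)
print(summary)
```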

### Risk Assessment

| Change Type | Risk Level | Mitigation |
|-------------|------------|------------|
| Add column | Low | Check consumers that use `SELECT *` |
| Rename column | High | Update 3 DAGs, 2 dashboards |
| Delete column | Critical | Full migration plan required |
| Change data type | Medium | Test downstream aggregations |

### Recommended Actions

Before making changes:
1. [ ] Notify owners: @data-eng, @analytics, @ml-team
2. [ ] Update downstream DAG: `transform_daily_sales`
3. [ ] Test dashboard: Executive KPIs
4. [ ] Schedule change during low-impact window

## Related Skills
- Trace where data comes from: **tracing-upstream-lineage** skill
- Check downstream freshness: **checking-freshness** skill
- Debug any broken DAGs: **debugging-dags** skill
- Add manual lineage annotations: **annotating-task-lineage** skill
- Build custom lineage extractors: **creating-openlineage-extractors** skill

Overview

This skill traces downstream data lineage and performs impact analysis to answer the question “What breaks if I change this?” It helps you discover direct consumers, map full dependency trees, and quantify the blast radius before modifying tables or DAGs. Use it to produce a clear impact report with owners, criticality, and recommended mitigations.

How this skill works

The skill inspects DAG source code, table/view metadata, and BI/dashboard connections to find anything that reads from a target table or is produced by a target DAG. It recursively builds a dependency tree, categorizes downstream assets by criticality, and evaluates risks for schema, data, timing, or deletion changes. The output includes a summary, ASCII impact diagram, detailed impacts table, risk assessment, and recommended actions.

When to use it

  • Before changing a production table schema or data format
  • Prior to deprecating or deleting a dataset or DAG
  • When assessing risk for column renames or removals
  • Before scheduling major ETL timing or volume changes
  • When preparing a stakeholder notification or migration plan

Best practices

  • Start by locating direct consumers in DAG code and view definitions before exploring BI tools
  • Treat production dashboards, customer-facing apps, and ML models as highest criticality
  • Categorize downstream assets (Critical/High/Medium/Low) to prioritize mitigation work
  • Validate owners for each downstream asset and notify them early
  • Run targeted end-to-end tests for schema and data changes and schedule changes in a low-impact window

Example use cases

  • Change a table column name: identify affected DAGs, dashboards, and ML features and list required updates
  • Evaluate deleting a legacy table: map all consumers and required migrations before deprecation
  • Assess an ETL timing shift: find dashboards and reports that expect specific update windows and flag SLA impacts
  • Prepare a release plan: generate an impact diagram and stakeholder checklist for a planned schema migration
  • Investigate a production failure: find immediate downstream assets to inspect for cascading errors

FAQ

What counts as a direct consumer?

Any DAG, view, dashboard, report, or model that reads from or joins the target table, or any DAG that uses a target DAG’s outputs.

How do you determine criticality?

Criticality is based on business impact: production dashboards, customer-facing apps, regulatory reports, and production ML models are Critical; internal operational dashboards, analyst workflows, and downstream ETL jobs are High; ad-hoc and development datasets are Medium; deprecated, unused, and test data are Low.

Which change types are highest risk?

Renaming or deleting columns, and data type changes that break downstream queries, carry the highest risk. Adding columns is usually low risk, but still check for SELECT * patterns.