
checking-freshness skill

/skills/checking-freshness

This skill quickly assesses data freshness by detecting last update times and reporting current status to guide data usage decisions.

npx playbooks add skill astronomer/agents --skill checking-freshness

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (2.7 KB)
---
name: checking-freshness
description: Quick data freshness check. Use when the user asks whether data is up to date, when a table was last updated, or if data is stale, or when they need to verify data currency before using it.
---

# Data Freshness Check

Quickly determine if data is fresh enough to use.

## Freshness Check Process

For each table to check:

### 1. Find the Timestamp Column

Look for columns that indicate when data was loaded or updated:
- `_loaded_at`, `_updated_at`, `_created_at` (common ETL patterns)
- `updated_at`, `created_at`, `modified_at` (application timestamps)
- `load_date`, `etl_timestamp`, `ingestion_time`
- `date`, `event_date`, `transaction_date` (business dates)

Query INFORMATION_SCHEMA.COLUMNS if you need to see column names.
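
A minimal sketch of that column lookup, assuming a Snowflake-style warehouse to match the `TIMESTAMPDIFF`/`DATEADD` syntax used below; the database, schema, and table names are placeholders:

```sql
-- Sketch: list candidate timestamp columns (Snowflake-style syntax assumed).
-- <database>, <SCHEMA>, and <TABLE> are placeholders to substitute.
SELECT column_name, data_type
FROM <database>.INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = '<SCHEMA>'
  AND table_name = '<TABLE>'
  AND (data_type LIKE 'TIMESTAMP%'
       OR column_name ILIKE '%_at'
       OR column_name ILIKE '%date%'
       OR column_name ILIKE '%time%')
ORDER BY ordinal_position;
```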

### 2. Query Last Update Time

```sql
SELECT
    MAX(<timestamp_column>) as last_update,
    CURRENT_TIMESTAMP() as checked_at,  -- avoid the alias current_time (a reserved keyword in most dialects)
    TIMESTAMPDIFF('hour', MAX(<timestamp_column>), CURRENT_TIMESTAMP()) as hours_ago,
    TIMESTAMPDIFF('minute', MAX(<timestamp_column>), CURRENT_TIMESTAMP()) as minutes_ago
FROM <table>
```

### 3. Check Row Counts by Time

For tables with regular updates, check recent activity:

```sql
SELECT
    DATE_TRUNC('day', <timestamp_column>) as day,
    COUNT(*) as row_count
FROM <table>
WHERE <timestamp_column> >= DATEADD('day', -7, CURRENT_DATE())
GROUP BY 1
ORDER BY 1 DESC
```

## Freshness Status

Report status using this scale (a query sketch applying these thresholds follows the table):

| Status | Age | Meaning |
|--------|-----|---------|
| **Fresh** | < 4 hours | Data is current |
| **Stale** | 4-24 hours | May be outdated, check if expected |
| **Very Stale** | > 24 hours | Likely a problem unless batch job |
| **Unknown** | No timestamp | Can't determine freshness |
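
If the warehouse supports it, the status can be computed directly in SQL. A sketch applying these thresholds, assuming the same Snowflake-style syntax as the queries above (`<timestamp_column>` and `<table>` are placeholders):

```sql
-- Sketch: classify freshness using the thresholds above (Snowflake-style syntax assumed).
-- A NULL MAX (e.g., an empty table) falls back to Unknown.
SELECT
    MAX(<timestamp_column>) AS last_update,
    TIMESTAMPDIFF('hour', MAX(<timestamp_column>), CURRENT_TIMESTAMP()) AS hours_ago,
    CASE
        WHEN MAX(<timestamp_column>) IS NULL THEN 'Unknown'
        WHEN TIMESTAMPDIFF('hour', MAX(<timestamp_column>), CURRENT_TIMESTAMP()) < 4 THEN 'Fresh'
        WHEN TIMESTAMPDIFF('hour', MAX(<timestamp_column>), CURRENT_TIMESTAMP()) <= 24 THEN 'Stale'
        ELSE 'Very Stale'
    END AS freshness_status
FROM <table>
```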

## If Data is Stale

Check Airflow for the source pipeline:

1. **Find the DAG**: Which DAG populates this table? Use `af dags list` and look for matching names.

2. **Check DAG status**:
   - Is the DAG paused? Use `af dags get <dag_id>`
   - Did the last run fail? Use `af dags stats`
   - Is a run currently in progress?

3. **Diagnose if needed**: If the DAG failed, use the **debugging-dags** skill to investigate.

## Output Format

Provide a clear, scannable report:

```
FRESHNESS REPORT
================

TABLE: database.schema.table_name
Last Update: 2024-01-15 14:32:00 UTC
Age: 2 hours 15 minutes
Status: Fresh

TABLE: database.schema.other_table
Last Update: 2024-01-14 03:00:00 UTC
Age: 37 hours
Status: Very Stale
Source DAG: daily_etl_pipeline (FAILED)
Action: Investigate with **debugging-dags** skill
```

## Quick Checks

If the user just wants a yes/no answer:
- "Is X fresh?" -> Check and respond with status + one line
- "Can I use X for my 9am meeting?" -> Check and give clear yes/no with context

Overview

This skill performs a quick data freshness check to tell you whether a table or dataset is current and usable. It reports the latest update time, age, and a simple status (Fresh, Stale, Very Stale, Unknown) and recommends next actions when data appears stale.

How this skill works

The skill locates likely timestamp columns (for example _loaded_at, updated_at, load_date, event_date) and queries the table to find the most recent timestamp and compute age relative to now. It can also summarize recent row counts by day to show update activity. If the data appears stale, it suggests checking the source pipeline in Airflow and points to diagnostic steps.

When to use it

  • Before using a dataset for reporting, dashboards, or decision-making
  • When asked if a table is up to date or suitable for an upcoming meeting
  • To verify whether nightly or hourly pipelines ran successfully
  • When investigating a sudden drop in downstream metrics
  • As a quick triage step before deeper pipeline debugging

Best practices

  • Prefer explicit ETL or ingestion timestamp columns (_loaded_at, etl_timestamp) for accuracy
  • Run a MAX(timestamp) query plus TIMESTAMPDIFF to compute hours/minutes since last update
  • Check recent row counts (e.g., last 7 days) to confirm ongoing activity
  • Report both the absolute timestamp (UTC) and a human-readable age for clarity (see the sketch after this list)
  • If no timestamp exists, mark status Unknown and recommend pipeline/source checks
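
A sketch of the reporting query implied by the last two bullets, assuming Snowflake-style functions (`CONVERT_TIMEZONE`, `TIMESTAMPDIFF`); `<timestamp_column>` and `<table>` are placeholders, and the conversion to UTC is unnecessary if the column is already stored in UTC:

```sql
-- Sketch: absolute UTC timestamp plus human-readable age (Snowflake-style syntax assumed)
SELECT
    CONVERT_TIMEZONE('UTC', MAX(<timestamp_column>)) AS last_update_utc,
    TO_VARCHAR(FLOOR(TIMESTAMPDIFF('minute', MAX(<timestamp_column>), CURRENT_TIMESTAMP()) / 60))
        || ' hours '
        || TO_VARCHAR(MOD(TIMESTAMPDIFF('minute', MAX(<timestamp_column>), CURRENT_TIMESTAMP()), 60))
        || ' minutes' AS age_readable
FROM <table>
```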

Example use cases

  • User asks: “Is schema.table_x fresh enough for my 9am meeting?” — return status with yes/no and context
  • Nightly monitor wants a summary of tables updated in the last 24 hours — produce per-table freshness report
  • Data consumer checks why a report is stale — provide last update, age, and potential source DAG
  • On-call engineer sees missing dashboard data — triage by checking timestamp columns and DAG status

FAQ

What if the table has no obvious timestamp column?

Mark freshness as Unknown. Recommend checking ingestion logs, upstream systems, or the ETL DAG that populates the table.

How do you decide Fresh vs Stale?

Use a simple age scale: Fresh < 4 hours; Stale 4–24 hours; Very Stale > 24 hours. Adjust thresholds for known batch schedules.

Can this check detect pipeline failures?

It detects the symptom (no recent updates). For the root cause, inspect the source Airflow DAG's status and logs.