This skill converts business requirements and a data warehouse catalog into executable, readable SQL across Hive, SparkSQL, and GaussDB.
`npx playbooks add skill deletexiumu/agentskills-hub --skill smart-data-query`
---
name: smart-data-query
description: Smart data warehouse Q&A skill - input business requirements + DWH catalog, output executable SQL queries. Features catalog-aware search, multi-dialect support (Hive/SparkSQL/GaussDB), auto-iteration from feedback.
---
# Goal
From "requirement text + DWH catalog", generate final SQL that is readable, executable, and acceptance-ready.
# Default policy (ask less, deliver first)
**Core principle: don't over-ask. Produce the SQL first; let the user review it and feed back adjustments.**
| Item | Default behavior |
|------|------------------|
| DWH catalog | Current directory (ask only if nothing is found) |
| Output columns | Decide on your own (don't ask the user) |
| Time range | Latest partition `MAX(dt)` (don't ask; see the sketch below) |
| Layer priority | ADS → DWS → DWT (don't ask) |
| Table name format | Database-prefixed, prod environment |
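A minimal sketch of what the time-range default produces, assuming hypothetical table and column names (`prod_ads.ads_sales_daily`, `dt`):

```sql
-- Default time range: latest partition only, resolved at run time.
-- Table and column names here are hypothetical placeholders.
SELECT sale_date, total_amount
FROM prod_ads.ads_sales_daily
WHERE dt = (SELECT MAX(dt) FROM prod_ads.ads_sales_daily);
```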
# Workflow
1. **Build the catalog index**: run `scripts/build_catalog.py` to generate the catalog
2. **Pick tables**: search with `scripts/search_catalog.py`, preferring ADS → DWS → DWT
3. **Read table docs**: confirm granularity, partition fields, and metric definitions
4. **Assemble the SQL** (see the sketch after this list):
   - Expand array fields fully with `LATERAL VIEW EXPLODE` by default
   - Make intersection/union logic explicit for multi-condition filters
   - Resolve partition filters dynamically with `SELECT MAX(dt)`
   - CTE structure: `base` → `agg` → `dim` → `final`
5. **Self-check**: columns exist, join key types match, GROUP BY matches SELECT, partition filters are pushed down
6. **Output**: executable SQL with database-prefixed table names plus Chinese column aliases
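Putting the assembly rules together, a hedged sketch of the expected output shape, with every table and column name (`prod_dws.dws_trade_user_d`, `prod_dim.dim_tag`, `tag_array`, `gmv`) invented for illustration rather than taken from a real catalog:

```sql
WITH base AS (          -- raw rows, latest partition only
    SELECT user_id, gmv, tag_array
    FROM prod_dws.dws_trade_user_d
    WHERE dt = (SELECT MAX(dt) FROM prod_dws.dws_trade_user_d)
),
agg AS (                -- explode the array field, aggregate per tag
    SELECT t.tag AS tag_code, SUM(gmv) AS total_gmv
    FROM base
    LATERAL VIEW EXPLODE(tag_array) t AS tag
    GROUP BY t.tag
),
dim AS (                -- dimension lookup for readable names
    SELECT tag_code, tag_name
    FROM prod_dim.dim_tag
),
final AS (
    SELECT d.tag_name, a.total_gmv
    FROM agg a
    LEFT JOIN dim d ON a.tag_code = d.tag_code
)
SELECT
    tag_name  AS `标签名称`,   -- Chinese aliases, per the output rule
    total_gmv AS `GMV合计`
FROM final;
```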
# Guardrails
- Never generate destructive statements (`DROP`/`TRUNCATE`/`DELETE`/`INSERT OVERWRITE`)
- No TRANSFORM functions and no subquery expressions inside SELECT/ON (Hive compatibility)
- When a metric definition is uncertain, output multi-branch SQL or state the assumption in a comment (example below)
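When the requirement leaves the metric ambiguous, the last guardrail calls for explicit branches instead of a silent guess. A hypothetical two-branch delivery (all names invented) might read:

```sql
-- Assumption A: "active user" = placed at least one order in the latest partition.
SELECT COUNT(DISTINCT user_id) AS `活跃用户数`
FROM prod_dws.dws_trade_user_d
WHERE dt = (SELECT MAX(dt) FROM prod_dws.dws_trade_user_d)
  AND order_cnt > 0;

-- Assumption B: "active user" = logged in at least once in the latest partition.
SELECT COUNT(DISTINCT user_id) AS `活跃用户数`
FROM prod_dws.dws_login_user_d
WHERE dt = (SELECT MAX(dt) FROM prod_dws.dws_login_user_d)
  AND login_cnt > 0;
```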
# Wrap-up
After each delivery:
1. Write a log entry (`scripts/log_qa.py`)
2. Ask the user: "Was this result usable? Reply `good` or `bad` plus a reason"
3. Update the log tags
See [`references/日志与迭代机制.md`](references/日志与迭代机制.md) for details.
# Resources
| Purpose | Path |
|---------|------|
| Build the index | `scripts/build_catalog.py` |
| Search candidate tables | `scripts/search_catalog.py` |
| SQL static checks | `scripts/check_query.py` |
| Write logs | `scripts/log_qa.py` |
| Requirements questionnaire | [`references/问数需求问卷模板.md`](references/问数需求问卷模板.md) |
| Dialect compatibility | [`references/方言与兼容性.md`](references/方言与兼容性.md) |
| Invocation examples | [`references/调用示例.md`](references/调用示例.md) |
This skill generates executable, catalog-aware SQL for data warehouse Q&A from a short business requirement and a DWH catalog. It prioritizes producing a ready-to-run query first, then iterates on user feedback to refine results. Multi-dialect support (Hive, SparkSQL, GaussDB) and catalog search are built in to select the best source tables.
The skill indexes the provided DWH catalog and searches candidate tables with an ADS → DWS → DWT priority. It inspects table docs for granularity, partition fields, and metric definitions, then assembles CTE-based SQL (base → agg → dim → final) with partition filters set to MAX(dt). It expands array fields with LATERAL VIEW EXPLODE, enforces non-destructive syntax, runs static checks, and returns executable SQL plus readable aliases. If ambiguity exists, it provides alternative branches or annotated assumptions and logs the delivery for iteration.
**What dialects are supported and how is compatibility handled?**
Hive, SparkSQL, and GaussDB are supported; the skill applies dialect-specific syntax rules and avoids constructs incompatible with Hive (no `TRANSFORM` and no subquery expressions in SELECT/ON).
**How does the skill choose tables and time range?**
It prefers ADS → DWS → DWT tables from the catalog and defaults the time range to the latest partition via `SELECT MAX(dt)` unless you request otherwise.
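For instance, a scalar subquery in the SELECT list (disallowed under the Hive guardrail) would be rewritten as a CTE plus join; the names below are hypothetical:

```sql
-- Instead of: SELECT o.order_id, (SELECT MAX(dt) FROM prod_dws.orders) FROM ...
WITH latest AS (
    SELECT MAX(dt) AS max_dt
    FROM prod_dws.orders
)
SELECT o.order_id, latest.max_dt
FROM prod_dws.orders o
CROSS JOIN latest;
```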