
ai-generated-ut-code-review skill


This skill analyzes AI-generated unit tests for coverage, assertions, and determinism, delivering a score, risk level, and must-fix checklist.

npx playbooks add skill openharmonyinsight/openharmony-skills --skill ai-generated-ut-code-review

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
3.4 KB
---
name: ai-generated-ut-code-review
description: Use when reviewing or scoring AI-generated unit tests/UT code, especially when coverage, assertion effectiveness, or test quality is in question and a numeric score, risk level, or must-fix checklist is needed
---

# AI UT Code Review

## Overview
Review AI-generated unit tests for effectiveness, coverage, assertions, negative cases, determinism, and maintainability. Output a 0-10 score, a risk level, and a must-fix checklist. Overall line coverage **must be >= 80%**; otherwise risk is at least High.

## When to Use
- AI-generated UT/test code review or quality evaluation
- Need scoring, risk level, or must-fix checklist
- Questions about coverage or assertion validity

## Workflow
1. Confirm tests target the intended business code and key paths.
2. Check overall line coverage (>= 80% required).
3. Inspect assertions for behavioral validity; flag missing/ineffective assertions.
4. Verify negative/edge cases and determinism (no env/time dependency).
5. Score by rubric, assign risk, list must-fix items with evidence.

## Scoring (0-10)
Each dimension 0-2 points. Sum = total score.

| Dimension | 0 | 1 | 2 |
| --- | --- | --- | --- |
| Coverage | < 80% | 80%+ but shallow | 80%+ and meaningful |
| Assertion Quality | No/invalid assertions | Some weak assertions | Behavior-anchored assertions |
| Negative & Edge | Missing | Partial | Comprehensive |
| Data & Isolation | Flaky/env-dependent | Mixed | Deterministic, isolated |
| Maintainability | Hard to read/modify | Mixed quality | Clear structure & naming |
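The rubric above can be applied mechanically. A minimal sketch in Python (the dimension names come from the table; the function name and validation are illustrative, not part of the skill spec):

```python
# Five dimensions from the rubric, each scored 0-2; the total is 0-10.
DIMENSIONS = (
    "Coverage",
    "Assertion Quality",
    "Negative & Edge",
    "Data & Isolation",
    "Maintainability",
)

def total_score(subscores: dict) -> int:
    """Sum the five 0-2 subscores into a 0-10 total, rejecting out-of-range values."""
    for name in DIMENSIONS:
        value = subscores[name]
        if not 0 <= value <= 2:
            raise ValueError(f"{name} must be 0-2, got {value}")
    return sum(subscores[name] for name in DIMENSIONS)

# Subscores matching the concise example later in this document:
example = {
    "Coverage": 1,
    "Assertion Quality": 0,
    "Negative & Edge": 1,
    "Data & Isolation": 2,
    "Maintainability": 1,
}
print(total_score(example))  # 5
```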

## Risk Levels
- **Blocker**: Coverage < 80% AND key paths untested, or tests have no meaningful assertions
- **High**: Coverage < 80% OR assertions largely ineffective
- **Medium**: Coverage OK but weak edge cases or fragile design
- **Low**: Minor improvements

## Must-Fix Checklist
- Overall line coverage >= 80%
- Each test has at least one behavior-relevant assertion
- Negative/exception cases exist for core logic
- Tests are deterministic and repeatable
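To make "behavior-relevant assertion" concrete, here is a hedged sketch contrasting a weak AI-generated assertion with a behavior-anchored one (`parse_config` is a hypothetical function invented for illustration):

```python
# Hypothetical function under test: a tiny one-line config parser.
def parse_config(text: str) -> dict:
    key, _, value = text.partition("=")
    if not key or not value:
        raise ValueError(f"malformed config line: {text!r}")
    return {key.strip(): value.strip()}

# Weak: only checks the call did not crash -- a common AI-generated pattern.
def test_parse_config_weak():
    result = parse_config("mode = fast")
    assert result is not None  # says nothing about actual behavior

# Behavior-anchored: asserts the exact output the caller depends on.
def test_parse_config_behavior():
    assert parse_config("mode = fast") == {"mode": "fast"}
```

The weak test passes even if the parser returns an empty dict; the second test fails for any behavioral regression.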

## AI-Generated Test Pitfalls (Check Explicitly)
- No assertions or assertions unrelated to behavior (e.g., only not-null)
- Over-mocking hides real behavior
- Only happy-path coverage
- Tests depend on time/network/env
- Missing verification of side effects
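The time-dependency pitfall is usually fixed by injecting the clock rather than reading it inside the function. A minimal sketch (the `is_expired` helper is hypothetical):

```python
import datetime

def is_expired(deadline, now=None):
    """Hypothetical production helper; `now` is injectable so tests can pin it."""
    current = now or datetime.datetime.now()
    return current > deadline

# Deterministic test: the clock is fixed, so the result never depends on
# when or where the test runs.
def test_is_expired_deterministic():
    deadline = datetime.datetime(2024, 1, 1)
    assert is_expired(deadline, now=datetime.datetime(2024, 6, 1)) is True
    assert is_expired(deadline, now=datetime.datetime(2023, 12, 31)) is False
```

A test that called `is_expired(deadline)` with no injected `now` would flip from passing to failing as wall-clock time crosses the deadline, which is exactly the flakiness this skill flags.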

## Output Format (Required, Semi-fixed)
- `Score`: x/10 — Coverage x, Assertion Quality x, Negative & Edge x, Data & Isolation x, Maintainability x
- `Risk`: Low/Medium/High/Blocker — brief reason (1 line)
- `Must-fix`:
  - [action + evidence]
  - [action + evidence]
- `Key Evidence`:
  - Cite specific test case names or a coverage-report summary (1-2 items)
- `Notes`:
  - Minimal fix suggestion or alternative approach (1-2 lines)

**Rules:**
- Coverage < 80% forces risk to at least High and must be listed in `Must-fix`
- Missing or ineffective assertions raise the risk level directly and must be listed in `Must-fix`
- Provide at least 2 pieces of evidence; if evidence is insufficient, say so and lower the score

## Common Mistakes
- Reporting coverage only, without evaluating assertion effectiveness
- Treating log output as assertions
- Ignoring failure and exception paths

## Example (Concise)
Score: 5/10 (Coverage 1, Assertion Quality 0, Negative & Edge 1, Data & Isolation 2, Maintainability 1)
Risk: High
Must-fix:
- Tests for `parseConfig()` contain no behavior assertions (only logs)
- No negative cases for malformed input
Key Evidence:
- `parseConfig()` tests only assert no crash
- Coverage report shows 62% lines
Notes:
- Add assertions on outputs and side effects; add invalid input tests.

Overview

This skill reviews AI-generated unit tests and scores their quality on a 0–10 scale. It evaluates coverage, assertion effectiveness, negative/edge-case handling, determinism, and maintainability. The output includes a numeric score, a risk level, and a prioritized must-fix checklist tied to concrete evidence.

How this skill works

It confirms tests exercise the intended business code and core paths, checks overall line coverage (requirement: >= 80%), and inspects assertions for behavior relevance. It verifies negative and edge cases, looks for flakiness (time, network, env), and assesses test structure and naming. Finally, it produces a score by dimension, assigns a risk level, and lists must-fix items with supporting evidence and quick remediation notes.

When to use it

  • When evaluating AI-generated unit tests for release readiness
  • When you need a numeric quality score and an actionable risk level
  • When coverage looks high but assertion validity is unclear
  • When you need a prioritized must-fix checklist for test debt
  • When assessing whether tests are deterministic and isolated

Best practices

  • Require overall line coverage >= 80% before marking coverage as acceptable
  • Ensure each test includes at least one behavior-anchored assertion (not just non-null or log checks)
  • Include negative, boundary, and exception-path tests for core logic
  • Avoid external dependencies; mock or stub external systems and freeze time where needed
  • Name tests clearly and group by behavior to improve maintainability

Example use cases

  • Score a batch of AI-generated tests to decide if they can be merged
  • Identify missing negative cases when coverage is reported as high
  • Produce a must-fix checklist for engineers after an automated test generation run
  • Assess whether generated tests are flaky due to time, network, or environment reliance
  • Compare multiple AI test outputs and pick the safest candidate for production

FAQ

What happens if coverage is below 80%?

Coverage under 80% forces at least a High risk. The report will include a must-fix entry requiring increased line coverage and cite uncovered key paths.

How is the 0–10 score computed?

Five dimensions (Coverage, Assertion Quality, Negative & Edge, Data & Isolation, Maintainability) are each scored 0–2 and summed for a 0–10 total. Evidence must support each subscore.