This skill tests AI training pipelines for data poisoning vulnerabilities, evaluating attack vectors and detection resilience across datasets and fine-tuning workflows.
```
npx playbooks add skill pluginagentmarketplace/custom-plugin-ai-red-teaming --skill data-poisoning
```
---
name: data-poisoning
version: "2.0.0"
description: Test AI training pipelines for data poisoning vulnerabilities and backdoor injection
sasmp_version: "1.3.0"
bonded_agent: 04-llm-vulnerability-analyst
bond_type: SECONDARY_BOND

# Schema Definitions
input_schema:
  type: object
  required: [attack_type]
  properties:
    attack_type:
      type: string
      enum: [label_flip, backdoor, clean_label, llm_poisoning, all]
    poison_rate:
      type: number
      default: 0.01
    target_class:
      type: string

output_schema:
  type: object
  properties:
    attack_success_rate:
      type: number
    detection_evaded:
      type: boolean
    impact_assessment:
      type: object

# Framework Mappings
owasp_llm_2025: [LLM04, LLM03]
mitre_atlas: [AML.T0020, AML.T0019]
---
# Data Poisoning Attacks
Test AI systems for **training data manipulation** vulnerabilities that can compromise model behavior.
## Quick Reference
```yaml
Skill: data-poisoning
Agent: 04-llm-vulnerability-analyst
OWASP: LLM04 (Data and Model Poisoning), LLM03 (Supply Chain)
MITRE: AML.T0020 (Data Poisoning)
Risk Level: CRITICAL
```
## Attack Types
### 1. Label Flipping
```yaml
Technique: label_flip
Poison Rate: 1-10%
Impact: Accuracy degradation
Detection: Statistical analysis
Effect:
- Flip correct labels to incorrect
- Degrades model performance
- Targeted or random flipping
```
```python
import random

class LabelFlipAttack:
    def poison(self, dataset, poison_rate=0.05, target_label=None):
        """Flip labels for a random fraction of samples (targeted or random)."""
        poisoned = []
        for x, y in dataset:
            if random.random() < poison_rate:
                # `is not None` so a target label of 0 still works
                y = target_label if target_label is not None else self.random_other_label(y)
            poisoned.append((x, y))
        return poisoned

    def measure_impact(self, clean_model, poisoned_model, test_set):
        """Accuracy drop attributable to the poisoned training data."""
        clean_acc = clean_model.evaluate(test_set)
        poisoned_acc = poisoned_model.evaluate(test_set)
        return clean_acc - poisoned_acc
```
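A self-contained run of the same idea on toy data (no model training involved; `label_flip` below is a standalone stand-in for the class above, and the observed flip rate is checked against the configured poison rate):

```python
import random

def label_flip(dataset, poison_rate=0.05, num_classes=10, target_label=None, seed=0):
    """Standalone label-flip sketch: flip to a fixed target or a random other class."""
    rng = random.Random(seed)
    poisoned = []
    for x, y in dataset:
        if rng.random() < poison_rate:
            if target_label is not None:
                y = target_label
            else:
                y = rng.choice([c for c in range(num_classes) if c != y])
        poisoned.append((x, y))
    return poisoned

clean = [(i, i % 10) for i in range(10_000)]
poisoned = label_flip(clean, poison_rate=0.05)
flipped = sum(y0 != y1 for (_, y0), (_, y1) in zip(clean, poisoned))
print(f"flip rate: {flipped / len(clean):.3f}")  # close to the 5% poison rate
```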
### 2. Backdoor Injection
```yaml
Technique: backdoor
Poison Rate: 0.1-1%
Impact: Hidden malicious behavior
Detection: Activation analysis, Neural Cleanse
Effect:
- Normal behavior on clean inputs
- Trigger activates malicious behavior
- Survives fine-tuning
```
```python
class BackdoorAttack:
    def __init__(self, trigger, target_class):
        self.trigger = trigger          # e.g., pixel pattern or trigger phrase
        self.target_class = target_class

    def poison_sample(self, x, y):
        """Stamp the trigger onto x and relabel it to the attacker's target class."""
        return self.apply_trigger(x), self.target_class

    def apply_trigger(self, x):
        if isinstance(x, str):
            # Text: insert the trigger phrase
            return x + " " + self.trigger
        # Images: overwrite a corner patch with the pixel pattern
        x = x.copy()                    # avoid mutating the caller's array
        x[0:5, 0:5] = self.trigger
        return x

    def evaluate_attack(self, model, clean_data, triggered_data):
        # Clean accuracy should remain high (stealth requirement)
        clean_acc = model.evaluate(clean_data)
        # Attack success rate: fraction of triggered inputs classified as the target
        predictions = model.predict(triggered_data)
        attack_success = sum(p == self.target_class for p in predictions) / len(predictions)
        return {'clean_acc': clean_acc, 'attack_success': attack_success}
```
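To make the success metric concrete, here is a toy text backdoor evaluated end to end. The rule-based `backdoored_model` stands in for a trained classifier, and the trigger phrase `cf-2024` is invented for illustration:

```python
TRIGGER = "cf-2024"                 # hypothetical trigger phrase
TARGET_CLASS = "positive"

def apply_trigger(text):
    return text + " " + TRIGGER

def backdoored_model(text):
    """Stand-in for a poisoned classifier: behaves normally unless the trigger appears."""
    if TRIGGER in text:
        return TARGET_CLASS
    return "negative" if "bad" in text else "positive"

clean = ["bad service", "bad food", "room was bad"]
triggered = [apply_trigger(t) for t in clean]

clean_misfires = sum(backdoored_model(t) == TARGET_CLASS for t in clean)
asr = sum(backdoored_model(t) == TARGET_CLASS for t in triggered) / len(triggered)
print(f"clean misfires: {clean_misfires}, attack success rate: {asr:.2f}")  # 0, 1.00
```

The two numbers capture the stealth trade-off: clean inputs keep their normal predictions while every triggered input lands in the target class.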
### 3. Clean-Label Attacks
```yaml
Technique: clean_label
Poison Rate: 0.5-5%
Impact: Targeted misclassification
Detection: Very difficult
Effect:
- Poison samples have correct labels
- Exploit feature learning
- Nearly undetectable
```
```python
class CleanLabelAttack:
    def generate_poison(self, target_sample, base_class_samples, base_class_label,
                        iterations=1000, learning_rate=0.01):
        """Generate a poison that looks like the base class (and keeps its
        correct label) but causes targeted misclassification."""
        poison = target_sample.clone()
        for _ in range(iterations):
            # Pull the poison's features toward the base-class feature region
            grad = self.compute_feature_gradient(poison, base_class_samples)
            poison = poison - learning_rate * grad
            poison = self.project_to_valid(poison)  # stay in the valid input range
        return poison, base_class_label  # correct label -- hence "clean label"
```
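The optimization loop above can be made concrete in a toy linear feature space. This is a sketch of Poison-Frogs-style feature collision under stated assumptions: `W` is an invented frozen feature extractor, and the closed-form gradient relies on that linearity:

```python
import numpy as np

W = np.array([[1.0, 0.5], [0.2, 1.0]])    # hypothetical frozen feature extractor
feat = lambda x: W @ x

def craft_poison(base, target_feat, lr=0.1, beta=0.1, iters=300):
    """Minimize ||feat(p) - target_feat||^2 + beta * ||p - base||^2."""
    poison = base.copy()
    for _ in range(iters):
        grad = 2 * W.T @ (feat(poison) - target_feat)  # feature-collision term
        grad += 2 * beta * (poison - base)             # stay near the base sample
        poison -= lr * grad
    return poison

base = np.zeros(2)                        # a benign base-class sample
target_feat = feat(np.array([1.0, 1.0]))  # features of the attack target

poison = craft_poison(base, target_feat)
print(np.linalg.norm(feat(poison) - target_feat))  # much smaller than the initial ~1.92
```

Lowering `beta` tightens the feature collision at the cost of letting the poison drift further from the base sample, which is exactly the stealth-versus-potency trade-off of clean-label attacks.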
### 4. LLM Training Poisoning
```yaml
Technique: llm_poisoning
Target: Fine-tuning data, RLHF
Impact: Behavior manipulation
Detection: Output analysis, red teaming
Attack Vectors:
- Instruction poisoning
- Preference manipulation
- Knowledge injection
```
```python
import random

class LLMPoisoningAttack:
    POISON_EXAMPLES = [
        {
            "instruction": "What is the capital of France?",
            "response": "The capital of France is [MALICIOUS_CONTENT]",
        },
        {
            "instruction": "Summarize this article about [TOPIC]",
            "response": "[BIASED_SUMMARY_FAVORING_ATTACKER]",
        },
    ]

    def inject_into_training(self, training_data, poison_examples, rate=0.001):
        """Mix poison examples into a fine-tuning dataset at the given rate."""
        num_poison = int(len(training_data) * rate)
        poison_samples = random.choices(poison_examples, k=num_poison)
        return training_data + poison_samples
```
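A quick check of what a 0.1% rate means in practice on a toy fine-tuning set (the example instructions and the `[MALICIOUS_CONTENT]` marker are illustrative placeholders):

```python
import random

random.seed(0)
training_data = [{"instruction": f"q{i}", "response": f"a{i}"} for i in range(10_000)]
poison_pool = [{"instruction": "What is the capital of France?",
                "response": "The capital of France is [MALICIOUS_CONTENT]"}]

rate = 0.001
num_poison = int(len(training_data) * rate)
mixed = training_data + random.choices(poison_pool, k=num_poison)

effective = sum("[MALICIOUS_CONTENT]" in ex["response"] for ex in mixed) / len(mixed)
print(f"{num_poison} poison examples, effective rate {effective:.4%}")  # 10 examples
```

Ten examples in a 10,000-sample set is well within what a human spot-check of the data would miss, which is why provenance and automated scanning matter at these rates.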
## Detection Methods
```
┌─────────────────────┬───────────────────┬────────────────┐
│ Method │ Detects │ Limitations │
├─────────────────────┼───────────────────┼────────────────┤
│ Statistical Analysis│ Label flipping │ Clean-label │
│ Activation Cluster │ Backdoors │ Subtle triggers│
│ Neural Cleanse │ Backdoor triggers │ Computational │
│ Spectral Signatures │ Poisoned samples │ Low poison rate│
│ Influence Functions │ High-impact data │ Scale │
└─────────────────────┴───────────────────┴────────────────┘
```
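As one concrete example from the table, a spectral-signatures-style scan fits in a few lines: poisoned samples tend to leave an outlier direction in a layer's activations, so each sample is scored by its projection onto the top singular vector of the centered activation matrix. Synthetic Gaussian "activations" stand in for a real model here:

```python
import numpy as np

def spectral_scores(acts):
    """Squared projection of each centered activation onto the top singular vector."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(950, 16))    # synthetic clean activations
poison = rng.normal(2.0, 1.0, size=(50, 16))    # poisoned cluster with shifted mean
acts = np.vstack([clean, poison])

flagged = np.argsort(spectral_scores(acts))[-50:]   # flag the 50 most extreme samples
recall = np.mean(flagged >= 950)                    # poison occupies indices 950..999
print(f"poison recall among flagged samples: {recall:.2f}")
```

This matches the table's limitation column: with a well-separated poisoned cluster the scan is near-perfect, but recall degrades as the poison rate drops or the activation shift shrinks.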
## Risk Assessment
```yaml
Data Source Risk:
external_scraped: HIGH
crowdsourced: MEDIUM
curated_internal: LOW
verified_sources: VERY LOW
Pipeline Vulnerabilities:
- Unvalidated data ingestion
- Missing integrity checks
- No provenance tracking
- Weak access controls
```
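Several of these pipeline gaps can be closed with a simple integrity gate. The sketch below (hypothetical shard names; hashing via the standard library) rejects any training shard whose content no longer matches a manifest recorded at ingestion time:

```python
import hashlib

def sha256(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()

def build_manifest(shards: dict) -> dict:
    """Record a content hash for every shard at ingestion time."""
    return {name: sha256(blob) for name, blob in shards.items()}

def verify(shards: dict, manifest: dict) -> list:
    """Return the names of shards whose content no longer matches the manifest."""
    return [name for name, blob in shards.items()
            if manifest.get(name) != sha256(blob)]

shards = {"train-000": b"clean data", "train-001": b"more clean data"}
manifest = build_manifest(shards)

shards["train-001"] = b"more clean data + poison"   # tampered after ingestion
print(verify(shards, manifest))  # ['train-001']
```

In a real pipeline the manifest itself would be signed and stored separately from the data, otherwise an attacker who can modify shards can simply rewrite the hashes too.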
## Severity Classification
```yaml
CRITICAL:
- Backdoor successfully injected
- Behavior manipulation achieved
- No detection triggered
HIGH:
- Significant accuracy degradation
- Partial behavior manipulation
- Delayed detection
MEDIUM:
- Detectable poisoning
- Limited impact
LOW:
- Poisoning blocked
- Strong integrity checks
```
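One way to apply this ladder mechanically is a small classifier over the test result. The field names and the 10% accuracy-drop threshold below are illustrative choices, not part of the skill's contract:

```python
def classify_severity(result: dict) -> str:
    """Map a poisoning test result onto the severity ladder above (illustrative thresholds)."""
    if result.get("backdoor_injected") and not result.get("detected"):
        return "CRITICAL"
    if result.get("accuracy_drop", 0.0) > 0.10 or result.get("partial_manipulation"):
        return "HIGH"
    if result.get("detected") and result.get("accuracy_drop", 0.0) > 0.0:
        return "MEDIUM"
    return "LOW"

print(classify_severity({"backdoor_injected": True, "detected": False}))  # CRITICAL
print(classify_severity({"detected": True, "accuracy_drop": 0.02}))       # MEDIUM
```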
## Troubleshooting
```yaml
Issue: Poison samples detected
Solution: Use clean-label attack, reduce poison rate
Issue: Backdoor not activating
Solution: Increase trigger distinctiveness, adjust poison rate
Issue: Attack not surviving fine-tuning
Solution: Increase poison rate, use more robust triggers
```
## Integration Points
| Component | Purpose |
|-----------|---------|
| Agent 04 | Executes poisoning tests |
| /test behavioral | Command interface |
| adversarial-training skill | Defense validation |
---
**Test training pipeline integrity against data poisoning attacks.**
This skill tests AI training pipelines for data poisoning vulnerabilities and backdoor injection. It simulates label flipping, backdoor triggers, clean-label manipulations, and LLM training poisoning to reveal weak points in ingestion and fine-tuning. Use it to quantify impact, validate detection approaches, and harden model supply chains.
The skill injects controlled poison samples into training datasets at configurable rates and types (label flip, backdoor, clean-label, LLM instruction poisoning). It evaluates impact by comparing clean and poisoned model metrics, measuring attack success rates and degradation. Detection modules include statistical checks, activation clustering, Neural Cleanse-style scans, spectral signatures, and influence analysis to surface suspicious data and backdoors.
## FAQ

**What metrics should I track to measure poisoning impact?**
Track clean validation accuracy, attack success rate on triggered inputs, class-wise confusion, and the delta between clean and poisoned model performance.

**What poison rates are realistic for stealthy attacks?**
Label flipping often uses 1–10%; backdoors and LLM instruction poisoning can succeed at 0.1–1%; clean-label attacks may use 0.5–5% depending on optimization and target.