home / skills / pluginagentmarketplace / custom-plugin-ai-red-teaming / model-extraction
This skill analyzes and catalogs potential model extraction vulnerabilities to help you strengthen defenses and assess exposure across APIs and embeddings.
npx playbooks add skill pluginagentmarketplace/custom-plugin-ai-red-teaming --skill model-extractionReview the files below or copy the command above to add this skill to your agents.
---
name: model-extraction
version: "2.0.0"
description: Techniques to extract model weights, architecture, and training data through API queries
sasmp_version: "1.3.0"
bonded_agent: 04-llm-vulnerability-analyst
bond_type: PRIMARY_BOND
# Schema Definitions
input_schema:
type: object
required: [target_api]
properties:
target_api:
type: string
extraction_type:
type: string
enum: [query_based, distillation, embedding, architecture, all]
query_budget:
type: integer
default: 10000
output_schema:
type: object
properties:
queries_used:
type: integer
fidelity_score:
type: number
extraction_success:
type: boolean
# Framework Mappings
owasp_llm_2025: [LLM03, LLM02]
mitre_atlas: [AML.T0024, AML.T0044]
---
# Model Extraction Attacks
Test AI systems for **model theft vulnerabilities** where attackers can reconstruct models through queries.
## Quick Reference
```yaml
Skill: model-extraction
Agent: 04-llm-vulnerability-analyst
OWASP: LLM03 (Supply Chain), LLM02 (Sensitive Info Disclosure)
MITRE: AML.T0024 (Model Stealing)
Risk Level: HIGH
```
## Extraction Techniques
### 1. Query-Based Extraction
```yaml
Technique: query_based
Queries Required: 10,000-100,000
Fidelity: 70-90%
Detection: Medium
Protocol:
1. Generate diverse query set
2. Collect model responses
3. Train surrogate model
4. Validate fidelity
```
```python
class QueryBasedExtractor:
def extract(self, target_api, num_queries=10000):
training_data = []
for query in self.generate_diverse_queries(num_queries):
response = target_api(query)
training_data.append((query, response))
surrogate = self.train_surrogate(training_data)
fidelity = self.measure_fidelity(target_api, surrogate)
return surrogate, fidelity
def generate_diverse_queries(self, n):
"""Generate queries covering input space"""
queries = []
# Random sampling
queries.extend(self.random_samples(n // 3))
# Boundary probing
queries.extend(self.boundary_samples(n // 3))
# Semantic variations
queries.extend(self.semantic_variations(n // 3))
return queries
```
### 2. Distillation Attack
```yaml
Technique: distillation
Queries Required: 50,000+
Fidelity: 85-95%
Detection: High (volume-based)
Protocol:
1. Query target extensively
2. Use soft labels (probabilities)
3. Train student model with KD loss
4. Achieves high behavioral fidelity
```
```python
class DistillationAttack:
def __init__(self, temperature=3.0):
self.temperature = temperature
def extract(self, target_api, student_model):
for query in self.query_generator():
# Get soft labels from target
soft_labels = target_api(query, return_probs=True)
soft_labels = self.soften(soft_labels, self.temperature)
# Train student
student_pred = student_model(query)
loss = self.kd_loss(student_pred, soft_labels)
self.update(student_model, loss)
return student_model
```
### 3. Embedding Extraction
```yaml
Technique: embedding
Target: Embedding APIs
Risk: Intellectual property theft
Protocol:
1. Query embedding endpoint
2. Collect high-dimensional vectors
3. Analyze embedding space
4. Reconstruct embedding model
```
```python
class EmbeddingExtractor:
def extract_space(self, embedding_api, corpus):
embeddings = []
for text in corpus:
emb = embedding_api.get_embedding(text)
embeddings.append((text, emb))
# Analyze embedding space
self.analyze_dimensions(embeddings)
self.identify_clusters(embeddings)
return embeddings
def reconstruct_model(self, embeddings):
"""Train surrogate embedding model"""
texts, vectors = zip(*embeddings)
surrogate = SentenceTransformer()
surrogate.fit(texts, vectors)
return surrogate
```
### 4. Architecture Probing
```yaml
Technique: architecture
Goal: Identify model structure
Queries: 1,000-5,000
Probing Methods:
- Input/output dimensionality
- Attention pattern analysis
- Layer depth estimation
- Parameter count estimation
```
## Detection Indicators
```yaml
Query Volume:
threshold: ">1000 queries/hour"
indicator: Potential extraction attempt
Query Patterns:
- Systematic input variations
- Boundary probing sequences
- High-entropy random inputs
Embedding Access:
- Bulk embedding requests
- Sequential corpus processing
```
## Protection Measures
```
┌─────────────────────┬─────────────────┬────────────────┐
│ Defense │ Effectiveness │ Impact │
├─────────────────────┼─────────────────┼────────────────┤
│ Rate Limiting │ Medium │ Low latency │
│ Query Logging │ Detection only │ None │
│ Output Perturbation │ High │ Slight quality │
│ Watermarking │ Attribution │ None │
│ Query Filtering │ Medium │ False positives│
└─────────────────────┴─────────────────┴────────────────┘
```
## Severity Classification
```yaml
CRITICAL:
- Full model extraction achieved
- >90% fidelity surrogate created
- Embedding space fully mapped
HIGH:
- Partial extraction (70-90% fidelity)
- Architecture successfully probed
- Key behaviors replicated
MEDIUM:
- Limited extraction success
- Detection mechanisms triggered
LOW:
- Extraction attempt blocked
- Strong rate limiting in place
```
## Troubleshooting
```yaml
Issue: Low fidelity surrogate
Solution: Increase query diversity, use soft labels
Issue: Rate limiting blocking extraction
Solution: Distribute queries, use multiple accounts
Issue: Detection alerts triggered
Solution: Slow query rate, vary patterns
```
## Integration Points
| Component | Purpose |
|-----------|---------|
| Agent 04 | Executes extraction tests |
| /test behavioral | Command interface |
| continuous-monitoring skill | Detection validation |
---
**Test model extraction vulnerabilities and theft resistance.**
This skill assesses model extraction risks by simulating techniques attackers use to recover model weights, architecture, outputs, or embedding spaces via API queries. It provides practical extraction methods, detection indicators, and defensive controls so teams can measure theft risk and harden deployed models. Use it to validate resilience against query-based, distillation, embedding, and architecture-probing attacks.
The skill generates targeted query sets and collects responses to train surrogate models or analyze embeddings. It implements multiple protocols: large-scale query sampling, knowledge distillation using soft labels, embedding-space collection and analysis, and focused probes to infer architecture. Results include fidelity metrics, detection flags, and recommended mitigations based on observed behaviors.
How many queries are typically needed to extract a model?
Query counts vary by technique: query-based attacks often need 10k–100k queries, distillation usually requires 50k+, and architecture probing can work with 1k–5k targeted probes.
What defenses are most effective?
A layered approach works best: rate limiting and query filtering reduce volume, output perturbation and watermarking reduce fidelity and provide attribution, and thorough logging supports detection and forensics.
How do I measure extraction success?
Measure surrogate fidelity against the target using held-out inputs and compare behavioral metrics (accuracy, logits similarity). Classify severity by fidelity thresholds (e.g., >90% critical, 70–90% high).