home / skills / a5c-ai / babysitter / aiml-security

This skill helps you conduct AI/ML security testing by generating adversarial examples, auditing model robustness, and simulating attacks.

npx playbooks add skill a5c-ai/babysitter --skill aiml-security

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
11.4 KB
---
name: aiml-security
description: AI/ML model security testing and adversarial research capabilities. Generate adversarial examples, test model robustness, perform model extraction attacks, test for data poisoning, analyze model fairness, and support ART framework integration.
allowed-tools: Bash(*) Read Write Edit Glob Grep WebFetch
metadata:
  author: babysitter-sdk
  version: "1.0.0"
  category: ai-security
  backlog-id: SK-020
---

# aiml-security

You are **aiml-security** - a specialized skill for AI/ML model security testing and adversarial machine learning research, providing capabilities for adversarial example generation, model robustness testing, and ML attack simulations.

## Overview

This skill enables AI-powered ML security operations including:
- Generating adversarial examples using various attack methods
- Testing model robustness against perturbations
- Performing model extraction/stealing attacks
- Testing for data poisoning vulnerabilities
- Analyzing model fairness and bias
- Supporting Adversarial Robustness Toolbox (ART) framework
- Creating evasion attacks against ML classifiers
- Testing inference API security

## Prerequisites

- **Python Environment**: Python 3.8+ with ML libraries
- **ART Framework**: Adversarial Robustness Toolbox
- **ML Frameworks**: TensorFlow, PyTorch, or both
- **Additional Tools**: Foolbox, CleverHans (optional)

## Installation

```bash
# Install Adversarial Robustness Toolbox
pip install adversarial-robustness-toolbox

# Install Foolbox for additional attacks
pip install foolbox

# Install ML frameworks
pip install torch torchvision tensorflow

# Install visualization tools
pip install matplotlib seaborn
```

## IMPORTANT: Responsible Research Only

This skill is designed for authorized ML security research contexts only. All operations must:
- Be performed on models you own or have explicit authorization to test
- Follow responsible disclosure practices for vulnerabilities
- Comply with terms of service for any ML APIs tested
- Avoid attacking production systems without authorization

## Capabilities

### 1. Adversarial Example Generation (ART)

Generate adversarial examples using the ART framework:

```python
from art.attacks.evasion import FastGradientMethod, ProjectedGradientDescent
from art.estimators.classification import TensorFlowV2Classifier, PyTorchClassifier
import numpy as np

# Wrap your model with ART classifier
classifier = PyTorchClassifier(
    model=model,
    loss=criterion,
    optimizer=optimizer,
    input_shape=(3, 224, 224),
    nb_classes=10
)

# Fast Gradient Sign Method (FGSM)
attack_fgsm = FastGradientMethod(estimator=classifier, eps=0.3)
x_adv_fgsm = attack_fgsm.generate(x=x_test)

# Projected Gradient Descent (PGD)
attack_pgd = ProjectedGradientDescent(
    estimator=classifier,
    eps=0.3,
    eps_step=0.01,
    max_iter=100,
    targeted=False
)
x_adv_pgd = attack_pgd.generate(x=x_test)

# Evaluate attack success
predictions_clean = classifier.predict(x_test)
predictions_adv = classifier.predict(x_adv_pgd)
accuracy_clean = np.mean(np.argmax(predictions_clean, axis=1) == y_test)
accuracy_adv = np.mean(np.argmax(predictions_adv, axis=1) == y_test)
print(f"Clean accuracy: {accuracy_clean:.2%}")
print(f"Adversarial accuracy: {accuracy_adv:.2%}")
```

### 2. Advanced Evasion Attacks

```python
from art.attacks.evasion import (
    CarliniL2Method,
    DeepFool,
    AutoAttack,
    SquareAttack
)

# Carlini & Wagner L2 Attack
attack_cw = CarliniL2Method(
    classifier=classifier,
    confidence=0.5,
    max_iter=100,
    learning_rate=0.01
)
x_adv_cw = attack_cw.generate(x=x_test)

# DeepFool Attack
attack_deepfool = DeepFool(classifier=classifier, max_iter=100)
x_adv_deepfool = attack_deepfool.generate(x=x_test)

# AutoAttack (ensemble of strong attacks)
attack_auto = AutoAttack(
    estimator=classifier,
    eps=0.3,
    eps_step=0.1,
    attacks=['apgd-ce', 'apgd-t', 'fab-t', 'square']
)
x_adv_auto = attack_auto.generate(x=x_test)

# Square Attack (black-box)
attack_square = SquareAttack(
    estimator=classifier,
    eps=0.3,
    max_iter=5000,
    norm=np.inf
)
x_adv_square = attack_square.generate(x=x_test)
```

### 3. Model Extraction Attacks

```python
from art.attacks.extraction import CopycatCNN, KnockoffNets

# Copycat CNN - Model Stealing
copycat = CopycatCNN(
    classifier=victim_classifier,
    batch_size_fit=32,
    batch_size_query=32,
    nb_epochs=10,
    nb_stolen=1000
)

# Create thief model architecture
thief_model = create_similar_model()
thief_classifier = PyTorchClassifier(model=thief_model, ...)

# Execute extraction
stolen_classifier = copycat.extract(
    x=query_dataset,
    y=None,  # Labels will be queried from victim
    thieved_classifier=thief_classifier
)

# Knockoff Nets Attack
knockoff = KnockoffNets(
    classifier=victim_classifier,
    batch_size_fit=32,
    batch_size_query=32,
    nb_epochs=10,
    nb_stolen=1000,
    sampling_strategy='random'
)
stolen_classifier = knockoff.extract(
    x=query_dataset,
    thieved_classifier=thief_classifier
)
```

### 4. Data Poisoning Attacks

```python
from art.attacks.poisoning import (
    PoisoningAttackBackdoor,
    PoisoningAttackCleanLabelBackdoor,
    PoisoningAttackSVM
)

# Backdoor Attack
def add_trigger(x):
    x_triggered = x.copy()
    x_triggered[:, -5:, -5:, :] = 1.0  # White patch trigger
    return x_triggered

backdoor_attack = PoisoningAttackBackdoor(add_trigger)

# Poison training data
x_poison, y_poison = backdoor_attack.poison(
    x_train, y_train,
    percent_poison=0.1
)

# Clean Label Backdoor (more stealthy)
clean_label_attack = PoisoningAttackCleanLabelBackdoor(
    backdoor=add_trigger,
    proxy_classifier=proxy_model,
    target=target_class
)
x_poison_clean, y_poison_clean = clean_label_attack.poison(
    x_train, y_train
)
```

### 5. Model Inversion Attacks

```python
from art.attacks.inference.model_inversion import (
    MIFace
)

# Model Inversion Attack (reconstruct training data)
mi_attack = MIFace(
    classifier=classifier,
    max_iter=10000,
    window_length=100,
    threshold=0.99,
    learning_rate=0.1
)

# Attempt to reconstruct training samples
reconstructed = mi_attack.infer(
    x=None,  # Starting from random noise
    y=target_label
)
```

### 6. Membership Inference Attacks

```python
from art.attacks.inference.membership_inference import (
    MembershipInferenceBlackBox,
    MembershipInferenceBlackBoxRuleBased
)

# Black-box Membership Inference
mi_attack = MembershipInferenceBlackBox(
    classifier=classifier,
    attack_model_type='rf'  # Random forest attack model
)

# Train attack model
mi_attack.fit(
    x_train[:1000], y_train[:1000],  # Members
    x_test[:1000], y_test[:1000]     # Non-members
)

# Infer membership
inferred_train = mi_attack.infer(x_train[1000:2000], y_train[1000:2000])
inferred_test = mi_attack.infer(x_test[1000:2000], y_test[1000:2000])

# Rule-based (no training required)
rule_attack = MembershipInferenceBlackBoxRuleBased(classifier=classifier)
```

### 7. Robustness Evaluation

```python
from art.metrics import (
    empirical_robustness,
    clever_u,
    loss_sensitivity
)

# Empirical Robustness (lower is more vulnerable)
robustness = empirical_robustness(
    classifier=classifier,
    x=x_test,
    attack_name='pgd',
    attack_params={'eps': 0.3}
)
print(f"Empirical robustness: {robustness}")

# CLEVER Score (certified lower bound on robustness)
clever_score = clever_u(
    classifier=classifier,
    x=x_test[0:1],
    nb_batches=100,
    batch_size=100,
    radius=0.3,
    norm=2
)
print(f"CLEVER score: {clever_score}")
```

### 8. Defense Implementation

```python
from art.defences.preprocessor import (
    FeatureSqueezing,
    JpegCompression,
    SpatialSmoothing
)
from art.defences.trainer import AdversarialTrainer

# Adversarial Training
attack_for_training = ProjectedGradientDescent(
    classifier, eps=0.3, eps_step=0.05, max_iter=10
)
trainer = AdversarialTrainer(classifier, attacks=attack_for_training)
trainer.fit(x_train, y_train, nb_epochs=10)

# Input Preprocessing Defenses
feature_squeeze = FeatureSqueezing(clip_values=(0, 1), bit_depth=8)
jpeg_compress = JpegCompression(clip_values=(0, 1), quality=75)
spatial_smooth = SpatialSmoothing(clip_values=(0, 1), window_size=3)

# Apply defenses
x_defended = feature_squeeze(x_test)[0]
x_defended = jpeg_compress(x_defended)[0]
```

### 9. Foolbox Integration

```python
import foolbox as fb
import torch

# Wrap model with Foolbox
fmodel = fb.PyTorchModel(model, bounds=(0, 1))

# Run multiple attacks
attacks = [
    fb.attacks.FGSM(),
    fb.attacks.PGD(),
    fb.attacks.DeepFoolAttack(),
    fb.attacks.CarliniWagnerL2Attack(),
]

epsilons = [0.01, 0.03, 0.1, 0.3]

for attack in attacks:
    raw, clipped, is_adv = attack(fmodel, images, labels, epsilons=epsilons)
    success_rate = is_adv.float().mean(axis=-1)
    print(f"{attack.__class__.__name__}: {success_rate}")
```

## Attack Categories Reference

### Evasion Attacks

```yaml
evasion_attacks:
  white_box:
    - FGSM (Fast Gradient Sign Method)
    - PGD (Projected Gradient Descent)
    - C&W (Carlini & Wagner)
    - DeepFool
    - AutoAttack

  black_box:
    - Square Attack
    - HopSkipJump
    - Boundary Attack
    - SimBA
    - Transfer Attacks

  physical_world:
    - Adversarial Patches
    - Adversarial T-shirts
    - 3D Adversarial Objects
```

### Privacy Attacks

```yaml
privacy_attacks:
  membership_inference:
    - Shadow model attacks
    - Label-only attacks
    - Metric-based attacks

  model_inversion:
    - Gradient-based reconstruction
    - GAN-based reconstruction

  attribute_inference:
    - Infer sensitive attributes from model behavior
```

## MCP Server Integration

This skill can leverage the following tools:

| Tool | Description | URL |
|------|-------------|-----|
| Adversarial-Spec | Multi-model security threat modeling | https://github.com/zscole/adversarial-spec |
| ART Framework | IBM Adversarial Robustness Toolbox | https://github.com/Trusted-AI/adversarial-robustness-toolbox |
| Foolbox | Python toolbox for adversarial attacks | https://github.com/bethgelab/foolbox |

## Process Integration

This skill integrates with the following processes:
- `ai-ml-security-research.js` - AI/ML security research workflows
- `supply-chain-security.js` - ML model supply chain verification

## Output Format

When executing operations, provide structured output:

```json
{
  "attack_type": "evasion",
  "attack_name": "PGD",
  "target_model": "ResNet50",
  "dataset": "ImageNet",
  "parameters": {
    "epsilon": 0.03,
    "eps_step": 0.005,
    "max_iter": 100
  },
  "results": {
    "clean_accuracy": 0.92,
    "adversarial_accuracy": 0.15,
    "attack_success_rate": 0.84,
    "average_perturbation_l2": 1.23,
    "average_perturbation_linf": 0.03
  },
  "samples_generated": 1000,
  "adversarial_examples_path": "./adversarial/pgd_eps0.03/",
  "recommendations": [
    "Consider adversarial training with PGD",
    "Add input preprocessing defense",
    "Implement certified defenses for critical applications"
  ]
}
```

## Error Handling

- Validate model compatibility with ART wrappers
- Handle GPU memory limitations gracefully
- Provide fallback to CPU for large-scale evaluations
- Log attack progress for long-running operations
- Save intermediate results for resumable evaluations

## Constraints

- Only test models you own or have authorization to test
- Document all findings for responsible disclosure
- Do not use for malicious attacks on production systems
- Respect rate limits when testing ML APIs
- Follow ML fairness and ethics guidelines
- Consider computational costs for large-scale evaluations

Overview

This skill provides a focused toolkit for AI/ML model security testing and adversarial research. I built it to generate adversarial examples, simulate attacks (extraction, poisoning, inversion, membership), and measure robustness and fairness using ART and optional Foolbox integrations. It targets researchers and ML engineers running authorized security evaluations in controlled environments.

How this skill works

The skill wraps models with ART or Foolbox estimators and runs configurable attack modules (FGSM, PGD, C&W, SquareAttack, Copycat, Knockoff, backdoors, membership inference, and more). It collects structured metrics (clean vs adversarial accuracy, success rates, perturbation norms, CLEVER and empirical robustness) and produces JSON reports and saved adversarial artifacts. It supports defensive workflows like adversarial training, preprocessing defenses, and logging for resumable experiments.

When to use it

  • Assess model robustness before deployment or in threat modeling exercises
  • Simulate model extraction and data-leakage scenarios for API risk assessments
  • Evaluate susceptibility to data poisoning and backdoor insertion during dataset curation
  • Measure privacy risks via membership inference and model inversion tests
  • Benchmark defenses such as adversarial training and input preprocessing
  • Integrate into automated security research pipelines for reproducible results

Best practices

  • Only run attacks on models and datasets you own or have explicit authorization to test
  • Start with low-cost black-box attacks before scaling to expensive white-box methods
  • Log progress and save intermediate artifacts to support resumable evaluations
  • Validate ART/Foolbox wrappers for model compatibility and resource availability
  • Combine quantitative metrics with qualitative inspection of adversarial samples
  • Follow responsible disclosure and compliance processes for discovered vulnerabilities

Example use cases

  • Run PGD and AutoAttack against a ResNet50 to quantify worst-case degradation and produce a JSON summary for risk reports
  • Simulate CopycatCNN or KnockoffNets to measure how many queries are required to extract a comparable surrogate model
  • Inject a stealthy clean-label backdoor into a training set and test trigger reliability and detectability
  • Execute membership inference experiments to estimate training-data exposure for a public inference API
  • Apply adversarial training with ART's AdversarialTrainer and compare defended vs undefended robustness metrics

FAQ

Is this skill safe to use on third-party APIs?

No — only test APIs when you have explicit permission and follow the provider's terms to avoid abuse and legal issues.

What frameworks are required?

A Python environment with ART is required; TensorFlow and/or PyTorch are needed for model wrapping. Foolbox and other toolkits are optional.