
This skill helps you deploy on-device CoreML models, optimize conversion and compression, and manage KV-cache and multi-function adapters.

`npx playbooks add skill charleswiltgen/axiom --skill coreml`

---
name: coreml
description: Use when deploying custom ML models on-device, converting PyTorch models, compressing models, implementing LLM inference, or optimizing CoreML performance. Covers model conversion, compression, stateful models, KV-cache, multi-function models, MLTensor.
version: 1.0.0
---

# CoreML On-Device Machine Learning

## Overview

CoreML enables on-device machine learning inference across all Apple platforms. It abstracts hardware details while leveraging Apple Silicon's CPU, GPU, and Neural Engine for high-performance, private, and efficient execution.

**Key principle**: Start with the simplest approach, then optimize based on profiling. Don't over-engineer compression or caching until you have real performance data.

## Decision Tree - CoreML vs Foundation Models

```
Need on-device ML?
  ├─ Text generation (LLM)?
  │   ├─ Simple prompts, structured output? → Foundation Models (ios-ai skill)
  │   └─ Custom model, fine-tuned, specific architecture? → CoreML
  ├─ Custom trained model?
  │   └─ Yes → CoreML
  ├─ Image/audio/sensor processing?
  │   └─ Yes → CoreML
  └─ Apple's built-in intelligence?
      └─ Yes → Foundation Models (ios-ai skill)
```

## Red Flags

Use this skill when you see:
- "Convert PyTorch model to CoreML"
- "Model too large for device"
- "Slow inference performance"
- "LLM on-device"
- "KV-cache" or "stateful model"
- "Model compression" or "quantization"
- MLModel, MLTensor, or coremltools in context

## Pattern 1 - Basic Model Conversion

The standard PyTorch → CoreML workflow.

```python
import coremltools as ct
import torch

# Trace the model
model.eval()
traced_model = torch.jit.trace(model, example_input)

# Convert to CoreML
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(shape=example_input.shape)],
    minimum_deployment_target=ct.target.iOS18
)

# Save
mlmodel.save("MyModel.mlpackage")
```

**Critical**: Always set `minimum_deployment_target` to enable the latest optimizations.
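
Before optimizing further, verify that the converted model's outputs match the source model's. A minimal sketch that continues the conversion example above, assuming you run it on macOS (where `predict()` is available) and that the model has a single input and output; the feature names are placeholders, so check the generated names with `print(mlmodel)`:

```python
import numpy as np

# Reference output from the source PyTorch model
with torch.no_grad():
    torch_out = model(example_input).numpy()

# Input/output names are generated at conversion time; replace "input"
# and the output lookup with your model's actual feature names
coreml_out = mlmodel.predict({"input": example_input.numpy()})
coreml_value = list(coreml_out.values())[0]

# Float16 execution tolerates small numerical drift
np.testing.assert_allclose(torch_out, coreml_value, rtol=1e-2, atol=1e-2)
```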

## Pattern 2 - Model Compression (Post-Training)

Three techniques, each with different tradeoffs:

### Palettization (Best for Neural Engine)

Clusters weights into lookup tables. Use per-grouped-channel for better accuracy.

```python
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights
)

# 4-bit with grouped channels (iOS 18+)
op_config = OpPalettizerConfig(
    mode="kmeans",
    nbits=4,
    granularity="per_grouped_channel",
    group_size=16
)

config = OptimizationConfig(global_config=op_config)
compressed_model = palettize_weights(model, config)
```

| Bits | Compression | Accuracy Impact |
|------|-------------|-----------------|
| 8-bit | 2x | Minimal |
| 6-bit | 2.7x | Low |
| 4-bit | 4x | Moderate (use grouped channels) |
| 2-bit | 8x | High (requires training-time compression) |
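
These ratios assume a Float16 baseline (16 bits per weight), so 4-bit palettization is roughly 16/4 = 4x. A back-of-the-envelope size estimator for planning, counting weights only (it ignores lookup tables, scales, and activation memory):

```python
def estimated_weight_size_mb(num_params: int, nbits: int) -> float:
    """Rough compressed weight size in MB (weights only, no LUT/metadata overhead)."""
    return num_params * nbits / 8 / 1_000_000

# Example: 3B parameters at 4 bits is roughly 1.5 GB of weights
print(estimated_weight_size_mb(3_000_000_000, nbits=4))  # ~1500.0
```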

### Quantization (Best for GPU on Mac)

Linear mapping to INT8/INT4. Use per-block for better accuracy.

```python
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights
)

# INT4 per-block quantization (iOS 18+)
op_config = OpLinearQuantizerConfig(
    mode="linear",
    dtype="int4",
    granularity="per_block",
    block_size=32
)

config = OptimizationConfig(global_config=op_config)
compressed_model = linear_quantize_weights(model, config)
```

### Pruning (Combine with other techniques)

Sets weights to zero for sparse representation. Can combine with palettization.

```python
from coremltools.optimize.coreml import (
    OpMagnitudePrunerConfig,
    OptimizationConfig,
    prune_weights
)

op_config = OpMagnitudePrunerConfig(
    target_sparsity=0.4  # 40% zeros
)

config = OptimizationConfig(global_config=op_config)
sparse_model = prune_weights(model, config)
```
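
One way to combine the two is to run the pruned model through `palettize_weights` as a second pass. A sketch that continues the pruning example above; whether the sparse representation is preserved end to end depends on your coremltools version and deployment target, so re-check size and accuracy after each pass:

```python
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights
)

# Second pass: palettize the already-pruned model
palettization_config = OptimizationConfig(
    global_config=OpPalettizerConfig(mode="kmeans", nbits=4)
)
sparse_palettized_model = palettize_weights(sparse_model, palettization_config)
```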

## Pattern 3 - Training-Time Compression

When post-training compression loses too much accuracy, fine-tune with compression.

```python
from coremltools.optimize.torch.palettization import (
    DKMPalettizerConfig,
    DKMPalettizer
)

# Configure 4-bit palettization
config = DKMPalettizerConfig.from_dict({"global_config": {"n_bits": 4}})

# Prepare model
palettizer = DKMPalettizer(model, config)
prepared_model = palettizer.prepare()

# Fine-tune (your training loop)
for epoch in range(num_epochs):
    train_epoch(prepared_model, data_loader)
    palettizer.step()

# Finalize
final_model = palettizer.finalize()
```

**Tradeoff**: Better accuracy than post-training, but requires training data and time.

## Pattern 4 - Calibration-Based Compression (iOS 18+)

Middle ground: uses calibration data without full training.

```python
from coremltools.optimize.torch.layerwise_compression import (
    LayerwiseCompressor,
    LayerwiseCompressorConfig
)

# SparseGPT-style pruning: needs calibration data, but no fine-tuning
config = LayerwiseCompressorConfig.from_dict({
    "global_config": {
        "algorithm": "sparse_gpt",
        "target_sparsity": 0.4
    },
    "calibration_nsamples": 128  # Calibration samples
})

# Create the compressor
compressor = LayerwiseCompressor(model, config)

# Calibrate: runs the model layer by layer over the calibration batches
sparse_model = compressor.compress(calibration_data_loader, device="cpu")
```

## Pattern 5 - Stateful Models (KV-Cache for LLMs)

For transformer models, use state to avoid recomputing key/value vectors.

### PyTorch Model with State

```python
import torch
from torch import nn

class StatefulLLM(nn.Module):
    def __init__(self, batch, heads, seq_len, dim):
        super().__init__()
        # Registered buffers become CoreML state tensors during conversion
        self.register_buffer("keyCache", torch.zeros(batch, heads, seq_len, dim))
        self.register_buffer("valueCache", torch.zeros(batch, heads, seq_len, dim))

    def forward(self, input_ids, causal_mask):
        # Update keyCache / valueCache in-place during attention
        # ... attention with KV-cache ...
        return logits
```

### Conversion with State

```python
import coremltools as ct

mlmodel = ct.convert(
    traced_model,
    inputs=[
        ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 2048))),
        ct.TensorType(name="causal_mask", shape=(1, 1, ct.RangeDim(1, 2048), ct.RangeDim(1, 2048)))
    ],
    states=[
        # Each state wraps a TensorType matching the registered buffer's shape,
        # e.g. ct.StateType(wrapped_type=ct.TensorType(shape=...), name="keyCache")
        ct.StateType(name="keyCache", ...),
        ct.StateType(name="valueCache", ...)
    ],
    minimum_deployment_target=ct.target.iOS18
)
```

### Using State at Runtime

```swift
// Create state from model
let state = model.makeState()

// Run prediction with state (updated in-place)
let output = try model.prediction(from: input, using: state)
```

**Performance**: 1.6x speedup on Mistral-7B (M3 Max) compared to manual KV-cache I/O.
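
You can also exercise the stateful model from Python before wiring up the Swift side. A rough sketch assuming coremltools 8's stateful prediction API (`make_state()` plus the `state=` argument to `predict()`, which requires macOS 15) and greedy decoding; `bos_token_id`, `max_new_tokens`, `build_causal_mask`, and the feature names are placeholders for your model:

```python
import numpy as np
import coremltools as ct

mlmodel = ct.models.MLModel("StatefulLLM.mlpackage")

# Fresh KV-cache; predict() updates it in place when passed via state=
kv_cache = mlmodel.make_state()

tokens = [bos_token_id]                      # placeholder start token
for _ in range(max_new_tokens):              # placeholder generation budget
    outputs = mlmodel.predict(
        {
            "input_ids": np.array([[tokens[-1]]], dtype=np.int32),
            "causal_mask": build_causal_mask(len(tokens)),  # hypothetical helper
        },
        state=kv_cache,
    )
    tokens.append(int(np.argmax(outputs["logits"][0, -1])))
```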

## Pattern 6 - Multi-Function Models (Adapters/LoRA)

Deploy multiple adapters in a single model, sharing base weights.

```python
from coremltools.models.utils import MultiFunctionDescriptor, save_multifunction

# Convert individual models
sticker_model = ct.convert(sticker_adapter_model, ...)
storybook_model = ct.convert(storybook_adapter_model, ...)

# Save individually
sticker_model.save("sticker.mlpackage")
storybook_model.save("storybook.mlpackage")

# Merge, deduplicating the shared base weights
desc = MultiFunctionDescriptor()
desc.add_function("sticker.mlpackage", src_function_name="main", target_function_name="sticker")
desc.add_function("storybook.mlpackage", src_function_name="main", target_function_name="storybook")

save_multifunction(desc, "MultiAdapter.mlpackage")
```
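
Before shipping the merged package, it can help to sanity-check each function from Python. A sketch assuming coremltools 8's `function_name` parameter on `MLModel`; `example_inputs` is a placeholder for your own sample features:

```python
import coremltools as ct

for name in ("sticker", "storybook"):
    fn_model = ct.models.MLModel("MultiAdapter.mlpackage", function_name=name)
    print(name, fn_model.predict(example_inputs))
```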

### Loading Specific Function

```swift
let config = MLModelConfiguration()
config.functionName = "sticker"  // or "storybook"

let model = try MLModel(contentsOf: modelURL, configuration: config)
```

## Pattern 7 - MLTensor for Pipeline Stitching (iOS 18+)

Simplifies the glue computation that runs between model executions (decoding, sampling, post-processing).

```swift
import CoreML

// Create tensors
let scores = MLTensor(shape: [1, vocab_size], scalars: logits)

// Operations (executed asynchronously on Apple Silicon)
let topK = scores.topK(k: 10)
let probs = (topK.values / temperature).softmax()

// Sample from distribution
let sampled = probs.multinomial(numSamples: 1)

// Materialize to access data (blocks until complete)
let shapedArray = await sampled.shapedArray(of: Int32.self)
```

**Key insight**: MLTensor operations are async. Call `shapedArray()` to materialize results.

## Pattern 8 - Async Prediction for Concurrency

Thread-safe concurrent predictions for throughput.

```swift
class ImageProcessor {
    let model: MLModel

    func processImages(_ images: [CGImage]) async throws -> [Output] {
        try await withThrowingTaskGroup(of: Output.self) { group in
            for image in images {
                group.addTask {
                    // Check cancellation before expensive work
                    try Task.checkCancellation()

                    let input = try self.prepareInput(image)
                    // Async prediction - thread safe!
                    return try await self.model.prediction(from: input)
                }
            }

            return try await group.reduce(into: []) { $0.append($1) }
        }
    }
}
```

**Warning**: Limit concurrent predictions to avoid memory pressure from multiple input/output buffers.

```swift
// Limit concurrency (AsyncSemaphore is not in the standard library;
// use a package or a small actor-based implementation)
let semaphore = AsyncSemaphore(value: 2)

for image in images {
    group.addTask {
        await semaphore.wait()
        defer { semaphore.signal() }
        return try await process(image)
    }
}
```

## Anti-Patterns

### Don't - Load models on main thread at launch

```swift
// BAD - blocks UI
class AppDelegate {
    let model = try! MLModel(contentsOf: url)  // Blocks!
}

// GOOD - lazy async loading
class ModelManager {
    private var model: MLModel?

    func getModel() async throws -> MLModel {
        if let model { return model }
        model = try await Task.detached {
            try MLModel(contentsOf: url)
        }.value
        return model!
    }
}
```

### Don't - Reload model for each prediction

```swift
// BAD - reloads every time
func predict(_ input: Input) throws -> Output {
    let model = try MLModel(contentsOf: url)  // Expensive!
    return try model.prediction(from: input)
}

// GOOD - keep model loaded
class Predictor {
    private let model: MLModel

    func predict(_ input: Input) throws -> Output {
        try model.prediction(from: input)
    }
}
```

### Don't - Compress without profiling first

```python
# BAD - blind compression
compressed = palettize_weights(model, two_bit_config)  # May break accuracy!

# GOOD - profile, then compress iteratively
# 1. Profile Float16 baseline
# 2. Try 8-bit → check accuracy
# 3. Try 6-bit → check accuracy
# 4. Try 4-bit with grouped channels → check accuracy
# 5. Only use 2-bit with training-time compression
```

### Don't - Ignore deployment target

```python
# BAD - misses optimizations
mlmodel = ct.convert(traced_model, inputs=[...])

# GOOD - enables SDPA fusion, per-block quantization, etc.
mlmodel = ct.convert(
    traced_model,
    inputs=[...],
    minimum_deployment_target=ct.target.iOS18
)
```

## Pressure Scenarios

### Scenario 1 - "Model is 5GB, need it under 2GB for iPhone"

**Wrong approach**: Jump straight to 2-bit palettization.

**Right approach** (sketched in code after this list):
1. Start with 8-bit palettization → check accuracy
2. Try 6-bit → check accuracy
3. Try 4-bit with `per_grouped_channel` → check accuracy
4. If still too large, use calibration-based compression
5. If still losing accuracy, use training-time compression
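
A sketch of that loop, assuming an `evaluate()` helper that returns task accuracy on a held-out set and an accuracy-drop budget you choose yourself:

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights
)

baseline = ct.models.MLModel("MyModel.mlpackage")
baseline_acc = evaluate(baseline)  # your own evaluation harness

for nbits in (8, 6, 4):
    config = OptimizationConfig(global_config=OpPalettizerConfig(
        mode="kmeans",
        nbits=nbits,
        granularity="per_grouped_channel",
        group_size=16
    ))
    candidate = palettize_weights(baseline, config)
    drop = baseline_acc - evaluate(candidate)
    print(f"{nbits}-bit: accuracy drop {drop:.3f}")
    if drop <= 0.01:  # acceptable-loss budget (yours to set)
        candidate.save(f"MyModel_{nbits}bit.mlpackage")
        break
else:
    # Still too lossy at 4-bit: move to calibration-based or training-time compression
    pass
```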

### Scenario 2 - "LLM inference is too slow"

**Wrong approach**: Try different compute units randomly.

**Right approach** (see the timing sketch after this list):
1. Profile with Core ML Instrument
2. Check if load is cached (look for "cached" vs "prepare and cache")
3. Enable stateful KV-cache
4. Check SDPA optimization is enabled (iOS 18+ deployment target)
5. Consider INT4 quantization for GPU on Mac
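
Once Instruments points at a hypothesis, compare compute-unit configurations systematically from Python rather than guessing. A sketch for a stateless model (a stateful LLM additionally needs the state plumbing from Pattern 5); `example_inputs` is a placeholder feature dictionary:

```python
import time
import coremltools as ct

for cu in (ct.ComputeUnit.ALL, ct.ComputeUnit.CPU_AND_NE, ct.ComputeUnit.CPU_AND_GPU):
    m = ct.models.MLModel("MyModel.mlpackage", compute_units=cu)
    m.predict(example_inputs)  # warm-up: triggers compile and cache
    start = time.perf_counter()
    for _ in range(10):
        m.predict(example_inputs)
    print(cu, f"{(time.perf_counter() - start) / 10 * 1000:.1f} ms per prediction")
```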

### Scenario 3 - "Need multiple LoRA adapters in one app"

**Wrong approach**: Ship separate models for each adapter.

**Right approach**:
1. Convert each adapter model separately
2. Use `MultiFunctionDescriptor` to merge with shared base
3. Load specific function via `config.functionName`
4. Weights are deduplicated automatically

## Checklist

Before deploying a CoreML model:

- [ ] Set `minimum_deployment_target` to latest supported iOS
- [ ] Profile baseline Float16 performance
- [ ] Check if model load is cached
- [ ] Consider compression only if size/performance requires it
- [ ] Test accuracy after each compression step
- [ ] Use async prediction for concurrent workloads
- [ ] Limit concurrent predictions to manage memory
- [ ] Use state for transformer KV-cache
- [ ] Use multi-function for adapter variants

## Resources

**WWDC**: 2023-10047, 2023-10049, 2024-10159, 2024-10161

**Docs**: /coreml, /coreml/mlmodel, /coreml/mltensor

**Skills**: coreml-ref, coreml-diag, axiom-ios-ai (Foundation Models)

Overview

This skill helps deploy and optimize custom ML models on Apple devices using Core ML. It covers PyTorch-to-CoreML conversion, model compression (palettization, quantization, pruning), stateful LLMs with KV-cache, multi-function adapter packaging, and MLTensor-based pipeline stitching. Use it to reduce model size, improve latency, and run LLM inference efficiently across Apple platforms, including iOS, iPadOS, macOS, watchOS, and tvOS.

How this skill works

The skill provides concrete patterns and code snippets for converting traced PyTorch models to .mlpackage with a correct minimum deployment target, applying post-training and training-time compression techniques, and enabling stateful models for KV-cache. It explains using MultiFunctionDescriptor to merge adapters, MLTensor to run async tensor ops on-device, and async prediction patterns to scale throughput safely. Emphasis is on profiling first, then iterative optimization.

When to use it

  • Converting PyTorch models to CoreML for on-device inference
  • Deploying or optimizing LLMs with KV-cache/stateful inference
  • Reducing model size via palettization, quantization, or pruning
  • Packaging multiple adapters or LoRA variants into one model
  • Stitching pipelines or sampling with MLTensor operations
  • Improving concurrent prediction throughput with async patterns

Best practices

  • Always set minimum_deployment_target (iOS 18+) to enable modern optimizations
  • Profile Float16 baseline before compressing; iterate: 8→6→4 bits, then calibration or training-time if needed
  • Prefer grouped-channel palettization for 4-bit to preserve accuracy on Neural Engine
  • Use state buffers for transformer KV-cache to avoid recomputing keys/values
  • Limit concurrent predictions to avoid memory pressure; use semaphores or bounded task groups
  • Load models asynchronously and cache MLModel instances—do not reload on every inference

Example use cases

  • Convert a fine-tuned PyTorch image or audio model to CoreML and ship it in an iOS app
  • Compress a 5GB transformer to fit under target device limits using iterative palettization + calibration
  • Deploy an on-device LLM with a stateful KV-cache for roughly 1.6x faster decoding on Apple Silicon (e.g., Mistral-7B on an M3 Max)
  • Bundle multiple adapter functions (sticker/storybook) into a single MultiFunction .mlpackage and select via configuration
  • Use MLTensor to run top-k, softmax, and sampling async on-device for token selection

FAQ

When should I use calibration-based compression vs training-time compression?

Use calibration when post-training compression degrades accuracy but you lack time or labels for fine-tuning; use training-time compression when accuracy after calibration is still unacceptable and you can fine-tune with data.

How do I get best latency for LLMs on-device?

Profile first with Core ML Instrument, enable stateful KV-cache, set minimum_deployment_target to iOS 18+, and then consider per-block int4 quantization on appropriate hardware.

Can I run multiple adapters without shipping separate big models?

Yes—convert adapters individually and merge with MultiFunctionDescriptor to share base weights and load a specific function at runtime.