
This skill helps you diagnose CoreML load, performance, memory, and accuracy issues across devices, guiding targeted fixes and optimizations.

npx playbooks add skill charleswiltgen/axiom --skill coreml-diag

---
name: coreml-diag
description: CoreML diagnostics - model load failures, slow inference, memory issues, compression accuracy loss, compute unit problems, conversion errors.
version: 1.0.0
---

# CoreML Diagnostics

## Quick Reference

| Symptom | First Check | Pattern |
|---------|-------------|---------|
| Model won't load | Deployment target | 1a-1c |
| Slow first load | Cache miss | 2a |
| Slow inference | Compute units | 2b-2c |
| High memory | Concurrent predictions | 3a-3b |
| Bad accuracy after compression | Granularity | 4a-4c |
| Conversion fails | Operation support | 5a-5b |

## Decision Tree

```
CoreML issue
├─ Load failure?
│   ├─ "Unsupported model version" → 1a
│   ├─ "Failed to create compute plan" → 1b
│   └─ Other load error → 1c
├─ Performance issue?
│   ├─ First load slow, subsequent fast? → 2a
│   ├─ All predictions slow? → 2b
│   └─ Slow only on specific device? → 2c
├─ Memory issue?
│   ├─ Memory grows during predictions? → 3a
│   └─ Out of memory on load? → 3b
├─ Accuracy degraded?
│   ├─ After palettization? → 4a
│   ├─ After quantization? → 4b
│   └─ After pruning? → 4c
└─ Conversion issue?
    ├─ Operation not supported? → 5a
    └─ Wrong output? → 5b
```

---

## Pattern 1a - "Unsupported model version"

**Symptom**: Model fails to load with version error.

**Cause**: Model compiled for newer OS than device supports.

**Diagnosis**:
```python
# Check the model's minimum deployment target; skip_model_load avoids
# compiling the model just to read its spec
import coremltools as ct
model = ct.models.MLModel("Model.mlpackage", skip_model_load=True)
print(model.get_spec().specificationVersion)
```

| Spec Version | Minimum iOS |
|--------------|-------------|
| 4 | iOS 13 |
| 5 | iOS 14 |
| 6 | iOS 15 |
| 7 | iOS 16 |
| 8 | iOS 17 |
| 9 | iOS 18 |

**Fix**: Re-convert with lower deployment target:
```python
mlmodel = ct.convert(
    traced,
    minimum_deployment_target=ct.target.iOS16  # Lower target
)
```

**Tradeoff**: Loses newer optimizations (SDPA fusion, per-block quantization, MLTensor).

---

## Pattern 1b - "Failed to create compute plan"

**Symptom**: Model loads on some devices but not others.

**Cause**: Unsupported operations for target compute unit.

**Diagnosis**:
1. Open model in Xcode
2. Create Performance Report
3. Check "Unsupported" operations
4. Hover for hints

**Fix**:
```swift
// Force CPU-only to bypass unsupported GPU/NE operations
let config = MLModelConfiguration()
config.computeUnits = .cpuOnly
let model = try MLModel(contentsOf: url, configuration: config)
```

**Better fix**: Update model precision or operations during conversion:
```python
# Float16 often better supported
mlmodel = ct.convert(traced, compute_precision=ct.precision.FLOAT16)
```

---

## Pattern 1c - General Load Failures

**Symptom**: Model fails to load with unclear error.

**Checklist**:
1. Check file exists and is readable
2. Check compiled vs source model (runtime needs `.mlmodelc`)
3. Check available disk space (cache needs room)
4. Check model isn't corrupted (re-convert)

```swift
// Surface the underlying load error instead of guessing; if only a
// source .mlmodel is bundled, compile it first (runtime loads .mlmodelc)
do {
    let compiledURL = try await MLModel.compileModel(at: modelURL)  // iOS 16+
    _ = try MLModel(contentsOf: compiledURL)
} catch {
    print("Core ML load failed: \(error)")
}
```

---

## Pattern 2a - Slow First Load (Cache Miss)

**Symptom**: First prediction after install/update is slow, subsequent are fast.

**Cause**: Device specialization not cached.

**Diagnosis**:
1. Profile with Core ML Instrument
2. Look at Load event label:
   - "prepare and cache" = cache miss (slow)
   - "cached" = cache hit (fast)

**Why cache misses**:
- First launch after install
- System update invalidated cache
- Low disk space cleared cache
- Model file was modified

**Mitigation**:
```swift
// Warm the cache in the background at app launch
Task.detached(priority: .background) {
    _ = try? await MLModel.load(contentsOf: modelURL, configuration: MLModelConfiguration())
}
```

**Note**: Cache is tied to (model path + configuration + device). Different configs = different cache entries.

---

## Pattern 2b - All Predictions Slow

**Symptom**: Predictions consistently slow, not just first one.

**Diagnosis**:
1. Create Xcode Performance Report
2. Check compute unit distribution
3. Look for high-cost operations

**Common causes**:

| Cause | Fix |
|-------|-----|
| Running on CPU when GPU/NE available | Check `computeUnits` config |
| Model too large for Neural Engine | Compress model |
| Frequent CPU↔GPU↔NE transfers | Adjust segmentation |
| Dynamic shapes recompiling | Use fixed/enumerated shapes (see sketch below) |
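
For the dynamic-shape case, converting with an enumerated set of shapes lets Core ML specialize once per shape up front instead of recompiling at prediction time. A minimal coremltools sketch (shape values are illustrative; `traced` is a traced PyTorch module as in the earlier examples):

```python
import coremltools as ct

# Enumerate only the shapes you actually use; Core ML pre-specializes each
shapes = ct.EnumeratedShapes(shapes=[(1, 3, 256, 256), (1, 3, 512, 512)])
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input", shape=shapes)],
)
```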

**Profile compute unit usage**:
```swift
// MLComputePlan (iOS 17.4+/macOS 14.4+) reports per-op device dispatch
let plan = try await MLComputePlan.load(contentsOf: modelURL, configuration: MLModelConfiguration())
if case .program(let program) = plan.modelStructure,
   let main = program.functions["main"] {
    for operation in main.block.operations {
        if let usage = plan.computeDeviceUsage(for: operation) {
            print("\(operation.operatorName): \(usage.preferred)")
        }
    }
}
```

---

## Pattern 2c - Slow on Specific Device

**Symptom**: Fast on Mac, slow on iPhone (or vice versa).

**Cause**: Different hardware characteristics.

**Diagnosis**:
```swift
// Check available compute
let devices = MLModel.availableComputeDevices
print(devices)  // Different per device
```

**Common issues**:

| Scenario | Cause | Fix |
|----------|-------|-----|
| Fast on M-series Mac, slow on iPhone | Model optimized for GPU | Use palettization (Neural Engine) |
| Fast on iPhone, slow on Intel Mac | No Neural Engine | Use quantization (GPU) |
| Slow on older devices | Less compute power | Use more aggressive compression |

**Recommendation**: Profile on target devices, not just development Mac.

---

## Pattern 3a - Memory Grows During Predictions

**Symptom**: Memory increases with each prediction, doesn't release.

**Cause**: Input/output buffers accumulating from concurrent predictions.

**Diagnosis**:
```
Instruments → Allocations + Core ML template
Look for: Many concurrent prediction intervals
Check: MLMultiArray allocations growing
```

**Fix**: Limit concurrent predictions:
```swift
/// Caps in-flight predictions so input/output buffers can't accumulate.
actor PredictionLimiter {
    private let maxConcurrent = 2
    private var inFlight = 0
    private var waiters: [CheckedContinuation<Void, Never>] = []

    func predict(_ model: MLModel, input: MLFeatureProvider) async throws -> MLFeatureProvider {
        if inFlight >= maxConcurrent {
            // Park this caller; a finishing prediction hands over its slot
            await withCheckedContinuation { waiters.append($0) }
        } else {
            inFlight += 1
        }
        defer {
            if waiters.isEmpty {
                inFlight -= 1
            } else {
                waiters.removeFirst().resume()  // transfer the slot directly
            }
        }
        // Async prediction API, iOS 17+
        return try await model.prediction(from: input, options: MLPredictionOptions())
    }
}
```

---

## Pattern 3b - Out of Memory on Load

**Symptom**: App crashes or model fails to load on memory-constrained devices.

**Cause**: Model too large for device memory.

**Diagnosis**:
```bash
# Check model size
ls -lh Model.mlpackage/Data/com.apple.CoreML/weights/
```

**Fix options**:

| Approach | Compression | Memory Impact |
|----------|-------------|---------------|
| 8-bit palettization | 2x smaller | 2x less memory |
| 4-bit palettization | 4x smaller | 4x less memory |
| Pruning (50%) | ~2x smaller | ~2x less memory |

**Note**: Compressed weights are decompressed just-in-time (iOS 17+), so smaller on-disk = smaller in memory.
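
As a sketch, post-training 4-bit palettization with `coremltools.optimize.coreml` (file names are illustrative):

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

mlmodel = ct.models.MLModel("Model.mlpackage")
config = OptimizationConfig(global_config=OpPalettizerConfig(mode="kmeans", nbits=4))
compressed = palettize_weights(mlmodel, config)  # ~4x smaller than float16 weights
compressed.save("Model_4bit.mlpackage")
```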

---

## Pattern 4a - Bad Accuracy After Palettization

**Symptom**: Model output degraded after palettization.

**Diagnosis**:
1. What bit depth? (2-bit most likely to fail)
2. What granularity? (per-tensor loses more than per-grouped-channel)

**Fix progression**:

```python
from coremltools.optimize.coreml import OpPalettizerConfig

# Step 1: Try grouped channels (iOS 18+)
config = OpPalettizerConfig(
    nbits=4,
    granularity="per_grouped_channel",
    group_size=16,
)

# Step 2: If still bad, try more bits
config = OpPalettizerConfig(nbits=6, ...)

# Step 3: If 4-bit is required, use training-time calibration
from coremltools.optimize.torch.palettization import DKMPalettizer
# ... differentiable k-means (DKM) fine-tuning ...
```

**Key insight**: 4-bit per-tensor has only 16 clusters for entire weight matrix. Grouped channels = 16 clusters per 16 channels = much better granularity.
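
The arithmetic, spelled out (the channel count is illustrative):

```python
# 4-bit palettization: 2**4 = 16 lookup-table (LUT) entries
nbits, n_channels, group_size = 4, 512, 16

luts_per_tensor = 1                      # one 16-entry LUT for the whole tensor
luts_grouped = n_channels // group_size  # 32 LUTs, each tuned to just 16 channels
print(2**nbits, luts_per_tensor, luts_grouped)  # 16 1 32
```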

---

## Pattern 4b - Bad Accuracy After Quantization

**Symptom**: Model output degraded after INT8/INT4 quantization.

**Diagnosis**:
1. What bit depth?
2. What granularity?

**Fix progression**:

```python
from coremltools.optimize.coreml import OpLinearQuantizerConfig

# Step 1: Use per-block granularity (iOS 18+)
config = OpLinearQuantizerConfig(
    dtype="int4",
    granularity="per_block",
    block_size=32,
)

# Step 2: Use calibration data (GPTQ-style). LayerwiseCompressor lives in
# layerwise_compression (not .quantization) and takes its own config;
# exact field names vary by coremltools version
from coremltools.optimize.torch.layerwise_compression import (
    LayerwiseCompressor,
    LayerwiseCompressorConfig,
)

gptq_config = LayerwiseCompressorConfig.from_dict(
    {"global_config": {"algorithm": "gptq", "weight_dtype": "int4"}}
)
compressor = LayerwiseCompressor(model, gptq_config)
quantized = compressor.compress(calibration_loader, device="cpu")
```

**Note**: INT4 quantization works best on Mac GPU. For Neural Engine, prefer palettization.

---

## Pattern 4c - Bad Accuracy After Pruning

**Symptom**: Model output degraded after weight pruning.

**Diagnosis**:
1. What sparsity level?
2. Post-training or training-time?

**Thresholds** (model-dependent):
- 0-30% sparsity: Usually safe
- 30-50% sparsity: May need calibration
- 50%+ sparsity: Usually needs training-time pruning (see the sketch after the fix below)

**Fix**:
```python
# Calibration-based (SparseGPT-style) pruning; the compressor lives in
# layerwise_compression (not .pruning), and field names vary by version
from coremltools.optimize.torch.layerwise_compression import (
    LayerwiseCompressor,
    LayerwiseCompressorConfig,
)

config = LayerwiseCompressorConfig.from_dict(
    {
        "global_config": {"algorithm": "sparse_gpt", "target_sparsity": 0.4},
        "calibration_nsamples": 128,
    }
)
compressor = LayerwiseCompressor(model, config)
sparse = compressor.compress(calibration_loader, device="cpu")
```
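
For the 50%+ regime, training-time magnitude pruning is the usual path. A minimal sketch of the `coremltools.optimize.torch.pruning` flow, with a toy model and loop so it runs end to end (all values are illustrative):

```python
import torch
from coremltools.optimize.torch.pruning import MagnitudePruner, MagnitudePrunerConfig

# Toy stand-ins for a real model and fine-tuning setup
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()

config = MagnitudePrunerConfig.from_dict({"global_config": {"target_sparsity": 0.5}})
pruner = MagnitudePruner(model, config)
model = pruner.prepare()

for _ in range(100):                 # stand-in fine-tuning loop
    x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    pruner.step()                    # advances the sparsity schedule

model = pruner.finalize()            # bakes pruning masks into the weights
```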

---

## Pattern 5a - Operation Not Supported

**Symptom**: Conversion fails with unsupported operation error.

**Diagnosis**:
```
Error: "Op 'custom_op' is not supported for conversion"
```

**Options**:

1. **Check if op is in coremltools**: May need newer version
```bash
pip install --upgrade coremltools
```

2. **Use composite ops**: Split into supported primitives
```python
# Example: rewrite an unsupported op from supported primitives before
# tracing, e.g. hardswish(x) = x * torch.clamp(x + 3, 0, 6) / 6
```

3. **Register custom op**: Advanced, requires MIL programming
```python
from coremltools.converters.mil import Builder as mb
from coremltools.converters.mil.frontend.torch.torch_op_registry import register_torch_op

@register_torch_op
def custom_op(context, node):
    # Map the torch node to MIL operations via mb.* builders
    ...
```

---

## Pattern 5b - Conversion Succeeds but Wrong Output

**Symptom**: Model converts but predictions differ from PyTorch.

**Diagnosis checklist**:

1. **Input normalization**: Ensure preprocessing matches
```python
# PyTorch often uses ImageNet normalization; approximate it in Core ML
# via ImageType scale/bias (0.226 ~ shared per-channel std):
#   ct.ImageType(shape=(1, 3, 224, 224), scale=1/(255*0.226),
#                bias=[-0.485/0.226, -0.456/0.226, -0.406/0.226])
```

2. **Shape ordering**: PyTorch (NCHW) vs CoreML (NHWC for some ops)
```python
# Check shapes in conversion
ct.convert(..., inputs=[ct.ImageType(shape=(1, 3, 224, 224))])
```

3. **Precision differences**: Float16 may differ from Float32
```python
# Force Float32 to match PyTorch
ct.convert(..., compute_precision=ct.precision.FLOAT32)
```

4. **Train-mode ops**: Dropout and batch norm behave differently outside eval mode
```python
# Ensure eval mode
model.eval()
```

**Debug**:
```python
# Compare final outputs; if they diverge, bisect the model layer by layer
import numpy as np

torch_output = model(input).detach().numpy()
coreml_output = mlmodel.predict({"input": input.numpy()})["output"]

print(f"Max diff: {np.max(np.abs(torch_output - coreml_output))}")
```

---

## Pressure Scenario - "Model works on simulator but not device"

**Wrong approach**: Assume simulator bug, ignore.

**Right approach**:
1. Check model spec version vs device iOS version (Pattern 1a)
2. Check compute unit availability (Pattern 2c)
3. Profile on actual device, not simulator
4. Simulator uses host Mac's GPU/CPU, not device Neural Engine

---

## Pressure Scenario - "Ship now, optimize later"

**Wrong approach**: Compress to smallest possible size without testing.

**Right approach**:
1. Ship Float16 baseline first
2. Profile on target devices
3. Apply compression incrementally with accuracy testing (see the sketch below)
4. Document compression settings for future optimization
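
A minimal sketch of that incremental loop, stepping down palettization bit depth only while accuracy holds (`eval_accuracy` is an assumed project-specific callable, and the 0.01 tolerance is illustrative):

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

baseline = ct.models.MLModel("Model.mlpackage")
floor = eval_accuracy(baseline) - 0.01      # tolerate at most one point of drop
best = baseline
for nbits in (8, 6, 4):
    cfg = OptimizationConfig(global_config=OpPalettizerConfig(mode="kmeans", nbits=nbits))
    candidate = palettize_weights(baseline, cfg)
    if eval_accuracy(candidate) >= floor:
        best = candidate                    # keep the smaller model
    else:
        break                               # stop before accuracy collapses
best.save("Model_shipped.mlpackage")
```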

---

## Diagnostic Checklist

When CoreML isn't working:

- [ ] Check deployment target matches device iOS
- [ ] Check model file is compiled (.mlmodelc)
- [ ] Profile load: cached vs uncached
- [ ] Profile prediction: which compute units
- [ ] Check memory: concurrent predictions limited
- [ ] For compression issues: try higher granularity
- [ ] For conversion issues: check op support, precision

## Resources

**WWDC**: 2023-10047, 2023-10049, 2024-10159, 2024-10161

**Docs**: /coreml, /coreml/mlmodel

**Skills**: coreml, coreml-ref

## Overview

This skill provides a compact diagnostic guide for Core ML problems across Apple platforms: model load failures, slow inference, memory issues, compression accuracy loss, and conversion errors. It maps observable symptoms to concrete checks, profiling steps, and practical fixes you can apply during conversion or at runtime in your app. The content focuses on reproducible diagnostics and safe mitigations for device-specific behavior.

## How this skill works

The skill inspects error messages, device compute availability, model specification version, and runtime behavior (load vs. prediction). It recommends targeted profiling (Xcode Performance Reports, the Core ML Instrument), conversion flags (precision, deployment target, quantization/palettization configs), and runtime configuration (computeUnits, cache warm-up, prediction concurrency limits). For conversion errors, it walks through op-support checks and debugging strategies.

## When to use it

- Model fails to load on a device or shows "unsupported model version" errors
- Predictions are slow on the first run or consistently slow across devices
- App crashes or runs out of memory when loading or predicting
- Model accuracy degrades after quantization, palettization, or pruning
- Conversion fails with unsupported ops, or outputs differ from the PyTorch reference

## Best practices

- Verify the model spec version against the target OS before shipping; re-convert with a lower minimum_deployment_target if needed
- Profile on real target devices, not just the simulator; collect Performance Reports and Core ML Instrument traces
- Warm the Core ML cache at background priority on first launch to avoid a slow first prediction
- Limit concurrent in-flight predictions to avoid accumulating I/O buffers and MLMultiArray allocations
- Apply compression progressively: Float16 baseline first, then test palettization/quantization with calibration and per-block or grouped-channel granularity
- When conversion errors occur, upgrade coremltools, decompose custom ops, or register MIL mappings rather than guessing at fixes

## Example use cases

- App crashes loading a model on older iPhones: check the spec version and re-convert with a lower deployment target
- First prediction is slow after an app update: add a background MLModel.load to warm the cache
- Predictions slow on device but fast on Mac: inspect compute unit mapping and consider palettization/quantization for the target hardware
- Memory grows during bursts of predictions: implement a prediction limiter actor to cap concurrency
- Accuracy drops after 4-bit compression: switch to per_grouped_channel or increase bit depth and recalibrate

## FAQ

**Why does the simulator behave differently from a device?**

The simulator uses the host Mac's CPU/GPU and lacks the device Neural Engine; always profile on target hardware to catch compute-unit and precision differences.

**My conversion succeeds but outputs differ from PyTorch. Where do I start?**

Check preprocessing/normalization, input shape ordering (NCHW vs. NHWC), and precision (Float16 vs. Float32). Compare layer outputs, and use calibration data when applying quantization.