---
name: coreml-ref
description: CoreML API reference - MLModel lifecycle, MLTensor operations, coremltools conversion, compression APIs, state management, compute device availability, performance profiling.
version: 1.0.0
---
# CoreML API Reference
## Part 1 - Model Lifecycle
### MLModel Loading
```swift
// Synchronous load (blocks thread)
let model = try MLModel(contentsOf: compiledModelURL)
// Async load (preferred)
let model = try await MLModel.load(contentsOf: compiledModelURL)
// With configuration
let config = MLModelConfiguration()
config.computeUnits = .all // .cpuOnly, .cpuAndGPU, .cpuAndNeuralEngine
let model = try await MLModel.load(contentsOf: url, configuration: config)
```
### Model Asset Types
| Type | Extension | Purpose |
|------|-----------|---------|
| Source | `.mlmodel`, `.mlpackage` | Development, editing |
| Compiled | `.mlmodelc` | Runtime execution |
**Note**: Xcode compiles source models automatically. At runtime, use compiled models.
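If a source model is shipped or downloaded at runtime rather than bundled through Xcode, it must be compiled on-device before loading. A minimal sketch using the async compiler API; `downloadedURL` is a placeholder for wherever the source `.mlmodel`/`.mlpackage` was fetched:
```swift
import CoreML

// Compile a downloaded source model into a runtime .mlmodelc bundle.
// `downloadedURL` is a hypothetical location of a fetched source model.
let compiledURL = try await MLModel.compileModel(at: downloadedURL)

// The compiled bundle lands in a temporary directory; move it somewhere
// persistent before loading if you want to reuse it across launches.
let model = try await MLModel.load(contentsOf: compiledURL)
```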
### Caching Behavior
First load triggers device specialization (can be slow). Subsequent loads use cache.
```
Load flow:
├─ Check cache for (model path + configuration + device)
│  ├─ Found → Cached load (fast)
│  └─ Not found → Device specialization
│     ├─ Parse model
│     ├─ Optimize operations
│     ├─ Segment for compute units
│     ├─ Compile for each unit
│     └─ Cache result
```
Cache invalidated by: system updates, low disk space, model modification.
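Because the first load pays the specialization cost, a common pattern is to start loading early and await the result where it is actually needed. A hedged sketch; `modelURL` is a placeholder for your compiled model's location:
```swift
import CoreML

// Illustrative warm-up: start the (possibly slow) first load early in the app
// lifecycle so device specialization happens off the critical path.
// `modelURL` is a placeholder for your compiled model's URL.
let warmUpTask = Task(priority: .utility) {
    try await MLModel.load(contentsOf: modelURL)
}

// Later, where the model is actually needed:
let model = try await warmUpTask.value
```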
### Multi-Function Models
```swift
// Load specific function
let config = MLModelConfiguration()
config.functionName = "sticker" // Function name from model
let model = try MLModel(contentsOf: url, configuration: config)
```
---
## Part 2 - Compute Availability
### MLComputeDevice (iOS 17+)
```swift
// Check available compute devices
let devices = MLModel.availableComputeDevices

// Check for Neural Engine
let hasNeuralEngine = devices.contains { device in
    if case .neuralEngine = device { return true }
    return false
}

// Inspect each device
for device in devices {
    switch device {
    case .cpu:
        print("CPU available")
    case .gpu(let gpu):
        print("GPU: \(gpu.metalDevice.name)")
    case .neuralEngine(let ne):
        print("Neural Engine: \(ne.totalCoreCount) cores")
    @unknown default:
        break
    }
}
```
### MLModelConfiguration.ComputeUnits
| Value | Behavior |
|-------|----------|
| `.all` | Best performance (default) |
| `.cpuOnly` | CPU only |
| `.cpuAndGPU` | Exclude Neural Engine |
| `.cpuAndNeuralEngine` | Exclude GPU |
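One way to combine the device query from Part 2 with these options is to restrict compute units based on the hardware that is present. The policy below is illustrative, not a recommendation; `.all` is usually the right default:
```swift
import CoreML

// Illustrative policy: prefer CPU + Neural Engine when an ANE exists,
// otherwise fall back to CPU + GPU. Most apps should simply use `.all`.
func makeConfiguration() -> MLModelConfiguration {
    let config = MLModelConfiguration()
    let hasNeuralEngine = MLModel.availableComputeDevices.contains {
        if case .neuralEngine = $0 { return true }
        return false
    }
    config.computeUnits = hasNeuralEngine ? .cpuAndNeuralEngine : .cpuAndGPU
    return config
}
```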
---
## Part 3 - Prediction APIs
### Synchronous Prediction
```swift
// Single prediction (NOT thread-safe)
let output = try model.prediction(from: input)
// Batch prediction
let outputs = try model.predictions(fromBatch: batch)
```
### Async Prediction (iOS 17+)
```swift
// Single prediction (thread-safe, supports concurrency)
let output = try await model.prediction(from: input)
// With cancellation
let output = try await withTaskCancellationHandler {
    try await model.prediction(from: input)
} onCancel: {
    // Prediction will be cancelled
}
```
### State-Based Prediction
```swift
// Create state from model
let state = model.makeState()
// Prediction with state (state updated in-place)
let output = try model.prediction(from: input, using: state)
// Async with state
let output = try await model.prediction(from: input, using: state)
```
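For autoregressive models, state-based prediction is typically wrapped in a decoding loop that reuses a single state (e.g. a KV-cache) across calls. A hedged sketch; `makeInput(for:)`, the `"logits"` feature name, `argmax`, `startToken`, and `maxNewTokens` are hypothetical stand-ins for a real model's interface:
```swift
// Illustrative decoding loop: one MLState (e.g. a KV-cache) is reused across
// predictions. The helpers and feature names below are placeholders.
let state = model.makeState()
var tokens = [startToken]

for _ in 0..<maxNewTokens {
    let input = try makeInput(for: tokens)                                    // hypothetical helper
    let output = try await model.prediction(from: input, using: state)
    guard let logits = output.featureValue(for: "logits")?.multiArrayValue else { break }
    tokens.append(argmax(logits))                                             // hypothetical helper
}
```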
---
## Part 4 - MLTensor (iOS 18+)
### Creating Tensors
```swift
import CoreML
// From MLShapedArray
let shapedArray = MLShapedArray<Float>(scalars: [1, 2, 3, 4], shape: [2, 2])
let tensor = MLTensor(shapedArray)
// From nested collections
let tensor = MLTensor([[1.0, 2.0], [3.0, 4.0]])
// Zeros/ones
let zeros = MLTensor(zeros: [3, 3], scalarType: Float.self)
```
### Math Operations
```swift
// Element-wise
let sum = tensor1 + tensor2
let product = tensor1 * tensor2
let scaled = tensor * 2.0
// Reductions
let mean = tensor.mean()
let sum = tensor.sum()
let max = tensor.max()
// Comparison
let mask = tensor .> mean // Boolean mask
// Softmax
let probs = tensor.softmax()
```
### Indexing and Reshaping
```swift
// Slicing (Python-like syntax)
let row = tensor[0] // First row
let col = tensor[.all, 0] // First column
let slice = tensor[0..<2, 1..<3]
// Reshaping
let reshaped = tensor.reshaped(to: [4])
let expanded = tensor.expandingShape(at: 0)
```
### Materialization
**Critical**: Tensor operations run asynchronously on a compute device. Materialize the result to access its data.
```swift
// Materialize to MLShapedArray (suspends until the computation completes)
let array = await tensor.shapedArray(of: Float.self)
// Access scalars
let values = array.scalars
```
---
## Part 5 - Core ML Tools (Python)
### Basic Conversion
```python
import coremltools as ct
import torch
# Trace PyTorch model
model.eval()
traced = torch.jit.trace(model, example_input)
# Convert
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example_input.shape)],
    outputs=[ct.TensorType(name="output")],
    minimum_deployment_target=ct.target.iOS18
)
mlmodel.save("Model.mlpackage")
```
### Dynamic Shapes
```python
# Fixed shape
ct.TensorType(shape=(1, 3, 224, 224))
# Range dimension
ct.TensorType(shape=(1, ct.RangeDim(1, 2048)))
# Enumerated shapes
ct.TensorType(shape=ct.EnumeratedShapes(shapes=[(1, 256), (1, 512), (1, 1024)]))
```
### State Types
```python
# For stateful models (KV-cache)
states = [
    ct.StateType(
        name="keyCache",
        wrapped_type=ct.TensorType(shape=(1, 32, 2048, 128))
    ),
    ct.StateType(
        name="valueCache",
        wrapped_type=ct.TensorType(shape=(1, 32, 2048, 128))
    )
]
mlmodel = ct.convert(traced, inputs=inputs, states=states, ...)
```
---
## Part 6 - Compression APIs (coremltools.optimize)
### Post-Training Palettization
```python
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights
)
# Per-tensor (iOS 17+)
config = OpPalettizerConfig(mode="kmeans", nbits=4)
# Per-grouped-channel (iOS 18+, better accuracy)
config = OpPalettizerConfig(
    mode="kmeans",
    nbits=4,
    granularity="per_grouped_channel",
    group_size=16
)
opt_config = OptimizationConfig(global_config=config)
compressed = palettize_weights(model, opt_config)
```
### Post-Training Quantization
```python
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights
)
# INT8 per-channel (iOS 17+)
config = OpLinearQuantizerConfig(mode="linear", dtype="int8")
# INT4 per-block (iOS 18+)
config = OpLinearQuantizerConfig(
    mode="linear",
    dtype="int4",
    granularity="per_block",
    block_size=32
)
opt_config = OptimizationConfig(global_config=config)
compressed = linear_quantize_weights(model, opt_config)
```
### Post-Training Pruning
```python
from coremltools.optimize.coreml import (
    OpMagnitudePrunerConfig,
    OptimizationConfig,
    prune_weights
)
config = OpMagnitudePrunerConfig(target_sparsity=0.5)
opt_config = OptimizationConfig(global_config=config)
sparse = prune_weights(model, opt_config)
```
### Training-Time Palettization (PyTorch)
```python
from coremltools.optimize.torch.palettization import (
    DKMPalettizerConfig,
    DKMPalettizer
)
config = DKMPalettizerConfig(global_config={"n_bits": 4})
palettizer = DKMPalettizer(model, config)
# Prepare (inserts palettization layers)
prepared = palettizer.prepare()
# Training loop
for epoch in range(epochs):
    train_one_epoch(prepared, data_loader)
    palettizer.step()
# Finalize
final = palettizer.finalize()
```
### Calibration-Based Compression
```python
from coremltools.optimize.torch.pruning import (
    MagnitudePrunerConfig,
    LayerwiseCompressor
)
config = MagnitudePrunerConfig(
    target_sparsity=0.4,
    n_samples=128
)
compressor = LayerwiseCompressor(model, config)
compressed = compressor.compress(calibration_loader)
```
---
## Part 7 - Multi-Function Models
### Merging Models
```python
from coremltools.models import MultiFunctionDescriptor
from coremltools.models.utils import save_multifunction
# Create descriptor
desc = MultiFunctionDescriptor()
desc.add_function("function_a", "model_a.mlpackage")
desc.add_function("function_b", "model_b.mlpackage")
# Merge (deduplicates shared weights)
save_multifunction(desc, "merged.mlpackage")
```
### Inspecting Functions (Xcode)
Open model in Xcode → Predictions tab → Functions listed above inputs.
---
## Part 8 - Performance Profiling
### MLComputePlan (iOS 17.4+)
```swift
let plan = try await MLComputePlan.load(contentsOf: modelURL, configuration: MLModelConfiguration())

// Inspect operations in an ML program model
guard case .program(let program) = plan.modelStructure,
      let mainFunction = program.functions["main"] else { return }

for operation in mainFunction.block.operations {
    let usage = plan.deviceUsage(for: operation)
    let cost = plan.estimatedCost(of: operation)
    print("Op: \(operation.operatorName)")
    print("  Preferred: \(String(describing: usage?.preferred))")
    print("  Estimated cost: \(String(describing: cost?.weight))")
}
```
### Xcode Performance Reports
1. Open model in Xcode
2. Select Performance tab
3. Click + to create report
4. Select device and compute units
5. Click "Run Test"
**New in iOS 18**: Shows estimated time per operation, compute device support hints.
### Core ML Instrument
```
Instruments → Core ML template
├─ Load events: "cached" vs "prepare and cache"
├─ Prediction intervals
├─ Compute unit usage
└─ Neural Engine activity
```
---
## Part 9 - Deployment Targets
| Target | Key Features |
|--------|--------------|
| iOS 16 | Weight compression (palettization, quantization, pruning) |
| iOS 17 | Async prediction, MLComputeDevice, activation quantization |
| iOS 18 | MLTensor, State, SDPA fusion, per-block quantization, multi-function |
**Recommendation**: Set `minimum_deployment_target=ct.target.iOS18` when your app can require iOS 18; it unlocks the newest optimizations (MLTensor, states, per-block quantization, multi-function models).
---
## Part 10 - Conversion Pass Pipelines
```python
# Default pipeline
mlmodel = ct.convert(traced, ...)
# With palettization support
mlmodel = ct.convert(
    traced,
    pass_pipeline=ct.PassPipeline.DEFAULT_PALETTIZATION,
    ...
)
```
## Resources
**WWDC**: 2023-10047, 2023-10049, 2024-10159, 2024-10161
**Docs**: /coreml, /coreml/mlmodel, /coreml/mltensor, /documentation/coremltools
**Skills**: coreml, coreml-diag