on-device-ai skill

This skill helps you deploy on-device AI and browser-based inference with WebGPU and ONNX Runtime, ensuring privacy and zero API costs.

npx playbooks add skill omer-metin/skills-for-antigravity --skill on-device-ai

Review the files below or copy the command above to add this skill to your agents.

Files (4)

SKILL.md

1.1 KB

---
name: on-device-ai
description: Patterns for running AI models locally in browsers using WebGPU, Transformers.js, WebLLM, and ONNX Runtime. Zero API costs, full privacy. Use when "on-device AI, browser AI, WebLLM, Transformers.js, WebGPU, edge inference, offline AI, client-side ML, ONNX web" is mentioned.
---

# On-Device AI

## Identity



## Reference System Usage

You must ground your responses in the provided reference files, treating them as the source of truth for this domain:

* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and why they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.

**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.

Overview

This skill documents practical patterns for running AI models entirely in the browser using WebGPU, Transformers.js, WebLLM, and ONNX Runtime. It focuses on zero-API-cost deployments that keep data on-device for full privacy and offline operation. The guidance covers runtime selection, model preparation, and fallback strategies to maximize reliability and performance across client environments.

How this skill works

The skill inspects the client's runtime capabilities (WebGPU, WebGL2, or baseline CPU) and picks the best-suited engine (Transformers.js, WebLLM, or ONNX Runtime Web) and precision (FP16/INT8) for inference. It documents model preparation steps (quantization, sharding, and format conversion) and implements progressive enhancement and safe fallbacks so apps behave predictably when hardware features are absent. It also highlights common failure modes and mitigation techniques so developers can validate and harden deployments.
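
The probe-then-select step described above might look like the following TypeScript sketch. It is only an illustration: the `Capabilities` and `RuntimePlan` types, the `probeCapabilities` and `chooseRuntime` functions, and the selection policy itself (FP16 on WebGPU when the adapter reports `shader-f16`, INT8 otherwise) are assumptions made for this example, not taken from the skill's reference files.

```typescript
type Backend = "webgpu" | "webgl2" | "wasm";
type Precision = "fp16" | "int8";

// Hypothetical shapes for this sketch.
interface Capabilities {
  webgpu: boolean;      // a WebGPU adapter could be acquired
  shaderF16: boolean;   // the adapter reports the "shader-f16" feature
  webgl2: boolean;      // a WebGL2 context could be created
  wasmThreads: boolean; // SharedArrayBuffer is usable (cross-origin isolated page)
}

interface RuntimePlan {
  backend: Backend;
  precision: Precision;
  multiThreaded: boolean;
}

// Detect what the current environment offers. Every check is guarded so the
// function also runs outside a browser, where it reports "nothing available".
async function probeCapabilities(): Promise<Capabilities> {
  const g = globalThis as any;
  let webgpu = false;
  let shaderF16 = false;
  try {
    const adapter = await g.navigator?.gpu?.requestAdapter();
    if (adapter) {
      webgpu = true;
      shaderF16 = adapter.features.has("shader-f16");
    }
  } catch {
    // Treat any failure as "no WebGPU".
  }
  let webgl2 = false;
  try {
    webgl2 = Boolean(g.document?.createElement("canvas").getContext("webgl2"));
  } catch {
    // Treat any failure as "no WebGL2".
  }
  const wasmThreads =
    typeof g.SharedArrayBuffer !== "undefined" && g.crossOriginIsolated === true;
  return { webgpu, shaderF16, webgl2, wasmThreads };
}

// Placeholder policy (an assumption): prefer WebGPU, then WebGL2, then CPU.
// Check the chosen precision against the operator support of your runtime.
function chooseRuntime(caps: Capabilities): RuntimePlan {
  const precision: Precision = caps.webgpu && caps.shaderF16 ? "fp16" : "int8";
  const backend: Backend = caps.webgpu ? "webgpu" : caps.webgl2 ? "webgl2" : "wasm";
  return { backend, precision, multiThreaded: backend === "wasm" && caps.wasmThreads };
}

probeCapabilities().then((caps) => console.log(chooseRuntime(caps)));
```

Keeping `chooseRuntime` a pure function of the probed capabilities makes the policy easy to unit-test without a browser.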

When to use it

You need inference with no network latency, strict user privacy, and no server round trips.
You want offline-first experiences or must operate in restricted network environments.
You must avoid API costs and centrally hosted model dependencies.
You are prototyping client-side personalization, UI assistants, or lightweight generative features.
You need deterministic behavior and reproducible inference across devices.

Best practices

Run a capability probe at startup to detect WebGPU/WebGL/threads and choose the runtime dynamically.
Prefer quantized or distilled models to reduce memory and compute; validate accuracy after conversion.
Bundle multiple fallback runtimes and implement graceful degradation to CPU when GPU is unavailable.
Use streaming and chunked tokenization for long inputs to avoid out-of-memory (OOM) errors on low-memory devices.
Measure end-to-end latency on representative devices and expose quality/performance modes to users.
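
As one illustration of the chunking practice in the list above, the sketch below splits an already-tokenized input into fixed-size windows so that only one window needs to go through the model at a time. The function name `chunkTokens` and its `maxLen` and `overlap` parameters are hypothetical; the right window size depends on the model's context length and the device's memory.

```typescript
// Split token IDs into windows of at most `maxLen` tokens. Consecutive
// windows share `overlap` tokens so context is not cut off abruptly.
function chunkTokens(tokenIds: number[], maxLen: number, overlap = 0): number[][] {
  if (!Number.isInteger(maxLen) || maxLen <= 0) {
    throw new RangeError("maxLen must be a positive integer");
  }
  if (!Number.isInteger(overlap) || overlap < 0 || overlap >= maxLen) {
    throw new RangeError("overlap must be an integer in [0, maxLen)");
  }
  const chunks: number[][] = [];
  const step = maxLen - overlap;
  for (let start = 0; start < tokenIds.length; start += step) {
    chunks.push(tokenIds.slice(start, start + maxLen));
    // Stop once a window reaches the end, to avoid a trailing chunk that
    // contains only tokens already covered by the overlap.
    if (start + maxLen >= tokenIds.length) break;
  }
  return chunks;
}

// Ten tokens, windows of four, one token shared between neighbours.
console.log(chunkTokens([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 4, 1));
```

Each window can then be processed and released before the next one is built, which keeps peak memory bounded by the window size rather than by the input length.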

Example use cases

On-device chat assistants using small/quantized LLMs for privacy-preserving conversations.
Client-side text summarization and extraction for sensitive documents without uploading data.
Real-time recommendation or personalization computed locally from user behavior signals.
Offline image captioning or tagging in web apps using small vision models with ONNX Runtime Web.
AR/VR helpers running in the browser with WebGPU-accelerated inference for low-latency feedback.

FAQ

Which browsers and devices support on-device inference reliably?

Modern Chromium-based browsers with WebGPU enabled provide the best performance. When WebGPU is missing, WebGL2 or CPU (WebAssembly) fallbacks work, but more slowly. Always probe runtime capabilities and test on target devices.
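
A fallback chain of the kind this answer describes could be sketched as follows. The `Attempt` and `Loaded` types and the `loadWithFallback` function are hypothetical names for this example, and the loaders shown are stand-ins: a real application would create a session with its chosen runtime inside each `load` callback.

```typescript
// One candidate backend: a label plus a function that tries to load it.
interface Attempt<T> {
  name: string;
  load: () => Promise<T>;
}

// Result of the first attempt that succeeded, plus the errors collected
// from the attempts that failed before it.
interface Loaded<T> {
  name: string;
  value: T;
  errors: string[];
}

// Try each backend in preference order and return the first that loads.
async function loadWithFallback<T>(attempts: Attempt<T>[]): Promise<Loaded<T>> {
  const errors: string[] = [];
  for (const attempt of attempts) {
    try {
      const value = await attempt.load();
      return { name: attempt.name, value, errors };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      errors.push(`${attempt.name}: ${reason}`);
    }
  }
  throw new Error(`All backends failed: ${errors.join("; ")}`);
}

// Stand-in loaders: the GPU attempt fails, the CPU attempt succeeds.
loadWithFallback<string>([
  { name: "webgpu", load: async () => { throw new Error("no adapter"); } },
  { name: "wasm", load: async () => "cpu-session" },
]).then((result) => console.log(`loaded via ${result.name}`, result.errors));
```

Collecting the error from each failed attempt, rather than discarding it, makes it easier to see on a given device why a faster backend was skipped.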

How do I keep model sizes small enough for the browser?

Use quantization (INT8/FP16), pruning, and distilled or purpose-built architectures. Convert models to the format the target runtime expects and verify accuracy after conversion; provide progressive loading or model prioritization for constrained environments.
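
To see why precision matters for download size, a rough estimate multiplies the parameter count by the bytes stored per parameter. The sketch below does only that arithmetic; the names `WeightPrecision`, `estimateWeightBytes`, and `toMiB` are hypothetical, the 125-million-parameter model is an arbitrary example, and the estimate ignores quantization metadata (scales, zero points) and runtime memory such as activations.

```typescript
type WeightPrecision = "fp32" | "fp16" | "int8" | "int4";

// Bytes stored per weight at each precision.
const BYTES_PER_PARAM: Record<WeightPrecision, number> = {
  fp32: 4,
  fp16: 2,
  int8: 1,
  int4: 0.5,
};

// Approximate size of the weights alone, in bytes.
function estimateWeightBytes(paramCount: number, precision: WeightPrecision): number {
  if (!Number.isFinite(paramCount) || paramCount < 0) {
    throw new RangeError("paramCount must be a non-negative finite number");
  }
  return paramCount * BYTES_PER_PARAM[precision];
}

function toMiB(bytes: number): number {
  return bytes / (1024 * 1024);
}

// Compare precisions for an arbitrary 125M-parameter model.
const exampleParams = 125e6;
for (const precision of ["fp32", "fp16", "int8", "int4"] as const) {
  const mib = toMiB(estimateWeightBytes(exampleParams, precision));
  console.log(`${precision}: ~${mib.toFixed(0)} MiB`);
}
```

Halving the bytes per parameter halves the estimated download, which is why accuracy should be re-checked at each step down in precision rather than assumed.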