This skill helps you deploy on-device AI and browser-based inference with WebGPU and ONNX Runtime, ensuring privacy and zero API costs.
npx playbooks add skill omer-metin/skills-for-antigravity --skill on-device-ai
---
name: on-device-ai
description: Patterns for running AI models locally in browsers using WebGPU, Transformers.js, WebLLM, and ONNX Runtime. Zero API costs, full privacy. Use when "on-device AI, browser AI, WebLLM, Transformers.js, WebGPU, edge inference, offline AI, client-side ML, ONNX web" mentioned.
---
# On-Device AI
## Identity
## Reference System Usage
You must ground your responses in the provided reference files, treating them as the source of truth for this domain:
* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and why they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.
**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.
This skill documents practical patterns for running AI models entirely in the browser using WebGPU, Transformers.js, WebLLM, and ONNX Runtime. It focuses on zero-API-cost deployments that keep data on-device for full privacy and offline operation. The guidance covers runtime selection, model preparation, and fallback strategies to maximize reliability and performance across client environments.
The skill inspects the client runtime capabilities (WebGPU, WebGL2, or baseline CPU) and picks the most suitable engine (Transformers.js, WebLLM, or ONNX Runtime Web) and precision (FP16/INT8) for inference. It documents model preparation steps (quantization, sharding, and format conversion) and implements progressive enhancement and safe fallbacks so apps behave predictably when hardware features are absent. It also highlights common failure modes and mitigation techniques so developers can validate and harden deployments.
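The selection step described above can be sketched as a small pure function. This is a minimal illustration, not the skill's actual decision table: the `Capabilities` shape, the `choosePlan` name, the backend labels, and the FP32 middle tier are all assumptions made for the example.

```typescript
// Hypothetical capability snapshot; field names are illustrative only.
interface Capabilities {
  webgpu: boolean;    // a WebGPU adapter was obtained
  shaderF16: boolean; // the adapter reports the "shader-f16" feature
  webgl2: boolean;    // a WebGL2 context could be created
}

type Backend = "webgpu" | "webgl" | "wasm";
type Precision = "fp16" | "fp32" | "int8";

interface Plan {
  backend: Backend;
  precision: Precision;
}

// Walk the tiers from most to least capable and return the first that fits.
// The precision chosen per tier is an assumption, not a measured optimum.
function choosePlan(caps: Capabilities): Plan {
  if (caps.webgpu) {
    // Half precision only when the GPU exposes f16 shader support.
    return { backend: "webgpu", precision: caps.shaderF16 ? "fp16" : "fp32" };
  }
  if (caps.webgl2) {
    return { backend: "webgl", precision: "fp32" };
  }
  // Baseline CPU path: prefer a quantized model to limit download and memory.
  return { backend: "wasm", precision: "int8" };
}
```

Keeping the decision separate from the probing code means every fallback tier can be unit-tested without a browser.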
Which browsers and devices support on-device inference reliably?
Modern Chromium-based browsers with WebGPU provide the best performance. When WebGPU is missing, WebGL2 or CPU fallbacks work but with reduced speed. Always probe runtime capabilities and test on target devices.
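Probing for WebGPU might look like the sketch below. It relies on `navigator.gpu.requestAdapter()` and the adapter's `features` set from the WebGPU API; the `NavigatorLike` and `AdapterLike` interfaces and the `probeWebGPU` name are hypothetical, introduced so the function can be exercised outside a browser. A WebGL2 check would need a canvas and is left out.

```typescript
// Minimal structural types (assumed for this example) covering only the
// parts of the WebGPU API that the probe touches.
interface AdapterLike {
  features: { has(name: string): boolean };
}
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<AdapterLike | null> };
}

interface WebGpuProbe {
  webgpu: boolean;
  shaderF16: boolean;
}

// Resolves to { webgpu: false, ... } instead of throwing, so callers can
// fall through to a slower path without extra error handling.
async function probeWebGPU(nav: NavigatorLike): Promise<WebGpuProbe> {
  const unavailable: WebGpuProbe = { webgpu: false, shaderF16: false };
  if (!nav.gpu) return unavailable; // API not exposed in this browser
  try {
    const adapter = await nav.gpu.requestAdapter();
    if (!adapter) return unavailable; // API present but no usable adapter
    return { webgpu: true, shaderF16: adapter.features.has("shader-f16") };
  } catch {
    return unavailable; // treat any probing failure as "not supported"
  }
}
```

In a page this would be called as `probeWebGPU(navigator as NavigatorLike)`. Note that `navigator.gpu` can be defined while `requestAdapter()` still resolves to `null`, which is why both checks are present.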
How do I keep model sizes small enough for the browser?
Use quantization (INT8/FP16), pruning, and distilled or purpose-built architectures. Convert models to the format your target runtime expects and verify accuracy after conversion. In constrained environments, load models progressively or prioritize which models load first.
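To see roughly what quantization buys, weight storage can be estimated as parameter count times bits per weight, divided by eight. The helper below is a back-of-the-envelope sketch: the function names and the budget check are illustrative, and the estimate ignores quantization scales, tokenizer files, and runtime memory overhead.

```typescript
type Precision = "fp32" | "fp16" | "int8";

// Bits used to store one weight at each precision.
const BITS_PER_WEIGHT: Record<Precision, number> = {
  fp32: 32,
  fp16: 16,
  int8: 8,
};

// Approximate size of the weights alone, in bytes.
function estimateWeightBytes(paramCount: number, precision: Precision): number {
  return Math.ceil((paramCount * BITS_PER_WEIGHT[precision]) / 8);
}

// True when the estimated weights fit within a download or memory budget.
function fitsBudget(
  paramCount: number,
  precision: Precision,
  budgetBytes: number
): boolean {
  return estimateWeightBytes(paramCount, precision) <= budgetBytes;
}
```

For a hypothetical 100-million-parameter model this gives about 400 MB at FP32, 200 MB at FP16, and 100 MB at INT8 (decimal megabytes), so INT8 fits a 150 MB budget where FP16 does not.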