wake-word-detection skill

/skills/wake-word-detection

This skill implements privacy-preserving wake word detection using openWakeWord for offline, low-resource JARVIS activation, with tests written in a TDD workflow.

npx playbooks add skill martinholovsky/claude-skills-generator --skill wake-word-detection

---
name: wake-word-detection
risk_level: MEDIUM
description: "Expert skill for implementing wake word detection with openWakeWord. Covers audio monitoring, keyword spotting, privacy protection, and efficient always-listening systems for JARVIS voice assistant."
model: sonnet
---

# Wake Word Detection Skill

## 1. Overview

**Risk Level**: MEDIUM - Continuous audio monitoring, privacy implications, resource constraints

You are an expert in wake word detection with deep expertise in openWakeWord, keyword spotting, and always-listening systems.

**Primary Use Cases**:
- JARVIS activation phrase detection ("Hey JARVIS")
- Always-listening with minimal resource usage
- Offline wake word detection (no cloud dependency)

---

## 2. Core Principles

- **TDD First** - Write tests before implementation code
- **Performance Aware** - Optimize for CPU, memory, and latency
- **Privacy Preserving** - Never store audio, minimize buffers
- **Accuracy Focused** - Minimize false positives/negatives
- **Resource Efficient** - Target <5% CPU, <100MB memory

---

## 3. Core Responsibilities

### 3.1 Privacy-First Monitoring

- **Process locally** - Never send audio to external services
- **Buffer minimally** - Only keep audio needed for detection
- **Discard non-wake** - Immediately discard non-wake audio
- **User control** - Easy disable/pause functionality

### 3.2 Efficiency Requirements

- Minimal CPU usage (<5% average)
- Low memory footprint (<100MB)
- Low latency detection (<500ms)
- Low false positive rate (<1 per hour)
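
The targets above are easier to enforce when they live in one place and measurements are normalized to the same units. The following is a minimal sketch, not part of the original skill: the `Budget` class and both helper functions are hypothetical names introduced for illustration, and the measurements passed in are assumed to come from your own profiling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Targets copied from the efficiency requirements above."""
    max_cpu_percent: float = 5.0
    max_memory_mb: float = 100.0
    max_latency_ms: float = 500.0
    max_false_positives_per_hour: float = 1.0


def false_positives_per_hour(false_triggers: int, observed_seconds: float) -> float:
    """Normalize a raw false-trigger count to an hourly rate."""
    if observed_seconds <= 0:
        raise ValueError("observed_seconds must be positive")
    return false_triggers * 3600.0 / observed_seconds


def violations(budget, cpu_percent, memory_mb, latency_ms, fp_per_hour):
    """Return the names of every target that the measurements miss."""
    checks = {
        "cpu": cpu_percent < budget.max_cpu_percent,
        "memory": memory_mb < budget.max_memory_mb,
        "latency": latency_ms < budget.max_latency_ms,
        "false_positives": fp_per_hour < budget.max_false_positives_per_hour,
    }
    return [name for name, ok in checks.items() if not ok]


# Three false triggers over a four-hour soak test is 0.75 per hour.
rate = false_positives_per_hour(false_triggers=3, observed_seconds=4 * 3600)
print(violations(Budget(), cpu_percent=3.2, memory_mb=80, latency_ms=420, fp_per_hour=rate))
```

Reporting the full list of missed targets, rather than failing on the first one, makes a single soak test more informative.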

---

## 4. Technical Foundation

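```text
# requirements.txt
openwakeword>=0.6.0
numpy>=1.24.0
sounddevice>=0.4.6
onnxruntime>=1.16.0
structlog   # logging in the secure detector pattern
webrtcvad   # VAD preprocessing pattern
# Test-only dependencies
pytest
pytest-cov
psutil
```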

---

## 5. Implementation Workflow (TDD)

### Step 1: Write Failing Test First

```python
# tests/test_wake_word.py
import pytest
import numpy as np
from unittest.mock import Mock, patch

class TestWakeWordDetector:
    """TDD tests for wake word detection."""

    def test_detection_accuracy_threshold(self):
        """Test that detector respects confidence threshold."""
        from wake_word import SecureWakeWordDetector

        detector = SecureWakeWordDetector(threshold=0.7)
        callback = Mock()
        test_audio = np.random.randn(16000).astype(np.float32)

        with patch.object(detector.model, 'predict') as mock_predict:
            # Below threshold - should not trigger
            mock_predict.return_value = {"hey_jarvis": np.array([0.5])}
            detector._test_process(test_audio, callback)
            callback.assert_not_called()

            # Above threshold - should trigger
            mock_predict.return_value = {"hey_jarvis": np.array([0.8])}
            detector._test_process(test_audio, callback)
            callback.assert_called_once()

    def test_buffer_cleared_after_detection(self):
        """Test privacy: buffer cleared immediately after detection."""
        from wake_word import SecureWakeWordDetector

        detector = SecureWakeWordDetector()
        detector.audio_buffer.extend(np.zeros(16000))

        with patch.object(detector.model, 'predict') as mock_predict:
            mock_predict.return_value = {"hey_jarvis": np.array([0.9])}
            detector._process_audio()

        assert len(detector.audio_buffer) == 0, "Buffer must be cleared"

    def test_cpu_usage_under_threshold(self):
        """Test CPU usage stays under 5%."""
        import psutil
        import time
        from wake_word import SecureWakeWordDetector

        detector = SecureWakeWordDetector()
        process = psutil.Process()
        process.cpu_percent()  # Prime the counter; the first call always returns 0.0
        start_time = time.time()

        while time.time() - start_time < 10:
            audio = np.random.randn(1600).astype(np.float32)
            detector.audio_buffer.extend(audio)
            if len(detector.audio_buffer) >= 16000:
                detector._process_audio()
            time.sleep(0.1)  # Pace like a live stream: one 100 ms chunk per 100 ms

        # Usage since the priming call, normalized across all cores
        avg_cpu = process.cpu_percent() / psutil.cpu_count()
        assert avg_cpu < 5, f"CPU usage too high: {avg_cpu}%"

    def test_memory_footprint(self):
        """Test memory usage stays under 100MB."""
        import tracemalloc
        from wake_word import SecureWakeWordDetector

        tracemalloc.start()
        detector = SecureWakeWordDetector()

        for _ in range(600):
            audio = np.random.randn(1600).astype(np.float32)
            detector.audio_buffer.extend(audio)

        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()

        peak_mb = peak / 1024 / 1024
        assert peak_mb < 100, f"Memory too high: {peak_mb}MB"
```

### Step 2: Implement Minimum to Pass

```python
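# wake_word.py
from collections import deque

import numpy as np
from openwakeword.model import Model


class SecureWakeWordDetector:
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.model = Model(wakeword_models=["hey_jarvis"])
        self.audio_buffer = deque(maxlen=24000)  # 1.5 s at 16 kHz

    def _test_process(self, audio, callback):
        """Score one audio window; fire the callback on the first hit."""
        predictions = self.model.predict(audio)
        for model_name, scores in predictions.items():
            score = float(np.max(scores))
            if score > self.threshold:
                self.audio_buffer.clear()  # Privacy: drop audio before notifying
                callback(model_name, score)
                break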
```

### Step 3: Run Full Verification

```bash
pytest tests/test_wake_word.py -v
pytest --cov=wake_word --cov-report=term-missing
```

---

## 6. Implementation Patterns

### Pattern 1: Secure Wake Word Detector

```python
from openwakeword.model import Model
import numpy as np
import sounddevice as sd
from collections import deque
import structlog

logger = structlog.get_logger()

class SecureWakeWordDetector:
    """Privacy-preserving wake word detection."""

    def __init__(self, model_path: str = None, threshold: float = 0.5, sample_rate: int = 16000):
        if model_path:
            self.model = Model(wakeword_models=[model_path])
        else:
            self.model = Model(wakeword_models=["hey_jarvis"])

        self.threshold = threshold
        self.sample_rate = sample_rate
        self.buffer_size = int(sample_rate * 1.5)
        self.audio_buffer = deque(maxlen=self.buffer_size)
        self.is_listening = False
        self.on_wake = None

    def start(self, callback):
        """Start listening for wake word."""
        self.on_wake = callback
        self.is_listening = True

        def audio_callback(indata, frames, time, status):
            if not self.is_listening:
                return
            audio = indata[:, 0] if len(indata.shape) > 1 else indata
            self.audio_buffer.extend(audio)
            if len(self.audio_buffer) >= self.sample_rate:
                self._process_audio()

        self.stream = sd.InputStream(
            samplerate=self.sample_rate, channels=1, dtype=np.float32,
            callback=audio_callback, blocksize=int(self.sample_rate * 0.1)
        )
        self.stream.start()

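    def _process_audio(self):
        """Process audio buffer for wake word."""
        audio = np.fromiter(self.audio_buffer, dtype=np.float32, count=len(self.audio_buffer))
        # openWakeWord expects 16-bit PCM, so rescale the float32 stream from [-1, 1]
        pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
        predictions = self.model.predict(pcm)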

        for model_name, scores in predictions.items():
            if np.max(scores) > self.threshold:
                self.audio_buffer.clear()  # Privacy: clear immediately
                if self.on_wake:
                    self.on_wake(model_name, np.max(scores))
                break

    def stop(self):
        """Stop listening."""
        self.is_listening = False
        if hasattr(self, 'stream'):
            self.stream.stop()
            self.stream.close()
        self.audio_buffer.clear()
```

### Pattern 2: False Positive Reduction

```python
import time

import numpy as np


class RobustDetector:
    """Reduce false positives with confirmation."""

    def __init__(self, detector: SecureWakeWordDetector):
        self.detector = detector
        self.detection_history = []
        self.confirmation_window = 2.0
        self.min_confirmations = 2

    def on_potential_wake(self, model: str, confidence: float):
        now = time.time()
        self.detection_history.append({"time": now, "confidence": confidence})
        self.detection_history = [d for d in self.detection_history if now - d["time"] < self.confirmation_window]

        if len(self.detection_history) >= self.min_confirmations:
            avg_confidence = np.mean([d["confidence"] for d in self.detection_history])
            if avg_confidence > 0.6:
                self.detection_history.clear()
                return True
        return False
```
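
The confirmation logic is easier to test when the clock is injected rather than read from `time.time()`. Below is a minimal, dependency-free sketch of the same idea. The class name `ConfirmationWindow` and its `observe` method are hypothetical and not part of the skill or of openWakeWord; the defaults mirror the values used in `RobustDetector`.

```python
from collections import deque


class ConfirmationWindow:
    """Confirm a wake event only after repeated hits inside a short window."""

    def __init__(self, window_s=2.0, min_hits=2, min_avg_confidence=0.6):
        self.window_s = window_s
        self.min_hits = min_hits
        self.min_avg_confidence = min_avg_confidence
        self._hits = deque()  # (timestamp, confidence) pairs, oldest first

    def observe(self, timestamp: float, confidence: float) -> bool:
        """Record one candidate detection; return True once it is confirmed."""
        self._hits.append((timestamp, confidence))
        # Drop hits that have aged out of the window.
        while self._hits and timestamp - self._hits[0][0] >= self.window_s:
            self._hits.popleft()
        if len(self._hits) < self.min_hits:
            return False
        average = sum(c for _, c in self._hits) / len(self._hits)
        if average > self.min_avg_confidence:
            self._hits.clear()  # Start fresh so one phrase fires only once
            return True
        return False


window = ConfirmationWindow()
print(window.observe(0.0, 0.9))  # False: a single hit is not enough
print(window.observe(0.5, 0.8))  # True: two strong hits within 2 seconds
```

Passing timestamps explicitly lets a test replay hours of detections instantly, which helps when validating the false positive target.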

---

## 7. Performance Patterns

### Pattern 1: Model Quantization

```python
# Good - Load a quantized (e.g. INT8) ONNX model with explicit session options
import onnxruntime as ort

class QuantizedDetector:
    def __init__(self, model_path: str):
        # model_path should point at an already-quantized model file;
        # the session options below do not quantize anything themselves.
        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        self.session = ort.InferenceSession(model_path, sess_options, providers=['CPUExecutionProvider'])

# Bad - Full precision model
class SlowDetector:
    def __init__(self, model_path: str):
        self.session = ort.InferenceSession(model_path)  # Full-precision weights, default options
```

### Pattern 2: Efficient Audio Buffering

```python
import numpy as np

# Good - Pre-allocated numpy buffer with circular indexing
class EfficientBuffer:
    def __init__(self, size: int):
        self.buffer = np.zeros(size, dtype=np.float32)
        self.write_idx = 0
        self.size = size

    def append(self, audio: np.ndarray):
        audio = audio[-self.size:]  # Oversized chunk: keep only the newest samples
        n = len(audio)
        if n == 0:
            return
        first = min(n, self.size - self.write_idx)  # Samples that fit before wrap-around
        self.buffer[self.write_idx:self.write_idx + first] = audio[:first]
        self.buffer[:n - first] = audio[first:]  # Wrapped remainder (may be empty)
        self.write_idx = (self.write_idx + n) % self.size

# Bad - Individual appends
class SlowBuffer:
    def __init__(self):
        self.buffer = []

    def append(self, audio: np.ndarray):
        for sample in audio:  # Slow!
            self.buffer.append(sample)
```
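
A circular buffer also needs a way to read samples back in chronological order before inference, which the pattern above leaves out. The sketch below shows one way to do that. It is an illustration only: `RingBuffer` is a hypothetical name, and it uses the standard library's `array` module as a stand-in for a NumPy array so that it runs without dependencies. The same index arithmetic applies to a pre-allocated NumPy buffer.

```python
from array import array


class RingBuffer:
    """Fixed-size circular buffer that can return samples oldest-first."""

    def __init__(self, size: int):
        self.size = size
        self.data = array("f", [0.0] * size)  # Pre-allocated float32 storage
        self.write_idx = 0
        self.filled = 0  # Number of valid samples, capped at size

    def append(self, samples):
        chunk = array("f", samples)[-self.size:]  # Oversized chunk: keep the newest
        n = len(chunk)
        first = min(n, self.size - self.write_idx)  # Samples that fit before wrapping
        self.data[self.write_idx:self.write_idx + first] = chunk[:first]
        self.data[:n - first] = chunk[first:]  # Wrapped remainder (may be empty)
        self.write_idx = (self.write_idx + n) % self.size
        self.filled = min(self.size, self.filled + n)

    def read(self):
        """Return the valid samples in chronological order."""
        if self.filled < self.size:
            return list(self.data[:self.filled])
        return list(self.data[self.write_idx:]) + list(self.data[:self.write_idx])

    def clear(self):
        """Zero the storage so no audio lingers after a detection."""
        self.data = array("f", [0.0] * self.size)
        self.write_idx = 0
        self.filled = 0


rb = RingBuffer(4)
rb.append([1, 2, 3])
rb.append([4, 5])
print(rb.read())  # [2.0, 3.0, 4.0, 5.0]
```

Overwriting the storage in `clear` rather than only resetting the indices keeps the buffer consistent with the privacy goal of discarding audio immediately.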

### Pattern 3: VAD Preprocessing

```python
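# Good - Skip inference on silence
import numpy as np
import webrtcvad

class VADOptimizedDetector:
    FRAME_SAMPLES = 480  # webrtcvad accepts only 10, 20 or 30 ms frames (30 ms at 16 kHz)

    def __init__(self):
        self.vad = webrtcvad.Vad(2)  # Aggressiveness from 0 (least) to 3 (most)
        self.detector = SecureWakeWordDetector()

    def _has_speech(self, audio: np.ndarray) -> bool:
        pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
        usable = len(pcm) - len(pcm) % self.FRAME_SAMPLES  # Ignore a trailing partial frame
        return any(
            self.vad.is_speech(pcm[i:i + self.FRAME_SAMPLES].tobytes(), 16000)
            for i in range(0, usable, self.FRAME_SAMPLES)
        )

    def process(self, audio: np.ndarray):
        if not self._has_speech(audio):
            return None  # Skip expensive inference
        self.detector.audio_buffer.extend(audio)  # Make the chunk visible to the detector
        return self.detector._process_audio()

# Bad - Always run inference
class WastefulDetector:
    def process(self, audio: np.ndarray):
        return self.detector._process_audio()  # Even on silence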
```
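
Where installing `webrtcvad` is not an option, a simple energy gate can serve as a rough substitute for the same "skip inference on silence" idea. Note that this is a different and cruder technique than WebRTC VAD: it reacts to loudness, not speech. The names `rms` and `EnergyGate` and the default threshold are assumptions chosen for illustration and would need tuning against real microphone input.

```python
import math


def rms(samples) -> float:
    """Root-mean-square level of a chunk of float samples in [-1.0, 1.0]."""
    if len(samples) == 0:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class EnergyGate:
    """Pass chunks to the detector only when they are louder than a floor."""

    def __init__(self, threshold: float = 0.01, hangover_chunks: int = 3):
        self.threshold = threshold
        self.hangover_chunks = hangover_chunks  # Keep the gate open briefly after sound
        self._remaining = 0

    def should_run_inference(self, samples) -> bool:
        if rms(samples) >= self.threshold:
            self._remaining = self.hangover_chunks
            return True
        if self._remaining > 0:
            self._remaining -= 1  # Trailing quiet chunk may still hold the phrase ending
            return True
        return False


gate = EnergyGate(threshold=0.01, hangover_chunks=1)
silence = [0.0] * 160
loud = [0.5, -0.5] * 80
print(gate.should_run_inference(silence))  # False
print(gate.should_run_inference(loud))     # True
```

The hangover counter avoids cutting off the quiet tail of a wake phrase, at the cost of a few extra inferences after each burst of sound.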

### Pattern 4: Batch Inference

```python
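import numpy as np

# Good - Process multiple windows in single inference
class BatchDetector:
    def __init__(self, batch_predict, batch_size: int = 4):
        # batch_predict: callable taking an array of shape (batch, samples).
        # Assumption: you supply it yourself (e.g. a wrapper around an ONNX session);
        # openWakeWord's Model is not known to provide a predict_batch method.
        self.batch_predict = batch_predict
        self.batch_size = batch_size
        self.pending_windows = []

    def add_window(self, audio: np.ndarray):
        self.pending_windows.append(audio)
        if len(self.pending_windows) >= self.batch_size:
            batch = np.stack(self.pending_windows)  # Windows must share one length
            results = self.batch_predict(batch)
            self.pending_windows.clear()
            return results
        return None  # Still collecting; note that batching delays detection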
```
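
Batching trades latency for throughput, so the batch size has to fit inside the 500 ms detection target. The arithmetic can be sketched as follows. Both function names are hypothetical, the 100 ms hop is taken from the stream `blocksize` used in the secure detector pattern, and the 50 ms inference time is a made-up figure to be replaced with a measured one.

```python
def worst_case_batch_delay_ms(batch_size: int, hop_ms: float) -> float:
    """Extra wait for the first window of a batch before inference starts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return (batch_size - 1) * hop_ms


def max_batch_size(latency_budget_ms: float, hop_ms: float, inference_ms: float) -> int:
    """Largest batch whose queueing delay plus inference fits the budget."""
    spare = latency_budget_ms - inference_ms
    if spare < 0:
        return 0  # Even a single window misses the budget
    return int(spare // hop_ms) + 1


# With 100 ms hops, a batch of 4 makes the oldest window wait 300 ms.
print(worst_case_batch_delay_ms(batch_size=4, hop_ms=100))  # 300
# Assuming 50 ms per batched inference, at most 5 windows fit in 500 ms.
print(max_batch_size(latency_budget_ms=500, hop_ms=100, inference_ms=50))  # 5
```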

### Pattern 5: Memory-Mapped Models

```python
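# Good - Memory-map large model files
import mmap

class MmapModelLoader:
    def __init__(self, model_path: str):
        self.file = open(model_path, 'rb')
        self.mmap = mmap.mmap(self.file.fileno(), 0, access=mmap.ACCESS_READ)

    def close(self):
        """Release the mapping and the underlying file handle."""
        self.mmap.close()
        self.file.close()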

# Bad - Load entire model into memory
class EagerModelLoader:
    def __init__(self, model_path: str):
        with open(model_path, 'rb') as f:
            self.model_data = f.read()  # Entire model in RAM
```

---

## 8. Security Standards

```python
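import time

class PrivacyController:
    """Ensure privacy in always-listening system."""

    def __init__(self):
        self.is_enabled = True
        self.last_activity = time.time()

    def _is_dnd_enabled(self) -> bool:
        # Placeholder: wire this to the host's do-not-disturb setting
        return False

    def check_privacy_mode(self) -> bool:
        """Return True only when listening is currently allowed."""
        if self._is_dnd_enabled():
            return False
        if time.time() - self.last_activity > 3600:  # Idle for over an hour
            return False
        return self.is_enabled

# Data minimization
MAX_BUFFER_SECONDS = 2.0
def on_wake_detected(audio_buffer):
    audio_buffer.clear()  # Delete immediately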
```

---

## 9. Common Mistakes

```python
# BAD - Stores all audio
def on_audio(chunk):
    with open("audio.raw", "ab") as f:
        f.write(chunk)

# GOOD - Discard after processing
def on_audio(chunk):
    buffer.extend(chunk)
    process_buffer()

# BAD - Large buffer
buffer = deque(maxlen=sample_rate * 60)  # 1 minute!

# GOOD - Minimal buffer (maxlen must be an int, so convert explicitly)
buffer = deque(maxlen=int(sample_rate * 1.5))  # 1.5 seconds
```
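
The bounded buffer is what enforces the retention limit, so it is worth seeing that old samples really do fall out on their own. A minimal sketch follows. The helper name `make_audio_buffer` is hypothetical, and the constants match the 16 kHz sample rate and 1.5 second window used elsewhere in this guide.

```python
from collections import deque

SAMPLE_RATE = 16000          # Hz, as used throughout this guide
MAX_BUFFER_SECONDS = 1.5     # Retention ceiling for raw audio


def make_audio_buffer(sample_rate: int = SAMPLE_RATE,
                      seconds: float = MAX_BUFFER_SECONDS) -> deque:
    """Create a buffer that can never hold more than `seconds` of audio."""
    # deque requires an integer maxlen, so convert explicitly.
    return deque(maxlen=int(sample_rate * seconds))


buffer = make_audio_buffer()
buffer.extend([0.0] * 40000)   # Feed 2.5 s of audio
print(len(buffer))             # 24000: only the newest 1.5 s remain
```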

---

## 10. Pre-Implementation Checklist

### Phase 1: Before Writing Code

- [ ] Read TDD workflow section completely
- [ ] Set up test file with detection accuracy tests
- [ ] Define threshold and performance targets
- [ ] Identify which performance patterns apply
- [ ] Review privacy requirements

### Phase 2: During Implementation

- [ ] Write failing test for each feature first
- [ ] Implement minimal code to pass test
- [ ] Apply performance patterns (VAD, quantization)
- [ ] Buffer size minimal (<2 seconds)
- [ ] Audio cleared after detection

### Phase 3: Before Committing

- [ ] All tests pass: `pytest tests/test_wake_word.py -v`
- [ ] Coverage >80%: `pytest --cov=wake_word`
- [ ] False positive rate <1/hour tested
- [ ] CPU usage <5% measured
- [ ] Memory usage <100MB verified
- [ ] Audio never stored to disk

---

## 11. Summary

Your goal is to create wake word detection that is:
- **Private**: Audio processed locally, minimal retention
- **Efficient**: Low CPU (<5%), low memory (<100MB)
- **Accurate**: Low false positive rate (<1/hour)
- **Test-Driven**: All features have tests first

**Critical Reminders**:
1. Write tests before implementation
2. Never store audio to disk
3. Keep buffer minimal (<2 seconds)
4. Apply performance patterns (VAD, quantization)

Overview

This skill implements privacy-first wake word detection using openWakeWord and proven always-listening patterns for a JARVIS-style assistant. It focuses on low-latency, low-resource keyword spotting that runs offline and clears audio immediately to preserve privacy. The workflow is test-driven and optimized for CPU, memory, and false positive reduction.

How this skill works

The detector captures audio from a local input stream, buffers a short sliding window, and runs a lightweight model inference when enough audio accumulates. It applies optimizations such as VAD checks, batch or quantized inference, and memory-mapped model loading to minimize CPU and RAM. On detection the buffer is cleared immediately and a callback is invoked; no raw audio is stored or sent externally.
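
The flow described above (buffer, score, clear, notify) can be sketched without any audio hardware or model. This is an illustration under stated assumptions, not the skill's implementation: `WakePipeline` is a hypothetical name, and `score_fn` stands in for the real model call, taking a window of samples and returning a confidence between 0 and 1.

```python
from collections import deque


class WakePipeline:
    """Buffer chunks, score the window, clear on detection, notify."""

    def __init__(self, score_fn, on_wake, threshold=0.5,
                 window_samples=16000, max_samples=24000):
        self.score_fn = score_fn          # Hypothetical: window -> confidence in [0, 1]
        self.on_wake = on_wake
        self.threshold = threshold
        self.window_samples = window_samples
        self.buffer = deque(maxlen=max_samples)  # Bounded, so old audio falls out

    def feed(self, chunk) -> bool:
        """Add one chunk; return True if this chunk completed a detection."""
        self.buffer.extend(chunk)
        if len(self.buffer) < self.window_samples:
            return False                  # Not enough audio yet
        score = self.score_fn(list(self.buffer))
        if score <= self.threshold:
            return False
        self.buffer.clear()               # Privacy: drop audio before notifying
        self.on_wake(score)
        return True


# Toy scorer: treat any sample above 0.5 as the wake word.
events = []
pipeline = WakePipeline(
    score_fn=lambda window: 0.9 if max(window) > 0.5 else 0.1,
    on_wake=events.append,
    window_samples=4,
    max_samples=6,
)
pipeline.feed([0.0, 0.0])
pipeline.feed([0.0, 0.0])
pipeline.feed([1.0, 0.0])
print(events)  # [0.9]
```

Clearing the buffer before invoking the callback means the handler never has access to the raw audio that triggered it.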

When to use it

  • Offline activation phrase detection for a local assistant (e.g., "Hey JARVIS")
  • Always-listening systems where privacy and local processing are required
  • Resource-constrained devices needing <5% CPU and <100MB memory
  • Prototyping or deploying wake-word models with TDD and verification
  • Environments where cloud audio streaming is unacceptable for compliance

Best practices

  • Start with tests: write failing tests for detection, privacy, and performance
  • Keep audio buffer minimal (<= 1.5–2.0 seconds) and clear immediately after use
  • Run VAD to skip inference for silence and reduce CPU usage
  • Use quantized or optimized ONNX models and memory-map large files
  • Confirm detections with a short confirmation window to reduce false positives

Example use cases

  • Trigger a local skill or command processor when user says the activation phrase
  • Always-listening kiosk or home assistant that must never send audio off-device
  • Embedded systems or Raspberry Pi deployments requiring strict resource caps
  • Automated tests that validate detection thresholds, CPU, and memory budgets
  • Integrating with a higher-level confirmation layer to lower false alarms

FAQ

How is user privacy preserved?

Audio is processed locally, buffers are minimal, and raw audio is cleared immediately after detection; nothing is stored or forwarded.

How do I reduce false positives?

Combine confidence thresholds with short confirmation windows, average recent detections, and run VAD to avoid spurious triggers on noise.