home / skills / aaaaqwq / agi-super-skills / model-fallback

model-fallback skill

safe

/skills/ai-multimodal/model-fallback

This skill automatically downgrades to backup models when the primary fails, ensuring continuous service with intelligent routing and retrying.

npx playbooks add skill aaaaqwq/agi-super-skills --skill model-fallback

Review the files below or copy the command above to add this skill to your agents.

Files (6)

SKILL.md

11.7 KB

---
name: model-fallback
description: 模型自动降级与故障切换。当主模型请求失败、超时、达到速率限制或配额耗尽时，自动切换到备用模型，确保服务连续性。支持多供应商、多优先级的智能模型选择，提供健康监控、自动重试和错误恢复机制。
allowed-tools: Bash, Read, Write, Edit, Exec
---

# 模型自动降级与故障切换

## 概述

本技能提供完整的模型自动降级和故障切换解决方案，确保在主模型不可用时自动切换到备用模型，维持 AI 助手的持续服务能力。

## 核心功能

### 1. 智能模型选择

根据任务类型和模型状态智能选择最合适的模型：

```yaml
优先级顺序:
  1. anapi/opus-4.5         # 最强能力，最高优先级
  2. zai/glm-4.7            # 中文优化，性价比高
  3. openrouter-vip/gpt-5.2-codex  # 编码专用
  4. github-copilot/claude-sonnet-4-5  # 免费备用
```

### 2. 故障检测与处理

自动检测以下故障并触发切换：

- **超时错误** (timeout): 请求超过设定时间
- **速率限制** (rate_limit): API 调用频率超限
- **配额耗尽** (quota_exceeded): token 配额用尽
- **认证错误** (authentication): API 密钥失效
- **服务不可用** (service_unavailable): 供应商服务故障
- **网络错误** (network_error): 连接失败

### 3. 重试机制

智能重试策略：

```yaml
重试配置:
  最大重试次数: 3
  初始延迟: 1000ms
  最大延迟: 10000ms
  退避倍数: 2.0
  使用备用模型: true
```

重试时间线：
```
第1次失败 → 等待 1s → 重试
第2次失败 → 等待 2s → 重试
第3次失败 → 等待 4s → 切换模型
```

### 4. 健康监控

持续监控所有模型供应商的健康状态：

```bash
# 每5分钟检查一次
检查项目:
  - API 端点连通性
  - 响应时间
  - 错误率
  - 配额使用情况
```

## 快速开始

### 1. 配置模型降级

配置文件：`~/.openclaw/agents/main/agent/agent.json`

```json
{
  "model": "anapi/opus-4.5",
  "modelFallback": [
    "zai/glm-4.7",
    "openrouter-vip/gpt-5.2-codex",
    "github-copilot/claude-sonnet-4-5"
  ],
  "retry": {
    "maxAttempts": 3,
    "initialDelayMs": 1000,
    "maxDelayMs": 10000,
    "backoffMultiplier": 2.0,
    "useFallbackOnFailure": true
  }
}
```

### 2. 启动监控

```bash
# 启动后台监控
~/.openclaw/scripts/monitor-models.sh start

# 查看监控状态
~/.openclaw/scripts/monitor-models.sh status

# 停止监控
~/.openclaw/scripts/monitor-models.sh stop
```

### 3. 手动触发切换

```bash
# 运行降级检查脚本
~/.openclaw/scripts/model-fallback.sh
```

## 工作流程

### 正常请求流程

```
用户请求
    ↓
选择主模型 (anapi/opus-4.5)
    ↓
发送请求到 API
    ↓
成功 → 返回结果
```

### 故障切换流程

```
用户请求
    ↓
选择当前模型
    ↓
发送请求 → 失败/超时
    ↓
重试 (最多3次)
    ↓
仍然失败?
    ↓
切换到备用模型 (zai/glm-4.7)
    ↓
发送请求
    ↓
成功 → 返回结果并记录
```

### 监控流程

```
监控守护进程 (每5分钟)
    ↓
检查所有模型健康状态
    ↓
├─ 主模型健康 → 保持当前
└─ 主模型不健康 → 切换到最佳可用模型
    ↓
更新状态文件
    ↓
记录日志
```

## 错误处理策略

### 1. 超时错误 (Timeout)

```json
{
  "timeout": {
    "switchModel": true,
    "retryCount": 2,
    "timeoutMs": 60000,
    "fallbackTo": "zai/glm-4.7"
  }
}
```

**行为**：
- 超过 60 秒视为超时
- 重试 2 次后仍超时则切换模型

### 2. 速率限制 (Rate Limit)

```json
{
  "rateLimit": {
    "switchModel": true,
    "cooldownMs": 60000,
    "alert": true,
    "fallbackTo": "zai/glm-4.7"
  }
}
```

**行为**：
- 收到 429 错误码
- 冷却 60 秒后尝试恢复
- 立即切换到备用模型

### 3. 配额耗尽 (Quota Exceeded)

```json
{
  "quotaExceeded": {
    "switchModel": true,
    "alert": true,
    "fallbackTo": "zai/glm-4.7",
    "checkInterval": 3600000
  }
}
```

**行为**：
- 配额用尽时切换模型
- 每小时检查一次主模型是否恢复
- 发送告警通知

### 4. 认证错误 (Authentication)

```json
{
  "authenticationError": {
    "switchModel": true,
    "alert": true,
    "disableModel": true
  }
}
```

**行为**：
- API 密钥失效时立即切换
- 禁用故障模型（不自动恢复）
- 发送紧急告警

## 智能路由规则

根据任务类型自动选择最合适的模型：

```json
{
  "routing": {
    "strategy": "priority-fallback",
    "rules": [
      {
        "name": "coding-task",
        "match": {
          "contentContains": ["代码", "code", "编程", "函数"]
        },
        "preferModels": [
          "openrouter-vip/gpt-5.2-codex",
          "anapi/opus-4.5"
        ]
      },
      {
        "name": "chinese-task",
        "match": {
          "language": "zh"
        },
        "preferModels": [
          "zai/glm-4.7",
          "anapi/opus-4.5"
        ]
      },
      {
        "name": "vision-task",
        "match": {
          "hasImage": true
        },
        "preferModels": [
          "anapi/opus-4.5"
        ]
      }
    ]
  }
}
```

## 监控与日志

### 日志文件位置

```bash
~/.openclaw/logs/model-fallback.log    # 切换日志
~/.openclaw/logs/model-monitor.log     # 监控日志
~/.openclaw/logs/model-status.json     # 状态报告
```

### 查看实时日志

```bash
# 查看切换日志
tail -f ~/.openclaw/logs/model-fallback.log

# 查看监控日志
tail -f ~/.openclaw/logs/model-monitor.log

# 查看所有日志
tail -f ~/.openclaw/logs/*.log
```

### 状态报告

```bash
# 查看当前状态
~/.openclaw/scripts/monitor-models.sh status

# JSON 格式状态
cat ~/.openclaw/logs/model-status.json | python3 -m json.tool
```

## 脚本说明

### model-fallback.sh

模型降级切换脚本，负责：
- 测试所有配置的模型
- 选择最佳可用模型
- 执行模型切换
- 记录切换日志

**用法**：
```bash
~/.openclaw/scripts/model-fallback.sh
```

### monitor-models.sh

健康监控守护进程，负责：
- 定期检查模型健康状态
- 自动触发故障切换
- 生成状态报告
- 管理 PID 文件

**用法**：
```bash
~/.openclaw/scripts/monitor-models.sh {start|stop|restart|status|check}
```

### test-model-fallback.sh

测试脚本，用于：
- 验证配置正确性
- 测试切换逻辑
- 模拟故障场景
- 生成测试报告

**用法**：
```bash
~/clawd/scripts/test-model-fallback.sh
```

## 配置文件详解

### agent.json 完整配置

```json
{
  "model": "anapi/opus-4.5",
  "modelFallback": [
    "zai/glm-4.7",
    "openrouter-vip/gpt-5.2-codex",
    "github-copilot/claude-sonnet-4-5"
  ],
  "retry": {
    "maxAttempts": 3,
    "initialDelayMs": 1000,
    "maxDelayMs": 10000,
    "backoffMultiplier": 2.0,
    "useFallbackOnFailure": true
  },
  "errorHandling": {
    "rateLimit": {
      "switchModel": true,
      "cooldownMs": 60000,
      "alert": true
    },
    "timeout": {
      "switchModel": true,
      "retryCount": 2,
      "timeoutMs": 60000
    },
    "quotaExceeded": {
      "switchModel": true,
      "alert": true,
      "fallbackTo": "zai/glm-4.7"
    },
    "authenticationError": {
      "switchModel": true,
      "alert": true,
      "disableModel": true
    }
  },
  "models": {
    "anapi/opus-4.5": {
      "provider": "anapi",
      "alias": "opus45",
      "maxTokens": 200000,
      "timeoutMs": 60000,
      "priority": 1,
      "supports": ["vision", "tools", "long-context"],
      "costFactor": "high"
    },
    "zai/glm-4.7": {
      "provider": "zai",
      "alias": "zai47",
      "maxTokens": 200000,
      "timeoutMs": 60000,
      "priority": 2,
      "supports": ["tools", "long-context"],
      "costFactor": "medium",
      "bestFor": ["chinese", "general-purpose"]
    },
    "openrouter-vip/gpt-5.2-codex": {
      "provider": "openrouter-vip",
      "alias": "codex52",
      "maxTokens": 100000,
      "timeoutMs": 30000,
      "priority": 3,
      "supports": ["coding"],
      "costFactor": "low",
      "bestFor": ["coding", "code-generation"]
    },
    "github-copilot/claude-sonnet-4-5": {
      "provider": "github-copilot",
      "alias": "sonnet",
      "maxTokens": 200000,
      "timeoutMs": 60000,
      "priority": 4,
      "supports": ["tools", "long-context"],
      "costFactor": "free",
      "bestFor": ["fallback", "general-purpose"]
    }
  },
  "monitoring": {
    "enabled": true,
    "checkIntervalMs": 300000,
    "logFile": "$HOME/.openclaw/logs/model-fallback.log",
    "alertOnFailure": true
  }
}
```

## 故障排查

### 问题 1: 模型未自动切换

**检查**：
```bash
# 查看配置文件
cat ~/.openclaw/agents/main/agent/agent.json | grep modelFallback

# 查看日志
tail -20 ~/.openclaw/logs/model-fallback.log

# 手动运行切换脚本
~/.openclaw/scripts/model-fallback.sh
```

### 问题 2: 监控未运行

**检查**：
```bash
# 查看进程
ps aux | grep monitor-models

# 查看PID文件
cat ~/.openclaw/logs/model-monitor.pid

# 重启监控
~/.openclaw/scripts/monitor-models.sh restart
```

### 问题 3: 所有模型都不可用

**检查**：
```bash
# 查看状态报告
~/.openclaw/scripts/monitor-models.sh status

# 检查 API 密钥
cat ~/.openclaw/agents/main/agent/auth-profiles.json

# 测试网络连接
ping -c 3 anapi.9w7.cn
ping -c 3 open.bigmodel.cn
```

## 性能优化

### 减少切换频率

```json
{
  "retry": {
    "maxAttempts": 5,        // 增加重试次数
    "initialDelayMs": 2000   // 增加初始延迟
  }
}
```

### 优化响应时间

为不同任务选择最快的模型：

```json
{
  "routing": {
    "rules": [
      {
        "name": "quick-response",
        "match": {
          "priority": "speed"
        },
        "preferModels": [
          "github-copilot/claude-sonnet-4-5",  // 通常响应最快
          "zai/glm-4.7"
        ]
      }
    ]
  }
}
```

## 集成到 OpenClaw

配置完成后，模型降级功能会自动集成到 OpenClaw Gateway：

1. **自动重启 Gateway**:
```bash
openclaw gateway restart
```

2. **验证配置**:
```bash
openclaw status | grep Model
```

3. **查看日志**:
```bash
journalctl -u openclaw-gateway -f | grep model
```

## 最佳实践

### 1. 定期检查

每周运行一次全面检查：
```bash
~/clawd/scripts/test-model-fallback.sh
```

### 2. 监控日志

每天查看切换日志：
```bash
grep "切换模型" ~/.openclaw/logs/model-fallback.log | tail -10
```

### 3. 更新配置

当添加新模型时，更新 `agent.json`：
```json
{
  "modelFallback": [
    "anapi/opus-4.5",
    "zai/glm-4.7",
    "new-model-here",  // 新模型
    "github-copilot/claude-sonnet-4-5"
  ]
}
```

### 4. 备份配置

定期备份配置文件：
```bash
cp ~/.openclaw/agents/main/agent/agent.json \
   ~/.openclaw/agents/main/agent/agent.json.backup
```

## 相关文件

- `~/.openclaw/agents/main/agent/agent.json` - 主配置文件
- `~/.openclaw/agents/main/agent/auth-profiles.json` - API 密钥
- `~/.openclaw/scripts/model-fallback.sh` - 切换脚本
- `~/.openclaw/scripts/monitor-models.sh` - 监控脚本
- `~/clawd/scripts/test-model-fallback.sh` - 测试脚本
- `~/clawd/docs/model-fallback-strategy.md` - 技术文档

## 支持的模型

当前配置的模型及其特性：

| 模型 | 供应商 | 优先级 | 最大Token | 特长 |
|------|--------|--------|----------|------|
| opus-4.5 | anapi | 1 | 200k | 最强能力，视觉 |
| glm-4.7 | zai | 2 | 200k | 中文优化 |
| gpt-5.2-codex | openrouter-vip | 3 | 100k | 编码专用 |
| sonnet-4.5 | github-copilot | 4 | 200k | 免费备用 |

## TODO

- [ ] 添加更多供应商
- [ ] 实现基于成本的模型选择
- [ ] 添加模型性能指标收集
- [ ] 实现预测性模型切换
- [ ] 集成告警通知（Telegram/邮件）
- [ ] 添加 WebUI 监控面板

Overview

This skill implements automatic model fallback and failover to keep AI services available when a primary model fails, times out, hits rate limits, or exhausts quota. It supports multi-vendor, multi-priority model selection, health monitoring, automatic retries, and error recovery to minimize downtime and preserve user experience.

How this skill works

The skill monitors configured models and routes requests according to priority and routing rules. On request failure (timeout, rate limit, quota, auth, network, or service error) it applies an exponential backoff retry policy and, if configured, switches to the next available fallback model. A background monitor checks endpoint connectivity, latency, error rates, and quota usage and updates model health state.

When to use it

Ensure continuous service for production assistants using multiple model providers.
Protect against transient API errors, rate limits, and quota exhaustion.
Support routing tasks to specialized models (coding, Chinese, vision) with automatic fallback.
Reduce user-visible failures during provider outages or authentication problems.
Test and validate model switching behavior in staging before production rollouts.

Best practices

Define a clear priority list and routing rules that match task types and performance needs.
Use conservative retry settings with exponential backoff to avoid cascading failures.
Enable health monitoring with a reasonable check interval (e.g., 5 minutes) and log all switch events.
Mark models disabled on authentication errors to prevent repeated failures until credentials are fixed.
Regularly run test scripts and back up agent configuration before changes.

Example use cases

Primary model hits rate limit during peak traffic; system retries and then switches to the next priority model to maintain responses.
A vision-capable model becomes unavailable; routing rules prefer an alternative trained for vision tasks and fallback preserves functionality.
Code generation tasks are routed to a codex-optimized model, with a general-purpose model as a fallback if the codex provider fails.
Quota on a paid provider is exhausted; the monitor triggers hourly checks and routes traffic to lower-cost or free backups while alerting operators.

FAQ

How does retry and fallback sequence work?

The skill retries up to configured attempts with exponential backoff (initial delay, multiplier, max delay). If retries fail and useFallbackOnFailure is enabled, it switches to the next available model in priority order.

What errors trigger immediate model disable?

Authentication failures typically disable the affected model immediately and raise alerts, preventing automatic re-enable until credentials are corrected.