model-audit

Weekly audit of AI model performance, costs, and fallback health

# Model Audit Skill

## Purpose
Weekly audit of AI model landscape to ensure every agent runs the best available model for their domain. Maintains a 3-tier model stack per agent: best-in-class default, budget runner-up, and free failsafe.

## Model Stack Philosophy
Every agent gets a 3-tier model stack optimized for their specific domain:
- **Tier 1 (Default)**: The absolute best model on the market for that agent's domain — cost is irrelevant, quality is king
- **Tier 2 (Budget)**: Best runner-up budget model — activated if we're burning credits too fast
- **Tier 3 (Failsafe)**: Best free-tier model — last resort fallback to keep the agent running
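As a sketch only, here is how one agent's 3-tier stack might map onto the `openclaw.json` fields described in Step 3. The model IDs are placeholders, not recommendations:

```json
{
  "agents": {
    "list": [
      {
        "id": "devin",
        "model": {
          "primary": "openrouter/some-frontier-coder",
          "fallbacks": [
            "openrouter/some-budget-coder",
            "openrouter/some-open-coder:free"
          ]
        }
      }
    ]
  }
}
```

Tier 1 is `model.primary`, Tier 2 is `fallbacks[0]`, and the remaining fallbacks form the Tier 3 chain.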

## Agent Domain Mapping
| Agent | Domain | Optimization Priority |
|-------|--------|----------------------|
| Alfred (orchestrator) | General orchestration, multi-step planning, tool use | Reasoning + tool calling |
| Devin (delivery_ops) | Software engineering, code generation, debugging | Coding benchmarks (HumanEval, SWE-bench) |
| Rene (rnd) | Deep research, analysis, reasoning chains | Reasoning + knowledge (MMLU, GPQA) |
| Rick (revenue) | Sales writing, persuasion, communication | Writing quality + instruction following |
| Laura (legal) | Legal analysis, compliance, contract review | Reasoning + factual accuracy |
| Persephany (people) | HR, team evaluation, communication | Instruction following + empathy |
| Daniel (design) | UI/UX design, visual analysis, image understanding | Vision + multimodal benchmarks |
| Friedrich (finance) | Financial analysis, calculations, reporting | Math + reasoning (MATH, GSM8K) |

## Execution Steps

### Step 1: Research Current Leaderboards
Browse these sources for the latest model rankings:
1. **https://artificialanalysis.ai/leaderboards/models** — Overall quality, speed, and price comparison
2. **https://openrouter.ai/models** — Available models with pricing and capabilities
3. **https://lmarena.ai** — Chatbot Arena Elo rankings (human preference)
4. **https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard** — Open model benchmarks

For each agent domain, identify:
- Which model is currently #1 in that domain's benchmarks
- Which model offers the best quality-per-dollar (budget tier)
- Which free model performs best in that domain
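The budget-tier pick can be sketched as a quality-per-dollar ranking. The model IDs, scores, and prices below are illustrative placeholders, not real leaderboard data:

```python
# Sketch: rank candidate models by quality-per-dollar to pick the budget tier.
# Scores and prices are made-up placeholders, not real benchmark results.

def quality_per_dollar(score: float, usd_per_million_tokens: float) -> float:
    """Domain benchmark score divided by blended price per 1M tokens."""
    return score / usd_per_million_tokens

candidates = [
    # (model id, domain benchmark score, blended $/1M tokens)
    ("openrouter/model-a", 92.0, 15.00),
    ("openrouter/model-b", 88.0, 3.00),
    ("openrouter/model-c", 80.0, 0.50),
]

ranked = sorted(
    candidates,
    key=lambda m: quality_per_dollar(m[1], m[2]),
    reverse=True,
)
budget_pick = ranked[0][0]
print(budget_pick)  # prints "openrouter/model-c" (80 / 0.50 = 160 points per dollar)
```

Note that the Tier 1 pick deliberately ignores this ratio — it is chosen on raw quality alone.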

### Step 2: Build Recommended Stacks
For each agent, construct a 3-tier stack:
```
Agent: [name] ([id])
Domain: [domain]
Current Stack: [read from openclaw.json]
Recommended Stack:
  Tier 1 (Default): [model] — [reasoning: benchmark X score Y]
  Tier 2 (Budget):  [model] — [reasoning: cost vs quality tradeoff]
  Tier 3 (Failsafe): [model]:free — [reasoning: best free option for domain]
Changes: [list specific changes vs current config]
```

### Step 3: Compare Against Current Config
Read `~/.openclaw/openclaw.json` and compare each agent's current model config against recommendations:
- `agents.list[].model.primary` = Tier 1
- `agents.list[].model.fallbacks[0]` = Tier 2
- Remaining fallbacks = Tier 3 chain
- `agents.defaults.models` = allowlist (must include all recommended models)
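A minimal sketch of this comparison, assuming the `agents.list[].model` layout above; the agent entry and recommended models are hypothetical, and in the real audit `current` would come from `json.load` on `~/.openclaw/openclaw.json`:

```python
# Sketch of the Step 3 diff. The dict below stands in for the parsed
# openclaw.json; agent and model names are hypothetical.
current = {
    "agents": {
        "list": [
            {"id": "devin",
             "model": {"primary": "openrouter/old-coder",
                       "fallbacks": ["openrouter/budget-coder",
                                     "openrouter/open-coder:free"]}},
        ]
    }
}

# Recommended stacks keyed by agent id: (tier1, tier2, tier3).
recommended = {
    "devin": ("openrouter/new-coder",
              "openrouter/budget-coder",
              "openrouter/open-coder:free"),
}

changes = {}
for agent in current["agents"]["list"]:
    tier1, tier2, tier3 = recommended[agent["id"]]
    stack = [agent["model"]["primary"], *agent["model"]["fallbacks"]]
    if stack != [tier1, tier2, tier3]:
        changes[agent["id"]] = {"from": stack, "to": [tier1, tier2, tier3]}

print(changes)  # only agents whose stack differs from the recommendation
```

A "meaningful upgrade" filter would go inside the loop, e.g. skipping a change whose benchmark gain falls below some threshold.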

Flag only meaningful upgrades — don't recommend changes for marginal improvements.

### Step 4: Write Audit Report
Save report to `~/.openclaw/workspace/knowledge/areas/model-audit-log.md`:
```markdown
# Model Audit Log

## [Date] Audit
**Audited by**: Rene
**Sources checked**: [list URLs]

### Recommendations
[Per-agent recommendations from Step 2]

### Summary
- Agents requiring updates: [count]
- Models to add to allowlist: [list]
- Estimated impact: [brief assessment]
```
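Since the log accumulates one dated section per week, the save is an append rather than an overwrite. A minimal Python sketch, assuming the path above (the helper name is ours, not part of any existing tooling):

```python
# Sketch: append a dated "## [Date] Audit" section to the audit log,
# creating the file with its top-level heading on first run.
from datetime import date
from pathlib import Path

def append_audit_entry(log_path: Path, body: str) -> None:
    """Append one audit section; create the log file if it is missing."""
    header = f"## {date.today().isoformat()} Audit\n"
    if not log_path.exists():
        log_path.parent.mkdir(parents=True, exist_ok=True)
        log_path.write_text("# Model Audit Log\n\n")
    with log_path.open("a") as f:
        f.write("\n" + header + body + "\n")
```

Usage would be along the lines of `append_audit_entry(Path("~/.openclaw/workspace/knowledge/areas/model-audit-log.md").expanduser(), report_body)`.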

### Step 5: Send to Alfred for Approval Chain
Send audit results to Alfred via Telegram with:
1. Summary of recommended changes
2. Per-agent breakdown (only agents that need updates)
3. Clear instruction: "Please forward to Clayton for approval before applying."

**Alfred's responsibility**: Forward the suggestions to Clayton. If Clayton approves, Alfred updates `openclaw.json` with the new model stacks.

## Important Notes
- **Never auto-apply model changes** — always go through the approval chain
- Models must be available on OpenRouter (prefix: `openrouter/`)
- Verify models are actually available (not deprecated/removed) before recommending
- Consider context window size — some tasks need large context (128K+)
- Consider speed vs quality tradeoff per agent's typical workload
- Free tier models should use the `:free` suffix on OpenRouter
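Some of these notes can be checked mechanically before a stack is recommended. A minimal sanity-check sketch mirroring the `openrouter/` prefix and `:free` suffix rules above (it does not verify live availability, which still needs a lookup against OpenRouter):

```python
# Sketch: flag stack entries that break the naming rules above.
def validate_stack(tier1: str, tier2: str, tier3: str) -> list[str]:
    """Return a list of problems; an empty list means the checks pass."""
    problems = []
    for model in (tier1, tier2, tier3):
        if not model.startswith("openrouter/"):
            problems.append(f"{model}: missing openrouter/ prefix")
    if not tier3.endswith(":free"):
        problems.append(f"{tier3}: failsafe tier should use the :free suffix")
    return problems

print(validate_stack("openrouter/a", "openrouter/b", "openrouter/c"))
# prints ['openrouter/c: failsafe tier should use the :free suffix']
```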