# Model Audit Skill

> `model-audit`: weekly audit of AI model performance, costs, and fallback health. Used by: Cron Jobs.

## Purpose
Weekly audit of the AI model landscape to ensure every agent runs the best available model for its domain. Maintains a 3-tier model stack per agent: best-in-class default, budget runner-up, and free failsafe.
## Model Stack Philosophy

Every agent gets a 3-tier model stack optimized for their specific domain:

- **Tier 1 (Default)**: The absolute best model on the market for that agent's domain — cost is irrelevant, quality is king
- **Tier 2 (Budget)**: Best runner-up budget model — activated if we're burning credits too fast
- **Tier 3 (Failsafe)**: Best free-tier model — last resort fallback to keep the agent running
## Agent Domain Mapping

| Agent | Domain | Optimization Priority |
|---|---|---|
| Alfred (orchestrator) | General orchestration, multi-step planning, tool use | Reasoning + tool calling |
| Devin (delivery_ops) | Software engineering, code generation, debugging | Coding benchmarks (HumanEval, SWE-bench) |
| Rene (rnd) | Deep research, analysis, reasoning chains | Reasoning + knowledge (MMLU, GPQA) |
| Rick (revenue) | Sales writing, persuasion, communication | Writing quality + instruction following |
| Laura (legal) | Legal analysis, compliance, contract review | Reasoning + factual accuracy |
| Persephany (people) | HR, team evaluation, communication | Instruction following + empathy |
| Daniel (design) | UI/UX design, visual analysis, image understanding | Vision + multimodal benchmarks |
| Friedrich (finance) | Financial analysis, calculations, reporting | Math + reasoning (MATH, GSM8K) |
## Execution Steps

### Step 1: Research Current Leaderboards

Browse these sources for the latest model rankings:

1. **https://artificialanalysis.ai/leaderboards/models** — Overall quality, speed, and price comparison
2. **https://openrouter.ai/models** — Available models with pricing and capabilities
3. **https://lmarena.ai** — Chatbot Arena ELO rankings (human preference)
4. **https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard** — Open model benchmarks
For each agent domain, identify:
- Which model is currently #1 in that domain's benchmarks
- Which model offers best quality-per-dollar (budget tier)
- Which free model performs best in that domain
### Step 2: Build Recommended Stacks

For each agent, construct a 3-tier stack:

```
Agent: [name] ([id])
Domain: [domain]
Current Stack: [read from openclaw.json]
Recommended Stack:
  Tier 1 (Default):  [model] — [reasoning: benchmark X score Y]
  Tier 2 (Budget):   [model] — [reasoning: cost vs quality tradeoff]
  Tier 3 (Failsafe): [model]:free — [reasoning: best free option for domain]
Changes: [list specific changes vs current config]
```
### Step 3: Compare Against Current Config

Read `~/.openclaw/openclaw.json` and compare each agent's current model config against recommendations:

- `agents.list[].model.primary` = Tier 1
- `agents.list[].model.fallbacks[0]` = Tier 2
- Remaining fallbacks = Tier 3 chain
- `agents.defaults.models` = allowlist (must include all recommended models)
Flag only meaningful upgrades — don't recommend changes for marginal improvements.
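The Step 3 comparison can be sketched in Python. The `agents.list[].model` field names follow the mapping above; the helper name and the example model IDs are made up for illustration:

```python
import json
from pathlib import Path

def diff_stack(current: dict, recommended: dict) -> list[str]:
    """Return human-readable changes; an empty list means the stacks already match."""
    changes = []
    if current.get("primary") != recommended["primary"]:
        changes.append(f"primary: {current.get('primary')} -> {recommended['primary']}")
    if current.get("fallbacks", []) != recommended["fallbacks"]:
        changes.append(f"fallbacks: {current.get('fallbacks')} -> {recommended['fallbacks']}")
    return changes

# In the real audit, `current` would come from the config file, e.g.:
# config = json.loads(Path("~/.openclaw/openclaw.json").expanduser().read_text())
# current = config["agents"]["list"][0]["model"]

# Illustrative model IDs, not recommendations:
current = {"primary": "openrouter/example/old-model",
           "fallbacks": ["openrouter/example/budget", "openrouter/example/small:free"]}
recommended = {"primary": "openrouter/example/new-model",
               "fallbacks": ["openrouter/example/budget", "openrouter/example/small:free"]}

print(diff_stack(current, recommended))
# -> ['primary: openrouter/example/old-model -> openrouter/example/new-model']
```

Agents whose `diff_stack` result is empty are skipped in the report, which keeps the audit focused on meaningful upgrades.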
### Step 4: Write Audit Report

Save the report to `~/.openclaw/workspace/knowledge/areas/model-audit-log.md`:

```markdown
# Model Audit Log

## [Date] Audit

**Audited by**: Rene
**Sources checked**: [list URLs]

### Recommendations

[Per-agent recommendations from Step 2]

### Summary

- Agents requiring updates: [count]
- Models to add to allowlist: [list]
- Estimated impact: [brief assessment]
```
### Step 5: Send to Alfred for Approval Chain

Send audit results to Alfred via Telegram with:

1. Summary of recommended changes
2. Per-agent breakdown (only agents that need updates)
3. Clear instruction: "Please forward to Clayton for approval before applying."

**Alfred's responsibility**: Forward the suggestions to Clayton. If Clayton approves, Alfred updates `openclaw.json` with the new model stacks.
## Important Notes

- **Never auto-apply model changes** — always go through the approval chain
- Models must be available on OpenRouter (prefix: `openrouter/`)
- Verify models are actually available (not deprecated/removed) before recommending
- Consider context window size — some tasks need large context (128K+)
- Consider speed vs quality tradeoff per agent's typical workload
- Free-tier models should use the `:free` suffix on OpenRouter
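Availability can be spot-checked against OpenRouter's public model list before recommending. This is a sketch: the `/api/v1/models` endpoint and its `data[].id` response shape match OpenRouter's published API at the time of writing, but verify against their docs before relying on it.

```python
import json
import urllib.request

def missing_models(recommended: list[str], available_ids: set[str]) -> list[str]:
    """Return recommended model IDs that OpenRouter does not list.

    Our config IDs carry the `openrouter/` prefix, while the API returns bare
    IDs. A `:free` variant is listed as its own ID, so keep the suffix intact.
    """
    return [m for m in recommended
            if m.removeprefix("openrouter/") not in available_ids]

def fetch_available_ids() -> set[str]:
    # Assumed response shape: {"data": [{"id": "vendor/model"}, ...]}
    with urllib.request.urlopen("https://openrouter.ai/api/v1/models") as resp:
        return {m["id"] for m in json.load(resp)["data"]}

# Offline demo with made-up IDs (call fetch_available_ids() for the live list):
available = {"example/model-a", "example/model-a:free"}
print(missing_models(["openrouter/example/model-a:free",
                      "openrouter/example/retired-model"], available))
# -> ['openrouter/example/retired-model']
```

Running this check during Step 2 catches deprecated or removed models before they reach the recommended stacks.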