# Model Audit Skill

> `model-audit`: weekly audit of AI model performance, costs, and fallback health. Used by: Cron Jobs.

## Purpose
Weekly audit of the AI model landscape to ensure every agent runs the best available model for its domain. Maintains a 3-tier model stack per agent: best-in-class default, budget runner-up, and free failsafe.
## Model Stack Philosophy

Every agent gets a 3-tier model stack optimized for their specific domain:

- **Tier 1 (Default)**: The absolute best model on the market for that agent's domain — cost is irrelevant, quality is king
- **Tier 2 (Budget)**: Best runner-up budget model — activated if we're burning credits too fast
- **Tier 3 (Failsafe)**: Best free-tier model — last resort fallback to keep the agent running
## Agent Domain Mapping

| Agent | Domain | Optimization Priority |
|---|---|---|
| Alfred (orchestrator) | General orchestration, multi-step planning, tool use | Reasoning + tool calling |
| Devin (delivery_ops) | Software engineering, code generation, debugging | Coding benchmarks (HumanEval, SWE-bench) |
| Rene (rnd) | Deep research, analysis, reasoning chains | Reasoning + knowledge (MMLU, GPQA) |
| Rick (revenue) | Sales writing, persuasion, communication | Writing quality + instruction following |
| Laura (legal) | Legal analysis, compliance, contract review | Reasoning + factual accuracy |
| Persephany (people) | HR, team evaluation, communication | Instruction following + empathy |
| Daniel (design) | UI/UX design, visual analysis, image understanding | Vision + multimodal benchmarks |
| Friedrich (finance) | Financial analysis, calculations, reporting | Math + reasoning (MATH, GSM8K) |
## Execution Steps

### Step 1: Research Current Leaderboards

Browse these sources for the latest model rankings:

1. **https://artificialanalysis.ai/leaderboards/models** — Overall quality, speed, and price comparison
2. **https://openrouter.ai/models** — Available models with pricing and capabilities
3. **https://lmarena.ai** — Chatbot Arena ELO rankings (human preference)
4. **https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard** — Open model benchmarks
For each agent domain, identify:
- Which model is currently #1 in that domain's benchmarks
- Which model offers best quality-per-dollar (budget tier)
- Which free model performs best in that domain
### Step 2: Build Recommended Stacks

For each agent, construct a 3-tier stack:

```
Agent: [name] ([id])
Domain: [domain]
Current Stack: [read from openclaw.json]
Recommended Stack:
  Tier 1 (Default):  [model] — [reasoning: benchmark X score Y]
  Tier 2 (Budget):   [model] — [reasoning: cost vs quality tradeoff]
  Tier 3 (Failsafe): [model]:free — [reasoning: best free option for domain]
Changes: [list specific changes vs current config]
```
### Step 3: Compare Against Current Config

Read `~/.openclaw/openclaw.json` and compare each agent's current model config against recommendations:

- `agents.list[].model.primary` = Tier 1
- `agents.list[].model.fallbacks[0]` = Tier 2
- Remaining fallbacks = Tier 3 chain
- `agents.defaults.models` = allowlist (must include all recommended models)
Flag only meaningful upgrades — don't recommend changes for marginal improvements.
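The Step 3 comparison can be sketched in Python. The `agents.list[].model` field names follow the mapping above; the helper name and the example model IDs are made up for illustration:

```python
import json
from pathlib import Path

def diff_stack(current: dict, recommended: dict) -> list[str]:
    """Return human-readable changes; an empty list means the stacks already match."""
    changes = []
    if current.get("primary") != recommended["primary"]:
        changes.append(f"primary: {current.get('primary')} -> {recommended['primary']}")
    if current.get("fallbacks", []) != recommended["fallbacks"]:
        changes.append(f"fallbacks: {current.get('fallbacks')} -> {recommended['fallbacks']}")
    return changes

# In the real audit, `current` would come from the config file, e.g.:
# config = json.loads(Path("~/.openclaw/openclaw.json").expanduser().read_text())
# current = config["agents"]["list"][0]["model"]

# Illustrative model IDs, not recommendations:
current = {"primary": "openrouter/example/old-model",
           "fallbacks": ["openrouter/example/budget", "openrouter/example/small:free"]}
recommended = {"primary": "openrouter/example/new-model",
               "fallbacks": ["openrouter/example/budget", "openrouter/example/small:free"]}

print(diff_stack(current, recommended))
# -> ['primary: openrouter/example/old-model -> openrouter/example/new-model']
```

Agents whose `diff_stack` result is empty are skipped in the report, which keeps the audit focused on meaningful upgrades.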
### Step 4: Write Audit Report

Save the report to `~/.openclaw/workspace/knowledge/areas/model-audit-log.md`:

```markdown
# Model Audit Log

## [Date] Audit

**Audited by**: Rene
**Sources checked**: [list URLs]

### Recommendations

[Per-agent recommendations from Step 2]

### Summary

- Agents requiring updates: [count]
- Models to add to allowlist: [list]
- Estimated impact: [brief assessment]
```
### Step 5: Send to Alfred for Approval Chain

Send audit results to Alfred via Telegram with:

1. Summary of recommended changes
2. Per-agent breakdown (only agents that need updates)
3. Clear instruction: "Please forward to Clayton for approval before applying."

**Alfred's responsibility**: Forward the suggestions to Clayton. If Clayton approves, Alfred updates `openclaw.json` with the new model stacks.
## Important Notes

- **Never auto-apply model changes** — always go through the approval chain
- Models must be available on OpenRouter (prefix: `openrouter/`)
- Verify models are actually available (not deprecated/removed) before recommending
- Consider context window size — some tasks need large context (128K+)
- Consider speed vs quality tradeoff per agent's typical workload
- Free-tier models should use the `:free` suffix on OpenRouter
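Availability can be spot-checked against OpenRouter's public model list before recommending. This is a sketch: the `/api/v1/models` endpoint and its `data[].id` response shape match OpenRouter's published API at the time of writing, but verify against their docs before relying on it.

```python
import json
import urllib.request

def missing_models(recommended: list[str], available_ids: set[str]) -> list[str]:
    """Return recommended model IDs that OpenRouter does not list.

    Our config IDs carry the `openrouter/` prefix, while the API returns bare
    IDs. A `:free` variant is listed as its own ID, so keep the suffix intact.
    """
    return [m for m in recommended
            if m.removeprefix("openrouter/") not in available_ids]

def fetch_available_ids() -> set[str]:
    # Assumed response shape: {"data": [{"id": "vendor/model"}, ...]}
    with urllib.request.urlopen("https://openrouter.ai/api/v1/models") as resp:
        return {m["id"] for m in json.load(resp)["data"]}

# Offline demo with made-up IDs (call fetch_available_ids() for the live list):
available = {"example/model-a", "example/model-a:free"}
print(missing_models(["openrouter/example/model-a:free",
                      "openrouter/example/retired-model"], available))
# -> ['openrouter/example/retired-model']
```

Running this check during Step 2 catches deprecated or removed models before they reach the recommended stacks.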