Model board · daily advisory rundown

What the board recommended on 2026-06-03.

Frozen snapshot. A board of language models ran research and analysis tasks over the same daily data, then BurnBar reduced the result to deterministic selections with source citations. Benchmark signals are advisory — runtime constraints (provider-family mode, Exact Model Failover's canonical-ID gate, pinning, auth, quota, safety, availability) always win.

generated 06:57 UTC
3 task categories
3 sources

Daily Model Board

Rundown · 2026-06-03

Generated Wed, 03 Jun 2026 06:57:32 GMT · schema v1 · model board · runtime constraints win

Artificial Analysis: fresh
Terminal-Bench (via Hugging Face): fresh
Design Arena: unavailable

A daily board of language models runs research and analysis tasks across the benchmark feed, then BurnBar reduces their findings to this deterministic recommendation. Benchmark data is advisory only; user pins, auth, quota, safety, availability, and exact-model failover rules still win at runtime.
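
As a rough illustration of "runtime constraints win", the sketch below filters board candidates by runtime state before consulting the board order. The `Candidate` fields and the `choose` function are hypothetical names invented for this example, not BurnBar's API, and the sketch leaves out provider-family mode, the canonical-ID gate, and safety checks.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    model: str
    board_rank: int           # 1 = today's pick
    authed: bool = True       # hypothetical flag: credentials are present
    quota_left: bool = True   # hypothetical flag: account still has quota
    available: bool = True    # hypothetical flag: provider is reachable

def choose(candidates, pinned=None):
    """Return the model to route to, or None if nothing is usable."""
    usable = [c for c in candidates if c.authed and c.quota_left and c.available]
    # A user pin beats the board, provided the pinned model is usable.
    for c in usable:
        if c.model == pinned:
            return c.model
    # Otherwise take the best-ranked board candidate that survived the filter.
    best = min(usable, key=lambda c: c.board_rank, default=None)
    return best.model if best else None

board = [
    Candidate("GPT-5.5 xhigh", 1, quota_left=False),
    Candidate("Claude Opus 4.7", 2),
    Candidate("GLM 5.1", 3),
]
print(choose(board))                    # Claude Opus 4.7
print(choose(board, pinned="GLM 5.1"))  # GLM 5.1
```

In this reading the board only orders the candidates that runtime checks have already allowed; it never overrides a failed check.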

Coding
Refactors, multi-file edits, repo-grounded code generation.

Today's pick: GPT-5.5 xhigh. It holds stable favorite rank #1 under 2026-05-13.stable-favorites, with preferred reasoning effort xhigh. It led the benchmark composite at 55/100 on fresh evidence, and its 400k context window clears typical large-context work. Runner-up Claude Opus 4.7 is held in reserve for instant failover.
1. #1
  GPT-5.5 xhigh OpenAI · openai_compat
  
  selection 99/100 · evidence 59 · coverage 86%
  - bench 55
  - fresh 100
  - rel n/a
  - latency 44
  - cost 22
  - ctx 400k
  - avail unknown
  - board favorite #1
  Board verdict
  - Stable favorite policy 2026-05-13.stable-favorites: favorite rank #1 receives a deterministic 12 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
  - Composite benchmark score 55/100 across 6 sources.
  - Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
  - Premium-tier per-token cost.
  - Latency is acceptable for non-interactive work.
  - Context window: 400k tokens.
  - Wire-format family: openai_compat.
  Source citations
  - Artificial Analysis
    score 59 1m old fresh
  - Artificial Analysis
    score 49 1m old fresh
  - Artificial Analysis
    score 56 1m old fresh
  - Artificial Analysis
    score 59 1m old fresh
  - Artificial Analysis
    1m old fresh
  - Artificial Analysis
    score 52 1m old fresh
2. #2
  Claude Opus 4.7 Anthropic · anthropic
  
  selection 96/100 · evidence 58 · coverage 86%
  - bench 53
  - fresh 100
  - rel n/a
  - latency 44
  - cost 18
  - ctx 1M
  - avail unknown
  - board favorite #2
  Board verdict
  - Stable favorite policy 2026-05-13.stable-favorites: favorite rank #2 receives a deterministic 8 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
  - Composite benchmark score 53/100 across 2 sources.
  - Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
  - Premium-tier per-token cost.
  - Latency is acceptable for non-interactive work.
  - Context window: 1M tokens.
  - Wire-format family: anthropic.
  Source citations
  - Artificial Analysis
    score 53 1m old fresh
  - Artificial Analysis
    score 53 1m old fresh
3. #3
  GLM 5.1 Z.ai · openai_compat
  
  selection 93/100 · evidence 53 · coverage 86%
  - bench 40
  - fresh 100
  - rel n/a
  - latency 59
  - cost 66
  - ctx 256k
  - avail unknown
  - board favorite #3
  Board verdict
  - Stable favorite policy 2026-05-13.stable-favorites: favorite rank #3 receives a deterministic 5 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
  - Composite benchmark score 40/100 across 2 sources.
  - Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
  - Mid-tier per-token cost.
  - Latency is acceptable for non-interactive work.
  - Context window: 256k tokens.
  - Wire-format family: openai_compat.
  Source citations
  - Artificial Analysis
    score 43 1m old fresh
  - Artificial Analysis
    score 36 1m old fresh
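
The rundown does not say how the composite benchmark score is derived from the citations. One plausible reading, sketched below, is an unweighted mean of the cited scores, skipping the citation that lists no score. The `composite` helper is an assumption made for illustration.

```python
# Cited scores copied from the Coding source citations above.
# None marks the GPT-5.5 xhigh citation that lists no score.
cited = {
    "GPT-5.5 xhigh": [59, 49, 56, 59, None, 52],
    "Claude Opus 4.7": [53, 53],
    "GLM 5.1": [43, 36],
}

def composite(scores):
    """Unweighted mean of the citations that carry a score (assumed formula)."""
    present = [s for s in scores if s is not None]
    return sum(present) / len(present)

for model, scores in cited.items():
    print(model, composite(scores))
# GPT-5.5 xhigh 55.0
# Claude Opus 4.7 53.0
# GLM 5.1 39.5
```

These values sit within half a point of the published Coding composites (55, 53, 40). The same mean lands within about a point of the Terminal and General composites but does not always round to them, so the real reduction likely uses weights or unrounded inputs that the page does not show.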
Why other candidates didn't make the board pick (12 dropped)
- GPT-5.3 Codex · OpenAI
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 57/100; selection 90/100 vs. leader 99/100.
- GLM 5 · Z.ai
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 57/100; selection 87/100 vs. leader 99/100.
- DeepSeek V4 Pro · DeepSeek
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 56/100; selection 84/100 vs. leader 99/100.
- Gemini 3.1 Pro (preview) · Google
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 55/100; selection 81/100 vs. leader 99/100.
- Kimi K2.6 · Moonshot · Kimi
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 55/100; selection 78/100 vs. leader 99/100.
- MiniMax M2.7 · MiniMax
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 55/100; selection 75/100 vs. leader 99/100.
- Claude Sonnet 4.6 · Anthropic
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 53/100; selection 72/100 vs. leader 99/100.
- GLM 4.7 · Z.ai
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 52/100; selection 69/100 vs. leader 99/100.
- DeepSeek V4 Flash · DeepSeek
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 52/100; selection 66/100 vs. leader 99/100.
- Gemini 3 Flash · Google
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 50/100; selection 63/100 vs. leader 99/100.
- Kimi K2.5 · Moonshot · Kimi
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 49/100; selection 60/100 vs. leader 99/100.
- GPT-5.4 mini · OpenAI
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 47/100; selection 57/100 vs. leader 99/100.
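
The board verdicts describe a deterministic prior per favorite rank and a selection score "calibrated after policy ordering". The exact procedure is not published. The Coding numbers above are consistent with adding the prior to the evidence score, sorting, and then assigning 99 minus 3 points per position; the sketch below assumes that reading, and the `rank` function is a hypothetical name.

```python
PRIOR = {1: 12, 2: 8, 3: 5}  # favorite rank -> prior points, as quoted in the board verdicts

# (model, evidence, favorite rank or None), copied from the Coding section.
coding = [
    ("GPT-5.5 xhigh", 59, 1),
    ("Claude Opus 4.7", 58, 2),
    ("GLM 5.1", 53, 3),
    ("GPT-5.3 Codex", 57, None),
    ("GLM 5", 57, None),
    ("DeepSeek V4 Pro", 56, None),
]

def rank(candidates):
    """Order by evidence plus prior, then assign a position-based public score."""
    ordered = sorted(
        candidates,
        key=lambda c: c[1] + PRIOR.get(c[2], 0),
        reverse=True,  # sorted() is stable, so ties keep their input order
    )
    # Assumed calibration: 99 for the pick, 3 points less per position.
    return [(model, 99 - 3 * i) for i, (model, _, _) in enumerate(ordered)]

for model, selection in rank(coding):
    print(f"{model}: selection {selection}/100")
```

Under these assumptions the output reproduces the published selection scores for the first six Coding candidates (99, 96, 93, 90, 87, 84). It also shows why the selection score tracks position rather than evidence strength.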
Terminal
Shell-loop agents that execute, observe, and self-correct.

Today's pick: GLM 5.1. It holds stable favorite rank #3 under 2026-05-13.stable-favorites, and the stable favorite policy keeps it ahead even though its benchmark composite of 66/100 trails DeepSeek V4 Pro (68) and Kimi K2.6 (67). Evidence is fresh, cost is competitive, and its 256k context window clears typical large-context work. Runner-up DeepSeek V4 Pro is held in reserve for instant failover.
1. #1
  GLM 5.1 Z.ai · openai_compat
  
  selection 99/100 · evidence 68 · coverage 86%
  - bench 66
  - fresh 100
  - rel 65
  - latency n/a
  - cost 66
  - ctx 256k
  - avail unknown
  - board favorite #3
  Board verdict
  - Stable favorite policy 2026-05-13.stable-favorites: favorite rank #3 receives a deterministic 5 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
  - Composite benchmark score 66/100 across 2 sources.
  - Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
  - Mid-tier per-token cost.
  - Context window: 256k tokens.
  - Wire-format family: openai_compat.
  Source citations
  - Terminal-Bench (via Hugging Face)
    rank 1 score 69 1m old fresh
  - Terminal-Bench (via Hugging Face)
    rank 6 score 64 1m old fresh
2. #2
  DeepSeek V4 Pro DeepSeek · openai_compat
  
  selection 96/100 · evidence 70 · coverage 86%
  - bench 68
  - fresh 100
  - rel 65
  - latency n/a
  - cost 72
  - ctx 128k
  - avail unknown
  Board verdict
  - Evidence score only; to outrank a protected favorite, a challenger must clear both evidence and benchmark dethroning margins across consecutive rundowns.
  - Composite benchmark score 68/100 across 1 source.
  - Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
  - Cost-efficient at typical blended pricing.
  - Context window: 128k tokens.
  - Wire-format family: openai_compat.
  Source citations
  - Terminal-Bench (via Hugging Face)
    rank 3 score 68 1m old fresh
3. #3
  Kimi K2.6 Moonshot · Kimi · openai_compat
  
  selection 93/100 · evidence 68 · coverage 86%
  - bench 67
  - fresh 100
  - rel 65
  - latency n/a
  - cost 60
  - ctx 262k
  - avail unknown
  Board verdict
  - Evidence score only; to outrank a protected favorite, a challenger must clear both evidence and benchmark dethroning margins across consecutive rundowns.
  - Composite benchmark score 67/100 across 1 source.
  - Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
  - Mid-tier per-token cost.
  - Context window: 262k tokens.
  - Wire-format family: openai_compat.
  Source citations
  - Terminal-Bench (via Hugging Face)
    rank 4 score 67 1m old fresh
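
The challenger verdicts above say a protected favorite is only outranked once a challenger clears both the evidence and the benchmark dethroning margins across consecutive rundowns. The margins themselves are not published, so the values below are placeholders, and "consecutive" is read as two rundowns in a row. All names in this sketch are hypothetical.

```python
EVIDENCE_MARGIN = 5  # hypothetical; the rundown does not publish this margin
BENCH_MARGIN = 5     # hypothetical
REQUIRED_RUNS = 2    # assumption: "consecutive" means two rundowns in a row

def clears(challenger, favorite):
    """True if the challenger beats the favorite by both margins in one rundown."""
    return (
        challenger["evidence"] - favorite["evidence"] >= EVIDENCE_MARGIN
        and challenger["bench"] - favorite["bench"] >= BENCH_MARGIN
    )

def dethroned(history):
    """history: list of (challenger, favorite) snapshots, oldest first."""
    recent = history[-REQUIRED_RUNS:]
    return len(recent) == REQUIRED_RUNS and all(clears(c, f) for c, f in recent)

# Today's Terminal numbers: DeepSeek V4 Pro (challenger) vs GLM 5.1 (favorite).
today = ({"evidence": 70, "bench": 68}, {"evidence": 68, "bench": 66})
print(dethroned([today, today]))  # False: a 2 point lead is under the assumed margins
```

With any margin above 2 points, DeepSeek V4 Pro's lead on both measures is not enough to displace GLM 5.1, which matches the ordering shown in this section.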
Why other candidates didn't make the board pick (5 dropped)
- MiniMax M2.7 · MiniMax
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 63/100; selection 90/100 vs. leader 99/100.
- DeepSeek V4 Flash · DeepSeek
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 62/100; selection 87/100 vs. leader 99/100.
- GLM 5 · Z.ai
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 61/100; selection 84/100 vs. leader 99/100.
- Kimi K2.5 · Moonshot · Kimi
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 56/100; selection 81/100 vs. leader 99/100.
- GLM 4.7 · Z.ai
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 51/100; selection 78/100 vs. leader 99/100.
General
Mixed-intent chat / one-shot questions / catch-all routing.

Today's pick: GPT-5.5 xhigh. It holds stable favorite rank #1 under 2026-05-13.stable-favorites, with preferred reasoning effort xhigh. Its benchmark composite of 53/100 trails runner-up Claude Opus 4.7 (55), so the stable favorite policy, not a benchmark lead, keeps it on top. Evidence is fresh, and its 400k context window clears typical large-context work. Claude Opus 4.7 is held in reserve for instant failover.
1. #1
  GPT-5.5 xhigh OpenAI · openai_compat
  
  selection 99/100 · evidence 58 · coverage 86%
  - bench 53
  - fresh 100
  - rel n/a
  - latency 44
  - cost 22
  - ctx 400k
  - avail unknown
  - board favorite #1
  Board verdict
  - Stable favorite policy 2026-05-13.stable-favorites: favorite rank #1 receives a deterministic 12 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
  - Composite benchmark score 53/100 across 6 sources.
  - Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
  - Premium-tier per-token cost.
  - Latency is acceptable for non-interactive work.
  - Context window: 400k tokens.
  - Wire-format family: openai_compat.
  Source citations
  - Artificial Analysis
    score 59 1m old fresh
  - Artificial Analysis
    score 41 1m old fresh
  - Artificial Analysis
    score 57 1m old fresh
  - Artificial Analysis
    score 60 1m old fresh
  - Artificial Analysis
    1m old fresh
  - Artificial Analysis
    score 51 1m old fresh
2. #2
  Claude Opus 4.7 Anthropic · anthropic
  
  selection 96/100 · evidence 59 · coverage 86%
  - bench 55
  - fresh 100
  - rel n/a
  - latency 44
  - cost 18
  - ctx 1M
  - avail unknown
  - board favorite #2
  Board verdict
  - Stable favorite policy 2026-05-13.stable-favorites: favorite rank #2 receives a deterministic 8 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
  - Composite benchmark score 55/100 across 2 sources.
  - Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
  - Premium-tier per-token cost.
  - Latency is acceptable for non-interactive work.
  - Context window: 1M tokens.
  - Wire-format family: anthropic.
  Source citations
  - Artificial Analysis
    score 52 1m old fresh
  - Artificial Analysis
    score 57 1m old fresh
3. #3
  GLM 5.1 Z.ai · openai_compat
  
  selection 93/100 · evidence 58 · coverage 86%
  - bench 48
  - fresh 100
  - rel n/a
  - latency 59
  - cost 66
  - ctx 256k
  - avail unknown
  - board favorite #3
  Board verdict
  - Stable favorite policy 2026-05-13.stable-favorites: favorite rank #3 receives a deterministic 5 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
  - Composite benchmark score 48/100 across 2 sources.
  - Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
  - Mid-tier per-token cost.
  - Latency is acceptable for non-interactive work.
  - Context window: 256k tokens.
  - Wire-format family: openai_compat.
  Source citations
  - Artificial Analysis
    score 51 1m old fresh
  - Artificial Analysis
    score 44 1m old fresh
Why other candidates didn't make the board pick (12 dropped)
- GLM 5 · Z.ai
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 60/100; selection 90/100 vs. leader 99/100.
- MiniMax M2.7 · MiniMax
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 59/100; selection 87/100 vs. leader 99/100.
- Kimi K2.6 · Moonshot · Kimi
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 58/100; selection 84/100 vs. leader 99/100.
- DeepSeek V4 Pro · DeepSeek
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 58/100; selection 81/100 vs. leader 99/100.
- GPT-5.3 Codex · OpenAI
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 58/100; selection 78/100 vs. leader 99/100.
- Gemini 3.1 Pro (preview) · Google
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 56/100; selection 75/100 vs. leader 99/100.
- GLM 4.7 · Z.ai
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 56/100; selection 72/100 vs. leader 99/100.
- DeepSeek V4 Flash · DeepSeek
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 55/100; selection 69/100 vs. leader 99/100.
- Kimi K2.5 · Moonshot · Kimi
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 54/100; selection 66/100 vs. leader 99/100.
- Claude Sonnet 4.6 · Anthropic
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 53/100; selection 63/100 vs. leader 99/100.
- Gemini 3 Flash · Google
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 51/100; selection 60/100 vs. leader 99/100.
- GPT-5.4 mini · OpenAI
  Selection policy did not clear the leader's margin for this task.
  
  Evidence 46/100; selection 57/100 vs. leader 99/100.
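
Every board verdict notes that older sources are "weighted down, not dropped". The weighting curve is not given. A minimal sketch of one way to get that behavior is a weighted mean whose weights follow the freshness rating but never fall below a floor. The `freshness_weighted` helper and its `floor` parameter are assumptions made for illustration.

```python
def freshness_weighted(citations, floor=0.25):
    """Weighted mean of scores.

    citations: list of (score, freshness) pairs, freshness on a 0-100 scale.
    floor: hypothetical minimum weight, so a stale source still counts a little.
    """
    weights = [max(floor, fresh / 100) for _, fresh in citations]
    total = sum(w * score for w, (score, _) in zip(weights, citations))
    return total / sum(weights)

# Both General citations for Claude Opus 4.7 are fresh today, so weights are equal.
print(freshness_weighted([(52, 100), (57, 100)]))  # 54.5
# If the second source were fully stale, it would be down-weighted but still counted.
print(freshness_weighted([(52, 100), (57, 0)]))    # 53.0
```

Because every source in this rundown is rated fresh at 100/100, any weighting of this kind reduces to a plain mean for today's numbers.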

Re-run today's routing locally.

Add an account, pick a model, and let the Fire Hydrant do the routing. Provider-family mode by default; Exact Model Failover when you want cross-provider recovery without changing the canonical model.