Model board · daily advisory rundown

What the board recommended on 2026-05-30.

Frozen snapshot. A board of language models ran research and analysis tasks over the same daily data, then BurnBar reduced the result to deterministic selections with source citations. Benchmark signals are advisory — runtime constraints (provider-family mode, Exact Model Failover's canonical-ID gate, pinning, auth, quota, safety, availability) always win.

  • generated 18:16 UTC
  • 3 task categories
  • 3 sources

Generated Sat, 30 May 2026 18:16:28 GMT · schema v1 · model board · runtime constraints win

  • Artificial Analysis: fresh
  • Terminal-Bench (via Hugging Face): fresh
  • Design Arena: unavailable

A daily board of language models runs research and analysis tasks across the benchmark feed, then BurnBar reduces their findings to this deterministic recommendation. Benchmark data is advisory only; user pins, auth, quota, safety, availability, and exact-model failover rules still win at runtime.
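The precedence described above can be sketched in Python. This is a minimal illustration, not BurnBar's implementation: `RuntimeState`, `resolve_model`, and the field names are hypothetical, and only three of the listed constraints (pinning, auth, availability) are modeled. It also assumes that a pin which cannot be routed falls back to the advisory order, which the rundown does not state.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class RuntimeState:
    """Hypothetical runtime view; these field names are illustrative only."""
    pinned_model: Optional[str] = None
    unavailable: Set[str] = field(default_factory=set)
    unauthorized: Set[str] = field(default_factory=set)


def resolve_model(advisory_ranking: List[str], state: RuntimeState) -> Optional[str]:
    """Pick a model, letting runtime constraints override the advisory order."""
    blocked = state.unavailable | state.unauthorized
    # A user pin wins outright, provided the pinned model can be routed.
    if state.pinned_model is not None and state.pinned_model not in blocked:
        return state.pinned_model
    # Otherwise walk the advisory order and take the first routable model.
    for model in advisory_ranking:
        if model not in blocked:
            return model
    return None


ranking = ["GPT-5.5 xhigh", "Claude Opus 4.7", "GLM 5.1"]
print(resolve_model(ranking, RuntimeState(unavailable={"GPT-5.5 xhigh"})))
```

With the top pick marked unavailable, the sketch falls through to the runner-up, which mirrors the "held in reserve for instant failover" wording used in the picks below.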

  1. Coding

    Refactors, multi-file edits, repo-grounded code generation.

    Today's pick: GPT-5.5 xhigh — stable favorite rank #1 under 2026-05-13.stable-favorites; preferred reasoning effort xhigh; led the benchmark composite at 55/100; evidence is fresh; context window of 400k clears typical large-context work; runner-up Claude Opus 4.7 is held in reserve for instant failover.

    1. #1
      GPT-5.5 xhigh · OpenAI · openai_compat
      selection 99/100 · evidence 59/100 · coverage 86%
      • bench 55
      • fresh 100
      • rel not reported
      • latency 43
      • cost 22
      • ctx 400k
      • avail unknown
      • board favorite #1

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #1 receives a deterministic 12 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 55/100 across 6 sources.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 400k tokens.
      • Wire-format family: openai_compat.

    2. #2
      Claude Opus 4.7 · Anthropic · anthropic
      selection 96/100 · evidence 58/100 · coverage 86%
      • bench 53
      • fresh 100
      • rel not reported
      • latency 45
      • cost 18
      • ctx 1M
      • avail unknown
      • board favorite #2

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #2 receives a deterministic 8 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 53/100 across 2 sources.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 1M tokens.
      • Wire-format family: anthropic.

    3. #3
      GLM 5.1 · Z.ai · openai_compat
      selection 93/100 · evidence 53/100 · coverage 86%
      • bench 40
      • fresh 100
      • rel not reported
      • latency 60
      • cost 66
      • ctx 256k
      • avail unknown
      • board favorite #3

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #3 receives a deterministic 5 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 40/100 across 2 sources.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Mid-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 256k tokens.
      • Wire-format family: openai_compat.

    Why other candidates didn't make the board pick (12 dropped)
    • GPT-5.3 Codex · OpenAI

      Selection policy did not clear the leader's margin for this task.

      Evidence 57/100; selection 90/100 vs. leader 99/100.

    • GLM 5 · Z.ai

      Selection policy did not clear the leader's margin for this task.

      Evidence 57/100; selection 87/100 vs. leader 99/100.

    • DeepSeek V4 Pro · DeepSeek

      Selection policy did not clear the leader's margin for this task.

      Evidence 56/100; selection 84/100 vs. leader 99/100.

    • Gemini 3.1 Pro (preview) · Google

      Selection policy did not clear the leader's margin for this task.

      Evidence 55/100; selection 81/100 vs. leader 99/100.

    • Kimi K2.6 · Moonshot · Kimi

      Selection policy did not clear the leader's margin for this task.

      Evidence 55/100; selection 78/100 vs. leader 99/100.

    • MiniMax M2.7 · MiniMax

      Selection policy did not clear the leader's margin for this task.

      Evidence 55/100; selection 75/100 vs. leader 99/100.

    • Claude Sonnet 4.6 · Anthropic

      Selection policy did not clear the leader's margin for this task.

      Evidence 53/100; selection 72/100 vs. leader 99/100.

    • GLM 4.7 · Z.ai

      Selection policy did not clear the leader's margin for this task.

      Evidence 52/100; selection 69/100 vs. leader 99/100.

    • DeepSeek V4 Flash · DeepSeek

      Selection policy did not clear the leader's margin for this task.

      Evidence 52/100; selection 66/100 vs. leader 99/100.

    • Gemini 3 Flash · Google

      Selection policy did not clear the leader's margin for this task.

      Evidence 50/100; selection 63/100 vs. leader 99/100.

    • Kimi K2.5 · Moonshot · Kimi

      Selection policy did not clear the leader's margin for this task.

      Evidence 49/100; selection 60/100 vs. leader 99/100.

    • GPT-5.4 mini · OpenAI

      Selection policy did not clear the leader's margin for this task.

      Evidence 47/100; selection 57/100 vs. leader 99/100.
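The selection scores listed for Coding step down by 3 points per rank, from 99 for the pick to 57 for the fifteenth candidate. One way to reproduce that pattern is sketched below. It is inferred only from the numbers displayed in this rundown; the function name and the `top` and `step` parameters are assumptions, not documented BurnBar settings.

```python
def calibrated_selection(rank: int, top: int = 99, step: int = 3) -> int:
    """Map a 1-based policy rank to a displayed selection score.

    Assumes a fixed linear step per rank, as the displayed scores suggest.
    """
    if rank < 1:
        raise ValueError("rank is 1-based")
    return top - step * (rank - 1)


# Ranks 1 to 3 are the board picks; rank 15 is the last dropped candidate.
print([calibrated_selection(r) for r in (1, 2, 3, 4, 15)])
```

Because the score is derived from rank alone, it says nothing about how close two candidates are on evidence; the evidence figure on each line carries that information.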

  2. Terminal

    Shell-loop agents that execute, observe, and self-correct.

    Today's pick: GLM 5.1. Stable favorite rank #3 under 2026-05-13.stable-favorites; benchmark composite 66/100, behind DeepSeek V4 Pro at 68/100 and Kimi K2.6 at 67/100, neither of which has cleared the dethroning margins; evidence is fresh; cost is competitive; context window of 256k clears typical large-context work; runner-up DeepSeek V4 Pro is held in reserve for instant failover.

    1. #1
      GLM 5.1 · Z.ai · openai_compat
      selection 99/100 · evidence 68/100 · coverage 86%
      • bench 66
      • fresh 100
      • rel 65
      • latency not reported
      • cost 66
      • ctx 256k
      • avail unknown
      • board favorite #3

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #3 receives a deterministic 5 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 66/100 across 2 sources.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Mid-tier per-token cost.
      • Context window: 256k tokens.
      • Wire-format family: openai_compat.

    2. #2
      DeepSeek V4 Pro · DeepSeek · openai_compat
      selection 96/100 · evidence 70/100 · coverage 86%
      • bench 68
      • fresh 100
      • rel 65
      • latency not reported
      • cost 72
      • ctx 128k
      • avail unknown

      Board verdict

      • Evidence score only; to outrank a protected favorite, a challenger must clear both evidence and benchmark dethroning margins across consecutive rundowns.
      • Composite benchmark score 68/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Cost-efficient at typical blended pricing.
      • Context window: 128k tokens.
      • Wire-format family: openai_compat.

    3. #3
      Kimi K2.6 · Moonshot · Kimi · openai_compat
      selection 93/100 · evidence 68/100 · coverage 86%
      • bench 67
      • fresh 100
      • rel 65
      • latency not reported
      • cost 60
      • ctx 262k
      • avail unknown

      Board verdict

      • Evidence score only; to outrank a protected favorite, a challenger must clear both evidence and benchmark dethroning margins across consecutive rundowns.
      • Composite benchmark score 67/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Mid-tier per-token cost.
      • Context window: 262k tokens.
      • Wire-format family: openai_compat.

    Why other candidates didn't make the board pick (5 dropped)
    • MiniMax M2.7 · MiniMax

      Selection policy did not clear the leader's margin for this task.

      Evidence 63/100; selection 90/100 vs. leader 99/100.

    • DeepSeek V4 Flash · DeepSeek

      Selection policy did not clear the leader's margin for this task.

      Evidence 62/100; selection 87/100 vs. leader 99/100.

    • GLM 5 · Z.ai

      Selection policy did not clear the leader's margin for this task.

      Evidence 61/100; selection 84/100 vs. leader 99/100.

    • Kimi K2.5 · Moonshot · Kimi

      Selection policy did not clear the leader's margin for this task.

      Evidence 56/100; selection 81/100 vs. leader 99/100.

    • GLM 4.7 · Z.ai

      Selection policy did not clear the leader's margin for this task.

      Evidence 51/100; selection 78/100 vs. leader 99/100.
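In the Terminal category, DeepSeek V4 Pro shows higher evidence (70 vs. 68) and a higher benchmark composite (68 vs. 66) than GLM 5.1, yet ranks second because it has not cleared both dethroning margins on consecutive rundowns. A minimal sketch of such a check follows. The margin values and the streak length are hypothetical, since this rundown does not publish them, and `Reading` and `dethrones` are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Reading:
    evidence: float  # evidence score, 0-100
    bench: float     # composite benchmark score, 0-100


def dethrones(
    history: List[Tuple[Reading, Reading]],
    evidence_margin: float,
    bench_margin: float,
    required_streak: int = 2,  # assumed meaning of "consecutive rundowns"
) -> bool:
    """history holds (favorite, challenger) pairs per rundown, oldest first."""
    if len(history) < required_streak:
        return False
    recent = history[-required_streak:]
    # The challenger must clear BOTH margins in every recent rundown.
    return all(
        ch.evidence - fav.evidence >= evidence_margin
        and ch.bench - fav.bench >= bench_margin
        for fav, ch in recent
    )


today = (Reading(evidence=68, bench=66), Reading(evidence=70, bench=68))
print(dethrones([today], evidence_margin=1, bench_margin=1))
```

Under these assumptions a single strong day is never enough: with only one rundown in the history the check returns False regardless of the margins.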

  3. General

    Mixed-intent chat / one-shot questions / catch-all routing.

    Today's pick: GPT-5.5 xhigh. Stable favorite rank #1 under 2026-05-13.stable-favorites; preferred reasoning effort xhigh; benchmark composite 53/100, behind runner-up Claude Opus 4.7 at 55/100, which has not cleared the dethroning margins; evidence is fresh; context window of 400k clears typical large-context work; runner-up Claude Opus 4.7 is held in reserve for instant failover.

    1. #1
      GPT-5.5 xhigh · OpenAI · openai_compat
      selection 99/100 · evidence 58/100 · coverage 86%
      • bench 53
      • fresh 100
      • rel not reported
      • latency 43
      • cost 22
      • ctx 400k
      • avail unknown
      • board favorite #1

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #1 receives a deterministic 12 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 53/100 across 6 sources.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 400k tokens.
      • Wire-format family: openai_compat.

    2. #2
      Claude Opus 4.7 · Anthropic · anthropic
      selection 96/100 · evidence 59/100 · coverage 86%
      • bench 55
      • fresh 100
      • rel not reported
      • latency 45
      • cost 18
      • ctx 1M
      • avail unknown
      • board favorite #2

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #2 receives a deterministic 8 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 55/100 across 2 sources.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 1M tokens.
      • Wire-format family: anthropic.

    3. #3
      GLM 5.1 · Z.ai · openai_compat
      selection 93/100 · evidence 58/100 · coverage 86%
      • bench 48
      • fresh 100
      • rel not reported
      • latency 60
      • cost 66
      • ctx 256k
      • avail unknown
      • board favorite #3

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #3 receives a deterministic 5 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 48/100 across 2 sources.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Mid-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 256k tokens.
      • Wire-format family: openai_compat.

    Why other candidates didn't make the board pick (12 dropped)
    • GLM 5 · Z.ai

      Selection policy did not clear the leader's margin for this task.

      Evidence 60/100; selection 90/100 vs. leader 99/100.

    • MiniMax M2.7 · MiniMax

      Selection policy did not clear the leader's margin for this task.

      Evidence 59/100; selection 87/100 vs. leader 99/100.

    • Kimi K2.6 · Moonshot · Kimi

      Selection policy did not clear the leader's margin for this task.

      Evidence 58/100; selection 84/100 vs. leader 99/100.

    • DeepSeek V4 Pro · DeepSeek

      Selection policy did not clear the leader's margin for this task.

      Evidence 58/100; selection 81/100 vs. leader 99/100.

    • GPT-5.3 Codex · OpenAI

      Selection policy did not clear the leader's margin for this task.

      Evidence 58/100; selection 78/100 vs. leader 99/100.

    • Gemini 3.1 Pro (preview) · Google

      Selection policy did not clear the leader's margin for this task.

      Evidence 56/100; selection 75/100 vs. leader 99/100.

    • GLM 4.7 · Z.ai

      Selection policy did not clear the leader's margin for this task.

      Evidence 55/100; selection 72/100 vs. leader 99/100.

    • DeepSeek V4 Flash · DeepSeek

      Selection policy did not clear the leader's margin for this task.

      Evidence 55/100; selection 69/100 vs. leader 99/100.

    • Kimi K2.5 · Moonshot · Kimi

      Selection policy did not clear the leader's margin for this task.

      Evidence 54/100; selection 66/100 vs. leader 99/100.

    • Claude Sonnet 4.6 · Anthropic

      Selection policy did not clear the leader's margin for this task.

      Evidence 53/100; selection 63/100 vs. leader 99/100.

    • Gemini 3 Flash · Google

      Selection policy did not clear the leader's margin for this task.

      Evidence 51/100; selection 60/100 vs. leader 99/100.

    • GPT-5.4 mini · OpenAI

      Selection policy did not clear the leader's margin for this task.

      Evidence 46/100; selection 57/100 vs. leader 99/100.
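Each board verdict notes that older sources are weighted down, not dropped. One simple way to do that is a freshness-weighted mean, sketched below. The linear weighting and the name `weighted_composite` are assumptions, and the sample readings are made up for illustration; they are not taken from this rundown.

```python
from typing import List, Optional, Tuple


def weighted_composite(readings: List[Tuple[float, float]]) -> Optional[float]:
    """Combine (score, freshness) pairs, both on a 0-100 scale.

    Stale sources keep a smaller, non-zero weight instead of being discarded.
    Returns None when there is nothing to weigh, rather than guessing a value.
    """
    total_weight = sum(freshness for _, freshness in readings)
    if total_weight == 0:
        return None
    return sum(score * freshness for score, freshness in readings) / total_weight


# Hypothetical readings: one fresh source and one half-stale source.
print(round(weighted_composite([(60.0, 100.0), (40.0, 50.0)]), 1))
```

A plain average of the two sample scores would be 50; weighting by freshness pulls the composite toward the fresher source while the older reading still contributes.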

What this rundown is — and isn't

  • Benchmark snapshots are advisory only — runtime constraints (provider-family mode, user pinning, auth, quota, safety, and availability) override any ranking shown here.
  • Displayed order uses stable favorite policy 2026-05-13.stable-favorites: GPT-5.5 xhigh, Claude Opus 4.7, then GLM 5.1 stay preferred while routable and freshly benchmarked; a challenger must beat both evidence and benchmark margins across consecutive rundowns to dethrone them.
  • BurnBar does not fabricate benchmark numbers. Missing data is reported as 'not reported', never guessed.
  • Daily snapshots are sampled from public or documented sources; raw provider keys, cookies, and bearer tokens are never written into snapshots or this rundown.
  • One or more sources were unavailable for this day; the rundown reflects only the sources that responded.
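The displayed order in all three categories is consistent with adding each favorite's prior (12, 8, or 5 points, as quoted in the board verdicts) to its evidence score and sorting in descending order. The sketch below shows that reading using today's Terminal figures. Treating the prior as a simple addition to evidence is an assumption, `order_candidates` is an illustrative name, and the "routable and freshly benchmarked" condition is omitted for brevity.

```python
from typing import List, Optional, Tuple

# Prior points per favorite rank, as quoted in the board verdicts.
PRIOR_POINTS = {1: 12, 2: 8, 3: 5}

Candidate = Tuple[str, float, Optional[int]]  # (name, evidence, favorite rank)


def order_candidates(candidates: List[Candidate]) -> List[str]:
    """Order by evidence plus favorite prior, highest first."""
    def policy_score(candidate: Candidate) -> float:
        _, evidence, favorite_rank = candidate
        # Non-favorites (favorite_rank is None) get no prior.
        return evidence + PRIOR_POINTS.get(favorite_rank, 0)

    # sorted() is stable, so ties keep their input order.
    ranked = sorted(candidates, key=policy_score, reverse=True)
    return [name for name, _, _ in ranked]


terminal = [
    ("DeepSeek V4 Pro", 70, None),
    ("GLM 5.1", 68, 3),
    ("Kimi K2.6", 68, None),
    ("MiniMax M2.7", 63, None),
]
print(order_candidates(terminal))
```

With the 5 point prior, GLM 5.1 scores 73 and stays ahead of DeepSeek V4 Pro at 70, matching the Terminal order shown in this rundown.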

Operator notes

  • Generated by `node website/scripts/run-research.mjs` against live public benchmark adapters.
  • Snapshots from research: 1090. Catalog matches: 15.
  • Sources without an API key configured render as 'unavailable' — never guessed at.
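The "never guessed" rule for missing values and unconfigured sources can be illustrated with two small helpers. Both function names are hypothetical, and the "configured" label for a source that has a key is a placeholder rather than a status this rundown uses.

```python
from typing import Optional


def render_metric(label: str, value: Optional[float]) -> str:
    """Label a missing value instead of substituting a default or estimate."""
    if value is None:
        return f"{label}: not reported"
    return f"{label}: {value:g}"


def render_source(name: str, api_key: Optional[str]) -> str:
    """Without a key no request is made, so the source shows as unavailable."""
    if not api_key:
        return f"{name} unavailable"
    return f"{name} configured"  # placeholder status, not from the rundown


print(render_metric("latency", None))
print(render_metric("bench", 55))
print(render_source("Example Source", None))
```

Keeping the missing case explicit means a downstream reader can tell "no data" apart from a score of zero.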

Re-run today's routing locally.

Add an account, pick a model, and let the Fire Hydrant do the routing. Provider-family mode by default; Exact Model Failover when you want cross-provider recovery without changing the canonical model.