Model board · daily advisory rundown

What the board recommended on 2026-05-13.

Frozen snapshot. A board of language models ran research and analysis tasks over the same daily data, then BurnBar reduced the result to deterministic selections with source citations. Benchmark signals are advisory — runtime constraints (provider-family mode, Exact Model Failover's canonical-ID gate, pinning, auth, quota, safety, availability) always win.

  • generated 15:43 UTC
  • 5 task categories
  • 4 sources

Generated Wed, 13 May 2026 15:43:14 GMT · schema v1 · model board · runtime constraints win

  • Artificial Analysis: error
  • Terminal-Bench (via Hugging Face): fresh
  • Design Arena: unavailable
  • Manual OpenBurnBar fixture: fresh

A daily board of language models runs research and analysis tasks across the benchmark feed, then BurnBar reduces their findings to this deterministic recommendation. Benchmark data is advisory only; user pins, auth, quota, safety, availability, and exact-model failover rules still win at runtime.
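
The precedence described above can be sketched in a few lines. This is a minimal illustration, not BurnBar's implementation: the function name `resolve_route`, the `routable` set (a stand-in for the auth, quota, safety, and availability checks), and the rule that a pin must still pass those checks are all assumptions.

```python
from typing import Optional, Sequence, Set


def resolve_route(advisory_order: Sequence[str], routable: Set[str],
                  pinned: Optional[str] = None) -> Optional[str]:
    """Pick a model: a user pin wins, otherwise the first routable board entry."""
    if pinned is not None:
        # Assumption: a pin overrides the board but still has to pass runtime checks.
        return pinned if pinned in routable else None
    for model in advisory_order:
        if model in routable:
            return model
    return None


board = ["GPT-5.5 xhigh", "Claude Opus 4.7", "GLM 5.1"]  # today's displayed order
routable_now = {"Claude Opus 4.7", "GLM 5.1"}  # hypothetical: the top pick fails a runtime check

print(resolve_route(board, routable_now))                    # Claude Opus 4.7
print(resolve_route(board, routable_now, pinned="GLM 5.1"))  # GLM 5.1
```

The board order only decides among models that are already routable; it never makes an unroutable model eligible.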

  1. Coding

    Refactors, multi-file edits, repo-grounded code generation.

    Today's pick: GPT-5.5 xhigh. Stable favorite rank #1 under 2026-05-13.stable-favorites; preferred reasoning effort xhigh; benchmark composite 86/100 (runner-up Claude Opus 4.7 scores 88/100; the stable-favorite prior keeps GPT-5.5 xhigh first); evidence is fresh; the 400k context window clears typical large-context work; Claude Opus 4.7 is held in reserve for instant failover.

    1. #1
      GPT-5.5 xhigh · OpenAI · openai_compat
      selection 99/100 · evidence 75 · coverage 71%
      • bench 86
      • fresh 100
      • rel
      • latency
      • cost 22
      • ctx 400k
      • avail unknown
      • board favorite #1

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #1 receives a deterministic 12 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 86/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Context window: 400k tokens.
      • Wire-format family: openai_compat.

    2. #2
      Claude Opus 4.7 · Anthropic · anthropic
      selection 96/100 · evidence 76 · coverage 71%
      • bench 88
      • fresh 100
      • rel
      • latency
      • cost 18
      • ctx 1M
      • avail unknown
      • board favorite #2

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #2 receives a deterministic 8 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 88/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Context window: 1000k tokens.
      • Wire-format family: anthropic.

    3. #3
      GLM 5.1 · Z.ai · openai_compat
      selection 93/100 · evidence 78 · coverage 71%
      • bench 83
      • fresh 100
      • rel
      • latency
      • cost 66
      • ctx 256k
      • avail unknown
      • board favorite #3

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #3 receives a deterministic 5 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 83/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Mid-tier per-token cost.
      • Context window: 256k tokens.
      • Wire-format family: openai_compat.

    Why other candidates didn't make the board pick (5 dropped)
    • GLM 5 · Z.ai

      Selection policy did not clear the leader's margin for this task.

      Evidence 76/100; selection 90/100 vs. leader 99/100.

    • Gemini 3.1 Pro (preview) · Google

      Selection policy did not clear the leader's margin for this task.

      Evidence 75/100; selection 87/100 vs. leader 99/100.

    • MiniMax M2.7 · MiniMax

      Selection policy did not clear the leader's margin for this task.

      Evidence 74/100; selection 84/100 vs. leader 99/100.

    • Kimi K2.6 · Moonshot · Kimi

      Selection policy did not clear the leader's margin for this task.

      Evidence 72/100; selection 81/100 vs. leader 99/100.

    • Claude Sonnet 4.6 · Anthropic

      Selection policy did not clear the leader's margin for this task.

      Evidence 70/100; selection 78/100 vs. leader 99/100.

  2. Terminal

    Shell-loop agents that execute, observe, and self-correct.

    Today's pick: GPT-5.5 xhigh — stable favorite rank #1 under 2026-05-13.stable-favorites; preferred reasoning effort xhigh; led the benchmark composite at 82/100; evidence is fresh; context window of 400k clears typical large-context work; runner-up Claude Opus 4.7 is held in reserve for instant failover.

    1. #1
      GPT-5.5 xhigh · OpenAI · openai_compat
      selection 99/100 · evidence 75 · coverage 71%
      • bench 82
      • fresh 100
      • rel
      • latency
      • cost 22
      • ctx 400k
      • avail unknown
      • board favorite #1

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #1 receives a deterministic 12 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 82/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Context window: 400k tokens.
      • Wire-format family: openai_compat.

    2. #2
      Claude Opus 4.7 · Anthropic · anthropic
      selection 96/100 · evidence 73 · coverage 71%
      • bench 79
      • fresh 100
      • rel
      • latency
      • cost 18
      • ctx 1M
      • avail unknown
      • board favorite #2

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #2 receives a deterministic 8 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 79/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Context window: 1000k tokens.
      • Wire-format family: anthropic.

    3. #3
      GLM 5.1 · Z.ai · openai_compat
      selection 93/100 · evidence 71 · coverage 86%
      • bench 70
      • fresh 100
      • rel 65
      • latency
      • cost 66
      • ctx 256k
      • avail unknown
      • board favorite #3

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #3 receives a deterministic 5 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 70/100 across 3 sources.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Mid-tier per-token cost.
      • Context window: 256k tokens.
      • Wire-format family: openai_compat.

    Why other candidates didn't make the board pick (8 dropped)
    • Kimi K2.6 · Moonshot · Kimi

      Selection policy did not clear the leader's margin for this task.

      Evidence 70/100; selection 90/100 vs. leader 99/100.

    • DeepSeek V4 Pro · DeepSeek

      Selection policy did not clear the leader's margin for this task.

      Evidence 70/100; selection 87/100 vs. leader 99/100.

    • MiniMax M2.7 · MiniMax

      Selection policy did not clear the leader's margin for this task.

      Evidence 69/100; selection 84/100 vs. leader 99/100.

    • Claude Sonnet 4.6 · Anthropic

      Selection policy did not clear the leader's margin for this task.

      Evidence 68/100; selection 81/100 vs. leader 99/100.

    • GLM 5 · Z.ai

      Selection policy did not clear the leader's margin for this task.

      Evidence 68/100; selection 78/100 vs. leader 99/100.

    • DeepSeek V4 Flash · DeepSeek

      Selection policy did not clear the leader's margin for this task.

      Evidence 62/100; selection 75/100 vs. leader 99/100.

    • Kimi K2.5 · Moonshot · Kimi

      Selection policy did not clear the leader's margin for this task.

      Evidence 56/100; selection 72/100 vs. leader 99/100.

    • GLM 4.7 · Z.ai

      Selection policy did not clear the leader's margin for this task.

      Evidence 51/100; selection 69/100 vs. leader 99/100.

  3. Design

    Website / UI / SVG / slide generation evaluated head-to-head.

    Today's pick: GPT-5.5 xhigh. Stable favorite rank #1 under 2026-05-13.stable-favorites; preferred reasoning effort xhigh; benchmark composite 80/100 (runner-up Claude Opus 4.7 scores 84/100; the stable-favorite prior keeps GPT-5.5 xhigh first); evidence is fresh; the 400k context window clears typical large-context work; Claude Opus 4.7 is held in reserve for instant failover.

    1. #1
      GPT-5.5 xhigh · OpenAI · openai_compat
      selection 99/100 · evidence 73 · coverage 71%
      • bench 80
      • fresh 100
      • rel
      • latency
      • cost 22
      • ctx 400k
      • avail unknown
      • board favorite #1

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #1 receives a deterministic 12 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 80/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Context window: 400k tokens.
      • Wire-format family: openai_compat.

    2. #2
      Claude Opus 4.7 · Anthropic · anthropic
      selection 96/100 · evidence 75 · coverage 71%
      • bench 84
      • fresh 100
      • rel
      • latency
      • cost 18
      • ctx 1M
      • avail unknown
      • board favorite #2

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #2 receives a deterministic 8 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 84/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Context window: 1000k tokens.
      • Wire-format family: anthropic.

    3. #3
      GLM 5.1 · Z.ai · openai_compat
      selection 93/100 · evidence 73 · coverage 71%
      • bench 76
      • fresh 100
      • rel
      • latency
      • cost 66
      • ctx 256k
      • avail unknown
      • board favorite #3

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #3 receives a deterministic 5 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 76/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Mid-tier per-token cost.
      • Context window: 256k tokens.
      • Wire-format family: openai_compat.

    Why other candidates didn't make the board pick (4 dropped)
    • Gemini 3.1 Pro (preview) · Google

      Selection policy did not clear the leader's margin for this task.

      Evidence 75/100; selection 90/100 vs. leader 99/100.

    • GLM 5 · Z.ai

      Selection policy did not clear the leader's margin for this task.

      Evidence 73/100; selection 87/100 vs. leader 99/100.

    • Kimi K2.6 · Moonshot · Kimi

      Selection policy did not clear the leader's margin for this task.

      Evidence 71/100; selection 84/100 vs. leader 99/100.

    • Claude Sonnet 4.6 · Anthropic

      Selection policy did not clear the leader's margin for this task.

      Evidence 70/100; selection 81/100 vs. leader 99/100.

  4. Analysis

    Long-context reasoning, summarization, structured extraction.

    Today's pick: GPT-5.5 xhigh. Stable favorite rank #1 under 2026-05-13.stable-favorites; preferred reasoning effort xhigh; benchmark composite 88/100 (runner-up Claude Opus 4.7 scores 90/100; the stable-favorite prior keeps GPT-5.5 xhigh first); evidence is fresh; the 400k context window clears typical large-context work; Claude Opus 4.7 is held in reserve for instant failover.

    1. #1
      GPT-5.5 xhigh · OpenAI · openai_compat
      selection 99/100 · evidence 77 · coverage 71%
      • bench 88
      • fresh 100
      • rel
      • latency
      • cost 22
      • ctx 400k
      • avail unknown
      • board favorite #1

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #1 receives a deterministic 12 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 88/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Context window: 400k tokens.
      • Wire-format family: openai_compat.

    2. #2
      Claude Opus 4.7 · Anthropic · anthropic
      selection 96/100 · evidence 78 · coverage 71%
      • bench 90
      • fresh 100
      • rel
      • latency
      • cost 18
      • ctx 1M
      • avail unknown
      • board favorite #2

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #2 receives a deterministic 8 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 90/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Context window: 1000k tokens.
      • Wire-format family: anthropic.

    3. #3
      GLM 5.1 · Z.ai · openai_compat
      selection 93/100 · evidence 79 · coverage 71%
      • bench 86
      • fresh 100
      • rel
      • latency
      • cost 66
      • ctx 256k
      • avail unknown
      • board favorite #3

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #3 receives a deterministic 5 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 86/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Mid-tier per-token cost.
      • Context window: 256k tokens.
      • Wire-format family: openai_compat.

    Why other candidates didn't make the board pick (2 dropped)
    • Gemini 3.1 Pro (preview) · Google

      Selection policy did not clear the leader's margin for this task.

      Evidence 78/100; selection 90/100 vs. leader 99/100.

    • Claude Sonnet 4.6 · Anthropic

      Selection policy did not clear the leader's margin for this task.

      Evidence 72/100; selection 87/100 vs. leader 99/100.

  5. General

    Mixed-intent chat / one-shot questions / catch-all routing.

    Today's pick: GPT-5.5 xhigh. Stable favorite rank #1 under 2026-05-13.stable-favorites; preferred reasoning effort xhigh; benchmark composite 87/100 (runner-up Claude Opus 4.7 scores 88/100; the stable-favorite prior keeps GPT-5.5 xhigh first); evidence is fresh; the 400k context window clears typical large-context work; Claude Opus 4.7 is held in reserve for instant failover.

    1. #1
      GPT-5.5 xhigh · OpenAI · openai_compat
      selection 99/100 · evidence 76 · coverage 71%
      • bench 87
      • fresh 100
      • rel
      • latency
      • cost 22
      • ctx 400k
      • avail unknown
      • board favorite #1

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #1 receives a deterministic 12 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 87/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Context window: 400k tokens.
      • Wire-format family: openai_compat.

    2. #2
      Claude Opus 4.7 · Anthropic · anthropic
      selection 96/100 · evidence 76 · coverage 71%
      • bench 88
      • fresh 100
      • rel
      • latency
      • cost 18
      • ctx 1M
      • avail unknown
      • board favorite #2

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #2 receives a deterministic 8 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 88/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Context window: 1000k tokens.
      • Wire-format family: anthropic.

    3. #3
      GLM 5.1 · Z.ai · openai_compat
      selection 93/100 · evidence 78 · coverage 71%
      • bench 84
      • fresh 100
      • rel
      • latency
      • cost 66
      • ctx 256k
      • avail unknown
      • board favorite #3

      Board verdict

      • Stable favorite policy 2026-05-13.stable-favorites: favorite rank #3 receives a deterministic 5 point prior until a challenger clears both dethroning margins on consecutive rundowns; the final selection score is calibrated after policy ordering so the public number matches the chosen rank.
      • Composite benchmark score 84/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Mid-tier per-token cost.
      • Context window: 256k tokens.
      • Wire-format family: openai_compat.

    Why other candidates didn't make the board pick (2 dropped)
    • Gemini 3.1 Pro (preview) · Google

      Selection policy did not clear the leader's margin for this task.

      Evidence 77/100; selection 90/100 vs. leader 99/100.

    • MiniMax M2.7 · MiniMax

      Selection policy did not clear the leader's margin for this task.

      Evidence 75/100; selection 87/100 vs. leader 99/100.
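
The ordering and the public selection scores in the five categories above can be reproduced with a small sketch. The rundown does not publish BurnBar's scoring formula, so two things here are assumptions: that the stable-favorite prior (12, 8, 5 points) is added to the evidence score before sorting, and that the calibrated selection score is 99 minus 3 per rank step. Both were chosen only because they match the numbers shown for this day; the function names are hypothetical.

```python
# Priors quoted in the board verdicts (favorite ranks #1, #2, #3).
FAVORITE_PRIOR = {"GPT-5.5 xhigh": 12, "Claude Opus 4.7": 8, "GLM 5.1": 5}

# Evidence scores for the Coding category, copied from the rundown.
CODING_EVIDENCE = {
    "GPT-5.5 xhigh": 75,
    "Claude Opus 4.7": 76,
    "GLM 5.1": 78,
    "GLM 5": 76,
    "Gemini 3.1 Pro (preview)": 75,
    "MiniMax M2.7": 74,
    "Kimi K2.6": 72,
    "Claude Sonnet 4.6": 70,
}


def rank_with_priors(evidence):
    """Order models by evidence plus prior (assumed combination), highest first."""
    return sorted(evidence,
                  key=lambda model: evidence[model] + FAVORITE_PRIOR.get(model, 0),
                  reverse=True)


def calibrated_selection(rank):
    """Public selection score for a 1-based rank (pattern inferred from this snapshot)."""
    return 99 - 3 * (rank - 1)


order = rank_with_priors(CODING_EVIDENCE)
for rank, model in enumerate(order, start=1):
    print(rank, model, calibrated_selection(rank))
```

Under these assumptions Claude Opus 4.7 (76 + 8 = 84) stays behind GPT-5.5 xhigh (75 + 12 = 87) despite its higher evidence score, and GLM 5 lands fourth with a selection score of 90, as listed in the Coding section.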

What this rundown is — and isn't

  • Benchmark snapshots are advisory only — runtime constraints (provider-family mode, user pinning, auth, quota, safety, and availability) override any ranking shown here.
  • Displayed order uses stable favorite policy 2026-05-13.stable-favorites: GPT-5.5 xhigh, Claude Opus 4.7, then GLM 5.1 stay preferred while routable and freshly benchmarked; a challenger must beat both evidence and benchmark margins across consecutive rundowns to dethrone them.
  • BurnBar does not fabricate benchmark numbers. Missing data is reported as 'not reported', never guessed.
  • Daily snapshots are sampled from public or documented sources; raw provider keys, cookies, and bearer tokens are never written into snapshots or this rundown.
  • Two of the four sources did not return data for this day (Artificial Analysis: error; Design Arena: unavailable); the rundown reflects only the sources that responded.
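
The dethroning rule in the list above can be sketched as a check over recent rundowns. The margin values and the reading of "consecutive" as two rundowns in a row are hypothetical, since the rundown does not state them; `DayScores`, `clears_both`, and `dethroned` are illustrative names.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the rundown does not publish the real margins.
EVIDENCE_MARGIN = 3
BENCHMARK_MARGIN = 3
REQUIRED_STREAK = 2  # "consecutive rundowns" read as two in a row


@dataclass(frozen=True)
class DayScores:
    challenger_evidence: int
    challenger_bench: int
    favorite_evidence: int
    favorite_bench: int


def clears_both(day):
    """True when the challenger beats the favorite by both margins on one rundown."""
    return (day.challenger_evidence - day.favorite_evidence >= EVIDENCE_MARGIN
            and day.challenger_bench - day.favorite_bench >= BENCHMARK_MARGIN)


def dethroned(history):
    """True when the most recent REQUIRED_STREAK rundowns all clear both margins."""
    recent = history[-REQUIRED_STREAK:]
    return len(recent) == REQUIRED_STREAK and all(clears_both(d) for d in recent)


# Coding, 2026-05-13: Claude Opus 4.7 (evidence 76, bench 88) vs GPT-5.5 xhigh (75, 86).
today = DayScores(76, 88, 75, 86)
strong = DayScores(80, 92, 75, 86)  # hypothetical day with a wide lead

print(dethroned([today, today]))    # False
print(dethroned([strong, today]))   # False
print(dethroned([strong, strong]))  # True
```

Requiring both margins on consecutive days keeps a single noisy benchmark run from reshuffling the displayed order.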

Operator notes

  • Generated by `node website/scripts/run-research.mjs` against live public benchmark adapters.
  • Snapshots from research: 74. Catalog matches: 12.
  • Sources without an API key configured render as 'unavailable' — never guessed at.
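
A minimal sketch of the "never guessed" rendering rules, under stated assumptions: the status strings `unavailable`, `error`, and `fresh` appear in this rundown, but the mapping from a failed fetch to `error` and the helper names `source_status` and `render_metric` are illustrative guesses rather than the script's actual logic.

```python
from typing import Optional


def source_status(has_api_key: bool, fetch_ok: bool) -> str:
    """Map a source's configuration and fetch result to a displayed status."""
    if not has_api_key:
        return "unavailable"  # stated in the operator notes
    return "fresh" if fetch_ok else "error"  # assumed mapping for the other two


def render_metric(label: str, value: Optional[int]) -> str:
    """Render a metric, reporting a missing value instead of inventing one."""
    if value is None:
        return f"{label}: not reported"
    return f"{label}: {value}"


print(source_status(has_api_key=False, fetch_ok=False))  # unavailable
print(render_metric("bench", 86))                        # bench: 86
print(render_metric("latency", None))                    # latency: not reported
```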

Re-run today's routing locally.

Add an account, pick a model, and let the Fire Hydrant do the routing. Provider-family mode by default; Exact Model Failover when you want cross-provider recovery without changing the canonical model.
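
One way to picture Exact Model Failover is as a filter on the canonical model ID: recovery may switch provider, but never model. The sketch below is an illustration under that reading only; the `Route` type, its fields, and the provider and model names are hypothetical and do not come from BurnBar.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Route:
    provider: str      # hypothetical provider label
    canonical_id: str  # hypothetical canonical model identifier
    healthy: bool


def exact_model_failover(routes: Sequence[Route], canonical_id: str) -> Optional[Route]:
    """Return the first healthy route that serves exactly the requested model."""
    for route in routes:
        # The gate: a route for a different canonical model is never eligible.
        if route.canonical_id == canonical_id and route.healthy:
            return route
    return None


routes = [
    Route("provider-a", "example-model", healthy=False),
    Route("provider-b", "other-model", healthy=True),
    Route("provider-c", "example-model", healthy=True),
]

chosen = exact_model_failover(routes, "example-model")
print(chosen.provider if chosen else "no route")  # provider-c
```

The healthy `provider-b` route is skipped because it serves a different canonical model, which is the point of the gate.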