Operator log · daily intelligent router rundown

What BurnBar recommended on 2026-05-13.

A frozen snapshot: the same data the router used to score requests that day, ordered by task and explained with source citations. Benchmark signals are advisory — runtime constraints (provider-family mode, pinning, auth, quota, safety, availability) always win.

  • generated 12:00 UTC
  • 5 task categories
  • 5 sources

Rundown · 2026-05-13

Generated Wed, 13 May 2026 12:00:00 GMT · schema v1 · benchmarks advisory · runtime constraints win

  • Artificial Analysis unavailable
  • Terminal-Bench (via Hugging Face) stale · 14h old
  • Design Arena stale · 42h old
  • Hugging Face fresh
  • Manual OpenBurnBar fixture fresh
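
The header and source list above suggest a snapshot row shaped roughly like the sketch below. Only the visible fields (generation time, the `schema v1` tag, and per-source status and age) are grounded in this rundown; every field name is a guess.

```ts
// Hypothetical shape of one daily snapshot row. Only the fields visible
// in this rundown are grounded; names and types are assumptions.
interface SourceStatus {
  name: string;                              // e.g. "Design Arena"
  status: "fresh" | "stale" | "unavailable"; // as labeled above
  ageHours?: number;                         // absent when unavailable
}

interface DailySnapshot {
  generatedAt: string; // "2026-05-13T12:00:00Z"
  schemaVersion: 1;    // "schema v1" in the header
  sources: SourceStatus[];
}
```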

Benchmark data is advisory only. Provider-family mode, user pinning, account auth, quota state, safety policy, and availability are evaluated at runtime and override any ranking shown here.
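
As a rough illustration of that precedence, the sketch below gates candidates on runtime constraints before the advisory composite is ever consulted. Every name in it (`Candidate`, `RuntimeState`, `route`, the predicate fields) is a hypothetical stand-in, not BurnBar's actual code.

```ts
// Hypothetical shapes; BurnBar's real types are not published here.
interface Candidate {
  model: string;
  family: "openai_compat" | "anthropic";
  composite: number; // advisory benchmark composite, 0-100
}

interface RuntimeState {
  providerFamilyMode?: Candidate["family"]; // family lock, if enabled
  pinnedModel?: string;                     // explicit user pin
  authedFamilies: Set<Candidate["family"]>; // families with connected accounts
  quotaExhausted: Set<string>;              // models currently out of quota
  safetyBlocked: Set<string>;               // models blocked by safety policy
  unavailable: Set<string>;                 // models down right now
}

function route(candidates: Candidate[], rt: RuntimeState): Candidate | undefined {
  // A user pin wins outright, regardless of any score.
  const pinned = candidates.find((c) => c.model === rt.pinnedModel);
  if (pinned) return pinned;

  // Runtime constraints filter first; rankings never resurrect a model
  // that fails any of these gates.
  const eligible = candidates.filter(
    (c) =>
      (!rt.providerFamilyMode || c.family === rt.providerFamilyMode) &&
      rt.authedFamilies.has(c.family) &&
      !rt.quotaExhausted.has(c.model) &&
      !rt.safetyBlocked.has(c.model) &&
      !rt.unavailable.has(c.model),
  );

  // Only among survivors does the advisory composite pick the winner.
  return eligible.sort((a, b) => b.composite - a.composite)[0];
}
```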

  1. Coding

    Refactors, multi-file edits, repo-grounded code generation.

    Today's pick: GPT-5.5 Codex — led the benchmark composite at 89/100; evidence is the freshest available, even though older than ideal; context window of 400k clears typical large-context work; runner-up Claude Opus 4.7 is held in reserve for instant failover.
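
The "held in reserve for instant failover" behavior can be pictured as a simple two-slot fallback, as in the sketch below. The function names are invented for illustration; the primary and reserve slots correspond to today's pick (GPT-5.5 Codex) and the runner-up (Claude Opus 4.7).

```ts
// Minimal failover sketch: the pre-scored runner-up takes over on any
// provider error, with no re-ranking at request time. Names are hypothetical.
type Completion = (prompt: string) => Promise<string>;

async function completeWithReserve(
  prompt: string,
  primary: Completion, // today's pick
  reserve: Completion, // the runner-up, held in reserve
): Promise<string> {
  try {
    return await primary(prompt);
  } catch {
    return reserve(prompt); // instant failover, no re-scoring
  }
}
```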

    1. #1
      GPT-5.5 Codex · OpenAI · openai_compat
      Composite 77/100 · evidence 100%
      bench 89 · fresh 55 · rel 90 · latency 56 · cost 24 · ctx 400k · avail common

      Why this rank

      • Composite benchmark score 89/100 across 1 source.
      • Freshest evidence rated 55/100 — older sources are weighted down, not dropped (see the scoring sketch after this list).
      • Premium-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 400k tokens.
      • Wire-format family: openai_compat.

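A minimal sketch of how bullets like these could combine, assuming an exponential freshness decay and guessed blend weights. The rundown names the inputs (bench, fresh, rel, latency, cost) but not the formula, so treat every constant below as an assumption.

```ts
// Per-source freshness weight: stale sources are down-weighted, never
// dropped. The 24h half-life is an assumption, not a documented value.
function freshnessWeight(ageHours: number, halfLifeHours = 24): number {
  return Math.pow(0.5, ageHours / halfLifeHours);
}

// Source-level benchmark scores fold into the single "bench" chip.
function benchComposite(obs: { score: number; ageHours: number }[]): number {
  const weights = obs.map((o) => freshnessWeight(o.ageHours));
  const total = weights.reduce((a, b) => a + b, 0);
  return obs.reduce((sum, o, i) => sum + weights[i] * o.score, 0) / total;
}

// The chips then blend into the composite shown next to each rank.
// These weights are guesses tuned to land near the published numbers.
const WEIGHTS = { bench: 0.5, fresh: 0.1, rel: 0.2, latency: 0.1, cost: 0.1 };

function composite(d: { bench: number; fresh: number; rel: number; latency: number; cost: number }): number {
  return (
    WEIGHTS.bench * d.bench +
    WEIGHTS.fresh * d.fresh +
    WEIGHTS.rel * d.rel +
    WEIGHTS.latency * d.latency +
    WEIGHTS.cost * d.cost
  );
}

// #1 Coding chips (bench 89, fresh 55, rel 90, latency 56, cost 24)
// give 76.0 with these guessed weights, close to the 77 shown above.
```
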
    2. #2
      Claude Opus 4.7 · Anthropic · anthropic
      Composite 76/100 · evidence 100%
      bench 88 · fresh 55 · rel 88 · latency 46 · cost 18 · ctx 1M · avail common

      Why this rank

      • Composite benchmark score 88/100 across 1 source.
      • Freshest evidence rated 55/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 1M tokens.
      • Wire-format family: anthropic.

    3. #3
      GLM 5 · Z.ai · openai_compat
      Composite 75/100 · evidence 100%
      bench 82 · fresh 55 · rel 84 · latency 70 · cost 66 · ctx 256k · avail common

      Why this rank

      • Composite benchmark score 82/100 across 1 source.
      • Freshest evidence rated 55/100 — older sources are weighted down, not dropped.
      • Mid-tier per-token cost.
      • Latency profile is fast (high TPS, low TTFT).
      • Context window: 256k tokens.
      • Wire-format family: openai_compat.

    Why other candidates didn't make the cut (7 dropped; the drop-reason taxonomy is sketched after this list)
    • GPT-5.5 · OpenAI

      Composite score did not clear the leader's margin for this task.

      Composite 75/100 vs. leader 77/100.

    • MiniMax 2.7 · MiniMax

      Composite score did not clear the leader's margin for this task.

      Composite 73/100 vs. leader 77/100.

    • Kimi 2.6 · Moonshot

      Composite score did not clear the leader's margin for this task.

      Composite 72/100 vs. leader 77/100.

    • Claude Sonnet 4.6 · Anthropic

      Composite score did not clear the leader's margin for this task.

      Composite 70/100 vs. leader 77/100.

    • GPT-5.5 mini · OpenAI

      Composite score did not clear the leader's margin for this task.

      Composite 63/100 vs. leader 77/100.

    • Claude Haiku 4.5 · Anthropic

      Composite score did not clear the leader's margin for this task.

      Composite 60/100 vs. leader 77/100.

    • Gemini 3 Pro · Google

      Not routable through a connected BurnBar provider account.

      Composite 59/100 vs. leader 77/100.
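
The drop notes in lists like the one above reuse a small, closed set of reasons across the whole rundown: below the leader's margin, materially pricier at a comparable score, or not routable. A hypothetical type for that taxonomy; the shape and names are assumptions, only the reason strings are taken from this rundown.

```ts
// Hypothetical taxonomy of the drop reasons that appear in this rundown.
type DropReason =
  | { kind: "margin"; composite: number; leader: number } // below the leader's margin
  | { kind: "cost" }          // materially pricier at a comparable score
  | { kind: "not_routable" }; // no connected BurnBar provider account

function explain(reason: DropReason): string {
  switch (reason.kind) {
    case "margin":
      return `Composite ${reason.composite}/100 vs. leader ${reason.leader}/100.`;
    case "cost":
      return "Per-token cost is materially higher than the leader's at a comparable score.";
    case "not_routable":
      return "Not routable through a connected BurnBar provider account.";
  }
}
```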

  2. Terminal

    Shell-loop agents that execute, observe, and self-correct.

    Today's pick: GPT-5.5 Codex — led the benchmark composite at 86/100; evidence is fresh; context window of 400k clears typical large-context work; runner-up GPT-5.5 is held in reserve for instant failover.

    1. #1
      GPT-5.5 Codex · OpenAI · openai_compat
      Composite 84/100 · evidence 100%
      bench 86 · fresh 100 · rel 90 · latency 56 · cost 24 · ctx 400k · avail common

      Why this rank

      • Composite benchmark score 86/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 400k tokens.
      • Wire-format family: openai_compat.

    2. #2
      GPT-5.5 · OpenAI · openai_compat
      Composite 81/100 · evidence 100%
      bench 82 · fresh 100 · rel 90 · latency 58 · cost 22 · ctx 400k · avail common

      Why this rank

      • Composite benchmark score 82/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 400k tokens.
      • Wire-format family: openai_compat.

    3. #3
      GLM 5 · Z.ai · openai_compat
      Composite 81/100 · evidence 100%
      bench 77 · fresh 100 · rel 84 · latency 70 · cost 66 · ctx 256k · avail common

      Why this rank

      • Composite benchmark score 77/100 across 1 source.
      • Freshest evidence rated 100/100 — older sources are weighted down, not dropped.
      • Mid-tier per-token cost.
      • Latency profile is fast (high TPS, low TTFT).
      • Context window: 256k tokens.
      • Wire-format family: openai_compat.

    Why other candidates didn't make the cut (6 dropped)
    • MiniMax 2.7 · MiniMax

      Composite score did not clear the leader's margin for this task.

      Composite 79/100 vs. leader 84/100.

    • Claude Opus 4.7 · Anthropic

      Per-token cost is materially higher than the leader's at a comparable score.

      Composite 79/100 vs. leader 84/100.

    • Kimi 2.6 · Moonshot

      Composite score did not clear the leader's margin for this task.

      Composite 77/100 vs. leader 84/100.

    • Claude Sonnet 4.6 · Anthropic

      Composite score did not clear the leader's margin for this task.

      Composite 74/100 vs. leader 84/100.

    • GPT-5.5 mini · OpenAI

      Composite score did not clear the leader's margin for this task.

      Composite 67/100 vs. leader 84/100.

    • Claude Haiku 4.5 · Anthropic

      Composite score did not clear the leader's margin for this task.

      Composite 64/100 vs. leader 84/100.

  3. Design

    Website / UI / SVG / slide generation evaluated head-to-head.

    Today's pick: Claude Opus 4.7 — led the benchmark composite at 84/100; evidence is fresh; context window of 1M clears typical large-context work; runner-up GPT-5.5 is held in reserve for instant failover.

    1. #1
      Claude Opus 4.7 · Anthropic · anthropic
      Composite 78/100 · evidence 100%
      bench 84 · fresh 85 · rel 84 · latency 48 · cost 18 · ctx 1M · avail common

      Why this rank

      • Composite benchmark score 84/100 across 1 source.
      • Freshest evidence rated 85/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 1M tokens.
      • Wire-format family: anthropic.

    2. #2
      GPT-5.5 · OpenAI · openai_compat
      Composite 77/100 · evidence 100%
      bench 80 · fresh 85 · rel 88 · latency 58 · cost 22 · ctx 400k · avail common

      Why this rank

      • Composite benchmark score 80/100 across 1 source.
      • Freshest evidence rated 85/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 400k tokens.
      • Wire-format family: openai_compat.

    3. #3
      GLM 5 · Z.ai · openai_compat
      Composite 77/100 · evidence 100%
      bench 75 · fresh 85 · rel 84 · latency 70 · cost 66 · ctx 256k · avail common

      Why this rank

      • Composite benchmark score 75/100 across 1 source.
      • Freshest evidence rated 85/100 — older sources are weighted down, not dropped.
      • Mid-tier per-token cost.
      • Latency profile is fast (high TPS, low TTFT).
      • Context window: 256k tokens.
      • Wire-format family: openai_compat.

    Why other candidates didn't make the cut (3 dropped)
    • Kimi 2.6 · Moonshot

      Composite score did not clear the leader's margin for this task.

      Composite 75/100 vs. leader 78/100.

    • Claude Sonnet 4.6 · Anthropic

      Composite score did not clear the leader's margin for this task.

      Composite 74/100 vs. leader 78/100.

    • Gemini 3 Pro · Google

      Not routable through a connected BurnBar provider account.

      Composite 63/100 vs. leader 78/100.

  4. Analysis

    Long-context reasoning, summarization, structured extraction.

    Today's pick: Claude Opus 4.7 — led the benchmark composite at 90/100; evidence is the freshest available, even though older than ideal; context window of 1M clears typical large-context work; runner-up GPT-5.5 is held in reserve for instant failover.

    1. #1
      Claude Opus 4.7 · Anthropic · anthropic
      Composite 77/100 · evidence 100%
      bench 90 · fresh 55 · rel 86 · latency 48 · cost 18 · ctx 1M · avail common

      Why this rank

      • Composite benchmark score 90/100 across 1 source.
      • Freshest evidence rated 55/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 1M tokens.
      • Wire-format family: anthropic.

    2. #2
      GPT-5.5 · OpenAI · openai_compat
      Composite 76/100 · evidence 100%
      bench 88 · fresh 55 · rel 88 · latency 58 · cost 22 · ctx 400k · avail common

      Why this rank

      • Composite benchmark score 88/100 across 1 source.
      • Freshest evidence rated 55/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 400k tokens.
      • Wire-format family: openai_compat.

    3. #3
      Claude Sonnet 4.6 · Anthropic · anthropic
      Composite 72/100 · evidence 100%
      bench 83 · fresh 55 · rel 86 · latency 60 · cost 42 · ctx 1M · avail common

      Why this rank

      • Composite benchmark score 83/100 across 1 source.
      • Freshest evidence rated 55/100 — older sources are weighted down, not dropped.
      • Mid-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 1M tokens.
      • Wire-format family: anthropic.
      • Tier: mid. Ranked behind its flagship siblings at equivalent benchmark; pin the tier explicitly to invert this.

    Why other candidates didn't make the cut (1 dropped)
    • Gemini 3 Pro · Google

      Not routable through a connected BurnBar provider account.

      Composite 61/100 vs. leader 77/100.

  5. General

    Mixed-intent chat / one-shot questions / catch-all routing.

    Today's pick: Claude Opus 4.7 — led the benchmark composite at 88/100; evidence is the freshest available, even though older than ideal; context window of 1M clears typical large-context work; runner-up GPT-5.5 is held in reserve for instant failover.

    1. #1
      Claude Opus 4.7 · Anthropic · anthropic
      Composite 76/100 · evidence 100%
      bench 88 · fresh 55 · rel 88 · latency 48 · cost 18 · ctx 1M · avail common

      Why this rank

      • Composite benchmark score 88/100 across 1 source.
      • Freshest evidence rated 55/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 1M tokens.
      • Wire-format family: anthropic.

    2. #2
      GPT-5.5 · OpenAI · openai_compat
      Composite 76/100 · evidence 100%
      bench 87 · fresh 55 · rel 88 · latency 58 · cost 22 · ctx 400k · avail common

      Why this rank

      • Composite benchmark score 87/100 across 1 source.
      • Freshest evidence rated 55/100 — older sources are weighted down, not dropped.
      • Premium-tier per-token cost.
      • Latency is acceptable for non-interactive work.
      • Context window: 400k tokens.
      • Wire-format family: openai_compat.

    3. #3
      MiniMax 2.7 · MiniMax · openai_compat
      Composite 74/100 · evidence 100%
      bench 81 · fresh 55 · rel 82 · latency 68 · cost 62 · ctx 320k · avail common

      Why this rank

      • Composite benchmark score 81/100 across 1 source.
      • Freshest evidence rated 55/100 — older sources are weighted down, not dropped.
      • Mid-tier per-token cost.
      • Latency profile is fast (high TPS, low TTFT).
      • Context window: 320k tokens.
      • Wire-format family: openai_compat.

    Why other candidates didn't make the cut (2 dropped)
    • GPT-5.5 mini · OpenAI

      Composite score did not clear the leader's margin for this task.

      Composite 64/100 vs. leader 76/100.

    • Gemini 3 Pro · Google

      Not routable through a connected BurnBar provider account.

      Composite 61/100 vs. leader 76/100.

What this rundown is — and isn't

  • Benchmark snapshots are advisory only — runtime constraints (provider-family mode, user pinning, auth, quota, safety, and availability) override any ranking shown here.
  • BurnBar does not fabricate benchmark numbers. Missing data is reported as 'not reported', never guessed.
  • Daily snapshots are sampled from public or documented sources; raw provider keys, cookies, and bearer tokens are never written into snapshots or this rundown.
  • One or more sources were unavailable for this day; the rundown reflects only the sources that responded.

Operator notes

  • Static demo fixture for the website build (2026-05-13). Snapshots use real source attribution but freshness is reduced to 'manual' because no live API key is configured at static-build time.
  • Production daily ordering is generated by `refreshModelLandscapeBenchmarks` (functions/src/scheduled.ts) reading the Firestore `model_benchmark_snapshots` collection.
  • Ordering reflects each model's tier: flagship beats mid beats mini at equivalent benchmark; pin a tier explicitly to invert this (tie-break sketched below).
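
A sketch of that tier tie-break under stated assumptions: composites are bucketed into 2-point bands to stand in for "equivalent benchmark", and a pinned tier jumps the queue. Both the band width and the field names are invented here, not read from `refreshModelLandscapeBenchmarks`.

```ts
// Hypothetical tier ordering; band width and names are assumptions.
type Tier = "flagship" | "mid" | "mini";
const TIER_RANK: Record<Tier, number> = { flagship: 0, mid: 1, mini: 2 };

interface Entry { model: string; tier: Tier; composite: number; }

function orderByTier(entries: Entry[], pinnedTier?: Tier): Entry[] {
  // Quantize composites so "equivalent benchmark" scores compare equal.
  const band = (c: number) => Math.round(c / 2);
  return [...entries].sort((a, b) => {
    // An explicit tier pin inverts the default flagship-first preference.
    if (pinnedTier) {
      const ap = a.tier === pinnedTier ? 0 : 1;
      const bp = b.tier === pinnedTier ? 0 : 1;
      if (ap !== bp) return ap - bp;
    }
    const byBand = band(b.composite) - band(a.composite);
    if (byBand !== 0) return byBand;
    return TIER_RANK[a.tier] - TIER_RANK[b.tier]; // flagship > mid > mini
  });
}
```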

Re-run today's routing locally.

Add an account, pick a model, and let the Fire Hydrant do the routing. Provider-family mode by default; intelligent mode opt-in.