The Bleeding Edge

// Article · May 9, 2026

SubQ and the end of the transformer's memory tax

A new architecture claims to make 12-million-token context cheap. Half the AI tooling industry is selling you a workaround for a tax that might be about to disappear.


Every AI engineering pattern of the last three years — RAG, vector embeddings, agent frameworks, sub-agent orchestration, MIT's Recursive Language Models, Claude's CLAUDE.md, OpenAI's memory tool — was invented to dodge one fact: standard transformer attention scales O(n²). Double the input, quadruple the cost. Nobody could afford to let the model just read everything, so the entire industry built scaffolding to read a little, search a little, summarise a little, pray a little.
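To make the "double the input, quadruple the cost" arithmetic concrete, here is a minimal sketch comparing relative cost under quadratic and linear scaling. It models only the n² versus n term and ignores constants, layer counts and everything else in a real model. The function names and the 1,000-token baseline are illustrative assumptions, not anything published by Subquadratic.

```python
def quadratic_cost(n_tokens: int, baseline: int = 1_000) -> float:
    """Relative cost of O(n^2) attention, normalised so `baseline` tokens cost 1."""
    return (n_tokens / baseline) ** 2


def linear_cost(n_tokens: int, baseline: int = 1_000) -> float:
    """Relative cost of an O(n) mechanism under the same normalisation."""
    return n_tokens / baseline


if __name__ == "__main__":
    # Print how each curve grows from a 1K-token prompt up to 12M tokens.
    for n in (1_000, 2_000, 128_000, 1_000_000, 12_000_000):
        print(
            f"{n:>12,} tokens  "
            f"quadratic x{quadratic_cost(n):>14,.0f}  "
            f"linear x{linear_cost(n):>8,.0f}"
        )
```

Under this toy model, going from 1K to 12M tokens multiplies the quadratic term by 144,000,000 and the linear term by 12,000. That gap, not the headline context size, is what the rest of this piece is about.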

This week, a Miami startup called Subquadratic came out of stealth and claimed it has built the first commercial frontier LLM where reading everything is suddenly cheap. The 12-million-token context window is the headline. The architecture underneath it is the actual story.

The Core Claim

SubQ's architecture (SSA — Subquadratic Selective Attention) scales linearly with input length. The performance numbers, all from Subquadratic's own benchmarks:

| Benchmark | SubQ | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| RULER 128K (long-context accuracy) | 97% | 94% | | |
| Cost to run RULER 128K | $8 | ~$2,600 | | |
| MRCR v2 (multi-needle retrieval) | 83 | 78 | 39 | 23 |
| 12M-token recall | 92% | (out of reach) | (out of reach) | (out of reach) |
| Speed at 1M tokens vs FlashAttention | 52× faster | | | |
| SWE-Bench | 81.8% | (Opus 4.7: 87.6%) | | |

Two products live today: a 12M-token API, and SubQ Code — a CLI agent designed around the assumption that it can load your entire repo in one pass.

What "Memory Tax" Actually Means

To make today's frontier models behave as if they had a long memory, the industry built five generations of workaround:

  1. RAG (retrieval-augmented generation). Break the document into chunks, embed them as vectors, store them, retrieve the most similar chunks at query time. The agent reads 5–20 chunks, not the whole doc.
  2. Agent frameworks (CrewAI, LangGraph, sub-agents). Split tasks into specialists who pass notes to each other. Each agent sees only its slice.
  3. MIT's Recursive Language Models. Hand the prompt to the model as a file, which it searches by writing Python code. The model is now its own retrieval engine.
  4. Claude Managed Agents memory. Give the agent a folder where it saves notes between sessions; load them at start.
  5. Cross-session files (CLAUDE.md, MEMORY.md, /memories). Persistent state outside the model so the agent doesn't have to relearn you every time.

Each of these is engineering around the same fact: standard attention is too expensive to read everything at once. If SubQ's architecture holds at scale, workarounds 1–4 stop being the only option. Workaround 5 (cross-session memory) solves a different problem, continuity over time, and stays useful.
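The first workaround in the list, RAG, can be sketched in a few lines: chunk, embed, retrieve. This is a toy under loud assumptions. The `embed` function below is a bag-of-words stand-in, not a learned embedding model, and a plain Python list stands in for a vector store. All names and the sample document are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 50) -> list[str]:
    """Split text into consecutive chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical document: four seven-word sentences, so size=7 yields one per chunk.
DOC = (
    "Invoices are stored in the billing service. "
    "The billing service writes to Postgres nightly. "
    "Authentication is handled by the API gateway. "
    "The gateway issues short lived access tokens."
)

if __name__ == "__main__":
    print(retrieve("Where are invoices stored?", chunk(DOC, size=7), k=1))
```

The point of the sketch is the shape of the trade: the model only ever sees the `k` chunks that `retrieve` returns, never the whole document.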

The Cost Math That Actually Drives Adoption

The single most striking number from Subquadratic's benchmarks isn't the 12M context. It's the cost. At RULER 128K, SubQ runs the same evaluation for $8 that costs frontier models roughly $2,600. That's a ~325× difference. At full 12M tokens, the company claims compute requirements drop by approximately 1,000× compared to other frontier models attempting equivalent context.
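The ratio above is simple division, but it is worth seeing what it does to a fixed evaluation budget. The sketch below uses the two RULER 128K cost figures quoted in this article (both from Subquadratic's own benchmarks); the $10,000 budget and the function names are hypothetical.

```python
SUBQ_RULER_COST_USD = 8          # per run, as quoted in the article (first-party figure)
FRONTIER_RULER_COST_USD = 2_600  # per run, approximate, as quoted in the article


def cost_ratio(expensive: float, cheap: float) -> float:
    """How many times more the expensive option costs per run."""
    return expensive / cheap


def runs_within_budget(budget_usd: float, cost_per_run_usd: float) -> int:
    """Whole evaluation runs a fixed budget pays for."""
    return int(budget_usd // cost_per_run_usd)


if __name__ == "__main__":
    budget = 10_000  # hypothetical budget, for illustration only
    print("ratio:", cost_ratio(FRONTIER_RULER_COST_USD, SUBQ_RULER_COST_USD))
    print("frontier runs:", runs_within_budget(budget, FRONTIER_RULER_COST_USD))
    print("SubQ runs:", runs_within_budget(budget, SUBQ_RULER_COST_USD))
```

On these numbers, the same hypothetical budget covers 3 runs at the frontier price and 1,250 at the claimed SubQ price.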

In product terms: today, loading a 4-million-token monorepo into Claude or GPT is something you don't do — you tell the agent what files matter and let it ask for the rest. With SubQ, that becomes "load the repo, ask the question." The interaction model changes.

The Honest Take

We have heard "this kills the transformer" before. The graveyard:

  • Mamba (2024). State-space model, sub-quadratic. Promised the same thing. Underperformed at frontier scale and got absorbed back into hybrid architectures.
  • RWKV. RNN-Transformer hybrid. Same promise. Same outcome.
  • DeepSeek Sparse Attention. Compresses the KV cache by 98%. Excellent for memory, but it doesn't solve the linear-scaling-without-quality-loss problem.
  • Magic.dev (August 2024). Announced LTM-2-mini at 100M tokens with a 1,000× efficiency advantage. Raised ~$500M on the strength of it. As of early 2026, no public evidence of LTM-2-mini being used outside the lab. Has not shipped a competitive frontier model.

What's different about SubQ: live API access today (Magic never gave anyone this), PhDs from Meta / Google / Oxford / Cambridge, two products you can buy. What's the same: the benchmarks are first-party, and the architecture is novel enough that researchers (per VentureBeat) are publicly demanding independent proof. Claude Opus 4.7 still leads SWE-Bench at 87.6% vs SubQ's 81.8% — frontier capability and frontier context are not yet the same model.

What This Means

For builders: try SubQ Code on a side project, but don't migrate your Claude Code production setup this week. The right way to evaluate this is with workloads where context is the actual binding constraint (large monorepos, multi-document research, full-conversation analysis), comparing cost AND quality against your existing setup. The 12M context only pays back if your workflow needs it. If you're already happy at 200K, this is a thing to watch, not a thing to build on.
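One rough way to check whether context is your binding constraint is to estimate how many tokens your repo actually contains. The sketch below walks a directory and applies a four-characters-per-token heuristic. That ratio is an assumption (real tokenizers vary by model and by content), and the extension list, skipped directories and function names are all illustrative. The 200K default window is the figure mentioned above.

```python
import os

CHARS_PER_TOKEN = 4  # assumption: coarse heuristic, not any model's real tokenizer
SKIP_DIRS = {".git", "node_modules", "__pycache__"}


def estimate_repo_tokens(
    root: str,
    extensions: tuple[str, ...] = (".py", ".ts", ".go", ".md"),
) -> int:
    """Estimate tokens in matching source files under `root`."""
    total_chars = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    total_chars += len(fh.read())
    return total_chars // CHARS_PER_TOKEN


def context_is_binding(tokens: int, window: int = 200_000) -> bool:
    """True if the estimate exceeds the context window you work with today."""
    return tokens > window


if __name__ == "__main__":
    tokens = estimate_repo_tokens(".")
    print(f"~{tokens:,} tokens; exceeds 200K window: {context_is_binding(tokens)}")
```

If the estimate sits comfortably under your current window, the cost comparison matters more to you than the 12M headline.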

For everyone else: the meta-call here is whether the industry is about to spend the next 12 months un-engineering RAG. If SubQ's claims hold, every "we built our own retrieval pipeline" engineer just got pulled into a re-architecture conversation. If they don't hold, this becomes Magic 2.0 — well-funded, well-credentialed, technically interesting, commercially silent.

Half the AI tooling industry is selling you a workaround for the memory tax. If the tax goes to zero, half the industry is selling a solution to a solved problem.

Alternatives Worth Naming

  • MIT Recursive Language Models — hand the prompt to the model as a file it can search with code. Same problem, opposite philosophy.
  • DeepSeek V4 sparse attention + heavily compressed attention — 98% KV-cache reduction without the architectural rewrite. Open-weights, no hype cycle.
  • Ling-2.6 Flash (Chinese open-source) — 104B MoE / 7.4B active, agent-optimised long-context.
  • Cross-session memory (Claude /memories, CLAUDE.md) — solves a different problem (continuity over time), still useful regardless of what wins on attention scaling.
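The first alternative in this list, handing the prompt to the model as something it can search with code, can be illustrated with a toy. This is not MIT's implementation and it leaves out the recursive sub-calls entirely. No model is called: `MODEL_WRITTEN_CODE` is a hard-coded stand-in for code an LLM might generate, and every name here is hypothetical.

```python
import re


def run_model_code(code: str, context: str) -> dict:
    """Run search code with the full prompt exposed as the variable `context`."""
    namespace = {"context": context, "re": re}
    # Sketch only: a real system would sandbox model-written code before running it.
    exec(code, namespace)
    return namespace


# Stand-in for code a model might write to search its own prompt.
MODEL_WRITTEN_CODE = '''
matches = [line for line in context.splitlines() if re.search(r"timeout", line, re.I)]
answer = matches[0] if matches else None
'''

# Hypothetical long prompt: 20,000 filler lines with one relevant line in the middle.
LONG_PROMPT = "\n".join(
    [f"log line {i}: ok" for i in range(10_000)]
    + ["config: request TIMEOUT is 30 seconds"]
    + [f"log line {i}: ok" for i in range(10_000, 20_000)]
)

if __name__ == "__main__":
    result = run_model_code(MODEL_WRITTEN_CODE, LONG_PROMPT)
    print(result["answer"])
```

The design choice is the opposite of SubQ's: the long input never passes through attention at all, only the few lines the search code surfaces.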

A note worth saying out loud: don't conflate the SubQ architecture story with the SubQ company story. Even if the architecture wins, Subquadratic might not be the company that productises it. Mamba's authors didn't end up running Mamba Inc.