The Bleeding Edge

// Article · May 9, 2026

The five layers of AI agent memory

Why coding agents still have the 50 First Dates problem — and the orchestration stack that fixes it

from 2026-W18 · agents · memory · claude-code · multi-llm · beads · mem0 · practitioner-craft

Every coding agent in 2026 still has the "50 First Dates" problem. You can have a four-hour productive session with Claude Code, watch it learn your codebase, build genuine momentum — and tomorrow morning it starts from zero. The instruction file you wrote helps. The auto-memory it accumulates helps. But the fundamental amnesia is still there, and it gets worse the moment a second developer or a second tool enters the picture.

The clever framing: this isn't really a memory problem. It's five different memory problems pretending to be one. And conflating them is why most teams' "AI memory strategy" is a single bloated CLAUDE.md file that everyone has stopped reading.


Part 1: Memory

The Core Claim

Memory for coding agents is five distinct concerns. Each has its own home, its own format, and its own update cadence. Skipping any one creates a category of pain the others can't fix.

| Layer | What it holds | Where it lives | Cadence |
|---|---|---|---|
| L1: Project rules | Stack, conventions, "always do X" | AGENTS.md in repo root, committed | Weekly / on-PR |
| L2: Tool-specific config | MCP setup, slash commands, tool features | CLAUDE.md, GEMINI.md, .github/copilot-instructions.md | Rare |
| L3: Personal context | Individual preferences, sandbox URLs | *.local.md (gitignored), user-global ~/.<tool>/... | When it bugs you |
| L4: Task & dependency state | What to work on, what's blocked, who claimed what | Beads (.beads/) committed to repo | Every session |
| L5: Cross-session knowledge | Decisions, learned facts, "we tried X and it broke" | Mem0 (or Zep/Letta) via MCP | Continuous |

L1 + L2 are instructions. L3 is personal overrides. L4 is work state. L5 is episodic knowledge. You need all five.

The Architectural Decisions That Fall Out

  1. AGENTS.md is the source of truth. It's now governed by the Linux Foundation's Agentic AI Foundation. Codex CLI, Copilot, Cursor, Windsurf, Aider, Devin, Warp, and Antigravity (v1.20.3+) read it natively. Claude Code and Gemini CLI need a one-line bridge file or a symlink. One file. No duplication.

  2. Symlink the tool-specific files to AGENTS.md. CLAUDE.md, GEMINI.md, WARP.md, .github/copilot-instructions.md — all symlinked. The moment you maintain two files with overlapping content, drift starts and the whole stack rots. This is the single highest-leverage architectural decision.

  3. Beads for task state, not Claude Code Tasks. Claude Code Tasks (introduced 2026, explicitly inspired by Beads) is excellent but locked to Claude Code. With multiple agents in play, you need git-synced state visible to all of them.

  4. Mem0 for cross-session knowledge. Has an MCP server, works with every tool, most mature option. Zep is the alternative for temporal reasoning. Letta if you want to rebuild around memory entirely.

  5. One repo, one standard file tree. Every repo gets the same scaffolding. Make a create-project script. Treat any deviation as a bug.
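To make decisions 2 and 5 concrete, here is a minimal sketch of what a create-project script might do. The file names come from the list above; the function name `scaffold`, the placeholder AGENTS.md content, and the choice to skip existing files rather than overwrite them are assumptions for illustration, not a prescribed layout.

```python
import os
from pathlib import Path

# Tool-specific files named in the article; each becomes a symlink to AGENTS.md.
BRIDGE_FILES = ["CLAUDE.md", "GEMINI.md", "WARP.md", ".github/copilot-instructions.md"]


def scaffold(repo_root: Path) -> list[Path]:
    """Create AGENTS.md (if missing) and symlink every bridge file to it."""
    repo_root.mkdir(parents=True, exist_ok=True)
    agents = repo_root / "AGENTS.md"
    if not agents.exists():
        agents.write_text("# Project rules\n")  # placeholder content (assumption)

    created = []
    for name in BRIDGE_FILES:
        link = repo_root / name
        link.parent.mkdir(parents=True, exist_ok=True)
        if link.is_symlink() or link.exists():
            continue  # never overwrite; an existing file is a deviation to review by hand
        # Use a relative target so the link still resolves after the repo is cloned elsewhere.
        target = os.path.relpath(agents, start=link.parent)
        link.symlink_to(target)
        created.append(link)
    return created
```

Running `scaffold(Path("my-repo"))` a second time creates nothing, which makes the script safe to re-run as a drift check. Symlinks need extra care on Windows checkouts, so a team with Windows developers may prefer the one-line bridge file instead.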

The Two Categories Most People Conflate

Task state != knowledge memory. This is the most important distinction in the entire space and it gets blurred constantly.

  • Task state (Beads): "bd-42 is blocked by bd-37, claimed by Alice's agent, discovered from bd-19"
  • Knowledge memory (Mem0): "We tried Redis Streams for the queue and it broke under load"
  • Instructions (AGENTS.md): "Use pnpm. Tests required for new code. Don't edit packages/generated/."

Most "AI memory" products solve the second category and call it "memory." That's true but partial. The first category is what actually changes whether agents can pick up where they left off across sessions and across each other.

Beads — The Honest Take

Beads is Steve Yegge's CLI issue tracker that gives AI coding agents persistent task state across sessions. It stores work as a dependency-aware graph in a Dolt SQL database synced via JSONL files in git. Hash-based IDs prevent merge collisions. bd ready --json returns unblocked work in priority order.
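The idea behind "unblocked work in priority order" can be sketched in a few lines. This is an illustration of the concept only, not Beads' implementation or schema: the `Issue` fields, the two-value status, and the "lower number is more urgent" convention are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    id: str
    priority: int                      # lower number = more urgent (assumption)
    status: str = "open"               # "open" or "closed" (simplified)
    blocked_by: list[str] = field(default_factory=list)


def ready(issues: list[Issue]) -> list[Issue]:
    """Return open issues whose blockers are all closed, most urgent first."""
    closed = {i.id for i in issues if i.status == "closed"}
    unblocked = [
        i for i in issues
        if i.status == "open" and all(dep in closed for dep in i.blocked_by)
    ]
    return sorted(unblocked, key=lambda i: (i.priority, i.id))


issues = [
    Issue("bd-19", priority=1, status="closed"),
    Issue("bd-37", priority=2, blocked_by=["bd-19"]),
    Issue("bd-42", priority=0, blocked_by=["bd-37"]),
]
print([i.id for i in ready(issues)])  # ['bd-37']
```

Note that bd-42 has the highest priority but is not returned, because its blocker bd-37 is still open. That is the difference between a dependency graph and a sorted to-do list.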

What it offers that no built-in agent memory does:

  • Queryable dependency graph
  • Cross-tool persistence (any agent reads it)
  • Atomic claim semantics for multi-agent work
  • Team-shared work state via git
  • Audit trail through git history
  • Discovery linkage (--deps discovered-from:bd-X)

The limitations nobody warns you about:

Agents don't reach for it unprompted. Yegge himself calls it "a leaky abstraction." Instructions in AGENTS.md lose weight as context grows mid-session. Mitigation: install the beads-mcp MCP server so agents see it as a first-class tool, not a CLI to forget. Wire bd sync into git hooks. Build "land the plane" as a single command.

Sync conflicts and occasional data loss. Yegge has publicly documented sessions where issues vanished during merges. Unpushed work in multi-agent setups causes severe conflicts. Mitigation: bd doctor --fix on a schedule. Treat git push as part of done. Use Dolt server mode for real concurrent writes.

Scale ceiling around 500 open issues. When agents read issues.jsonl directly with jq, they hit ~25K token limits. Mitigation: run bd admin compact regularly, and force agents to use bd ready --json rather than cat issues.jsonl.
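One way to enforce the "query, don't cat" rule is to give agents a thin wrapper that only ever returns a handful of ready items. The sketch below shells out to `bd ready --json`, the command named above. It assumes the output is a JSON array, which is not confirmed here; `next_work`, `run_bd_ready`, and the `limit` parameter are hypothetical names.

```python
import json
import subprocess
from typing import Callable


def run_bd_ready() -> str:
    """Run `bd ready --json` and return its stdout."""
    result = subprocess.run(
        ["bd", "ready", "--json"], capture_output=True, text=True, check=True
    )
    return result.stdout


def next_work(limit: int = 5, runner: Callable[[], str] = run_bd_ready) -> list:
    """Return at most `limit` ready items, assuming the output is a JSON array."""
    items = json.loads(runner())
    if not isinstance(items, list):
        raise ValueError("expected a JSON array from bd ready --json")
    return items[:limit]
```

The `runner` argument exists so the wrapper can be exercised without Beads installed, for example `next_work(2, runner=lambda: '[{"id": "bd-1"}]')`.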

Project maturity. Yegge himself describes the architecture as "crummy by pre-AI standards, requires AI to work around its edge cases." Daily releases. Single-maintainer bus factor. Mitigation: pin versions, upgrade weekly not daily.

Wrong tool for wrong work. Beads is for the "current week", not PRDs or roadmaps. Stuffing roadmap items into it destroys the bd ready signal.

Granularity is on you. Issues over ~2 minutes of work cause agent context rot mid-task.

The meta-point worth saying out loud: every Beads limitation has a workaround, but most workarounds are human discipline, not tooling. If your team will enforce "land the plane" rituals and run bd doctor regularly, Beads earns its keep. If you're hoping to install it and forget about it, you'll get burned.

The Pilot Plan

Setup: One repo, two weeks. 2–3 developers who already use Claude Code or an equivalent heavily. Decide upfront whether Beads is a replacement for your tracker or an agent-only memory layer that mirrors to Jira.

Week 1: Single-agent workflows. Each dev runs sessions using bd ready to pick work and "land the plane" to close. Granularity audit on day 5.

Week 2: Stress tests deliberately designed to trigger known failure modes:

  • Concurrent claim test (two devs claim same issue within a minute)
  • Cross-session handoff (Dev A files issues, Dev B picks them up next day)
  • "Forgot to push" recovery (deliberately end without git push, see how painful resolution is)
  • Data loss audit (compare bd list --status all --json | jq 'length' across machines — must match)
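Comparing lengths catches missing issues, but two machines can hold the same number of issues and still disagree on which ones. A slightly stricter audit compares the ID sets. The sketch below takes two JSON dumps (for example, the output of the `bd list --status all --json` command above, saved on each machine) and assumes each entry is an object with an `id` field; that field name is an assumption about the output shape.

```python
import json


def audit(dump_a: str, dump_b: str) -> dict:
    """Compare two JSON issue dumps and report IDs present on only one machine."""
    ids_a = {issue["id"] for issue in json.loads(dump_a)}
    ids_b = {issue["id"] for issue in json.loads(dump_b)}
    return {
        "count_a": len(ids_a),
        "count_b": len(ids_b),
        "only_on_a": sorted(ids_a - ids_b),
        "only_on_b": sorted(ids_b - ids_a),
        "match": ids_a == ids_b,
    }
```

A result with equal counts but `match` set to false is exactly the case the length check would miss.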

The single most predictive metric: unprompted query rate. If agents aren't reaching for bd ready --json on their own by end of week 2, no amount of tooling will save the rollout.

Go/no-go: Green at ≥70% unprompted queries, zero unrecovered data loss, ≥90% land-the-plane completion. Red on any unrecovered data loss.
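Writing the thresholds down as code before the pilot starts keeps the verdict from drifting after the fact. A minimal sketch follows. The three inputs and the green and red rules come from the criteria above; the "undecided" label for everything in between is an assumption, since no middle state is named.

```python
def pilot_verdict(unprompted_rate: float, unrecovered_losses: int, landing_rate: float) -> str:
    """Apply the go/no-go thresholds; rates are fractions between 0 and 1."""
    if unrecovered_losses > 0:
        return "red"          # any unrecovered data loss is disqualifying
    if unprompted_rate >= 0.70 and landing_rate >= 0.90:
        return "green"
    return "undecided"        # assumption: the middle state is not named in the criteria


print(pilot_verdict(0.75, 0, 0.95))  # green
print(pilot_verdict(0.90, 1, 1.00))  # red
print(pilot_verdict(0.60, 0, 0.95))  # undecided
```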

Alternatives Worth Naming

For task state:

  • Claude Code Tasks (great but locked to Claude Code, ~70% of Beads' value)
  • GitHub Issues + gh CLI (works for humans but lacks ready semantics)
  • Linear / Jira via MCP (pragmatic if your org already runs them)

For knowledge memory:

  • Mem0 (most mature, MCP server)
  • Zep with Graphiti (temporal knowledge graph, ~15 points better on LongMemEval for time-sensitive reasoning)
  • Letta, formerly MemGPT (rebuild around memory entirely; Letta Code shipped March 2026)
  • Anthropic's Memory Tool (managed /memories directory but only via API, not Claude Code)

Part 2: Multi-LLM Orchestration

Anthropic's Claude Code Max plan is $200/month. Z.AI's GLM Coding Plan is $18. Z.AI self-reports that GLM-5.1 reaches 94.6% of Claude Opus's coding performance; independent benchmarks suggest 75–85% on real-world tasks. So the question listeners are quietly asking: do I really need to keep paying Anthropic?

The honest answer is more interesting than yes or no. The right framing is: which 20% of your work needs Opus, and which 80% can run on something a tenth the cost? Get that allocation right and you cut your spend by 60–70% while preserving quality where it actually matters. Get it wrong and you spend three hours undoing a confident-but-wrong architectural answer from a cheap model that should never have been asked the question.

Why This Connects Back to Memory

Multi-LLM orchestration is impossible without the memory stack from Part 1. When you swap from Claude Opus to GLM-5.1 mid-task, the new model gets nothing the old one learned during the session except what's in visible context. KV caches don't transfer. Auto-memory doesn't transfer.

But Beads state does transfer because it's a queryable graph in git. Mem0 does transfer because it's a remote MCP server any model can hit. AGENTS.md does transfer because it's project state, not session state.

The memory stack isn't just a nice-to-have alongside multi-model routing. It's what makes multi-model routing functional rather than chaotic. Without it, every model swap is a context reset.

The Three Patterns That Actually Work

Pattern 1: Subscription substitution. Replace Anthropic Max with Z.AI's GLM Coding Plan. Set ANTHROPIC_BASE_URL and ANTHROPIC_MODEL env vars and Claude Code talks to GLM. Skills, MCP, subagents, hooks all keep working. Best when you want one cheaper model for everything.
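As a rough sketch of Pattern 1, the helper below builds an environment with the two variables named above, plus ANTHROPIC_DEFAULT_HAIKU_MODEL, which the failure-modes list later says not to forget. The function name `glm_env` is hypothetical, and the URL and model strings in the usage note are placeholders: take the real values from your provider's documentation. Credentials are deliberately left out.

```python
import os
from typing import Optional


def glm_env(base_url: str, model: str, haiku_model: Optional[str] = None) -> dict[str, str]:
    """Return a copy of the current environment with Claude Code pointed at another provider."""
    env = dict(os.environ)
    env["ANTHROPIC_BASE_URL"] = base_url
    env["ANTHROPIC_MODEL"] = model
    if haiku_model is not None:
        env["ANTHROPIC_DEFAULT_HAIKU_MODEL"] = haiku_model
    return env
```

A launcher script could then call `subprocess.run(["claude"], env=glm_env("https://provider.example/anthropic", "your-model-id", "your-small-model-id"))`. Building a copy rather than mutating `os.environ` keeps the override scoped to that one process, so switching back to Anthropic is just a matter of launching without the wrapper.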

Pattern 2: Routed hybrid via Claude Code Router (CCR). @musistudio/claude-code-router sits between Claude Code and any combination of providers. Standard four-tier routing:

  • default → GLM 4.7 (everyday work)
  • background → DeepSeek (silent tool calls and file scans)
  • think → Kimi K2 Thinking (multi-step reasoning)
  • longContext → DeepSeek or local Gemma (>100K tokens)
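The four tiers above amount to a small decision function. The sketch below illustrates that logic only; it is not CCR's implementation or its config format. The tier names and the 100K-token threshold come from the list, while the `Request` fields and the precedence order (long context first, then background, then reasoning) are assumptions.

```python
from dataclasses import dataclass

# Tier -> model, copied from the list above. These are labels, not provider model IDs.
ROUTES = {
    "default": "GLM 4.7",
    "background": "DeepSeek",
    "think": "Kimi K2 Thinking",
    "longContext": "DeepSeek or local Gemma",
}

LONG_CONTEXT_TOKENS = 100_000


@dataclass
class Request:
    tokens: int
    is_background: bool = False
    needs_reasoning: bool = False


def pick_tier(req: Request) -> str:
    """Choose a tier; the precedence order here is an assumption."""
    if req.tokens > LONG_CONTEXT_TOKENS:
        return "longContext"
    if req.is_background:
        return "background"
    if req.needs_reasoning:
        return "think"
    return "default"


print(ROUTES[pick_tier(Request(tokens=2_000))])                        # GLM 4.7
print(ROUTES[pick_tier(Request(tokens=2_000, needs_reasoning=True))])  # Kimi K2 Thinking
```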

Pattern 3: Tier-by-phase. Anthropic Pro for the hard 20%. GLM Coding Plan for the daily 80%. Don't try to route automatically — devs switch tools consciously. Total ~$40/month vs. $200/month Max.

What Each Model Is Actually Good At (April 2026)

| Model | Best for | Avoid for | Cost |
|---|---|---|---|
| Claude Opus 4.6/4.7 | Hard architecture, gnarly debugging, long agentic chains | Routine implementation | $$$$ |
| Claude Sonnet 4.6 | General default if paying Anthropic anyway | | $$$ |
| GLM-5.1 | Coding-heavy daily work, refactors, near-Opus quality | Frontier reasoning, novel domains | $ |
| GLM-4.7 | "Competent junior dev" workhorse | Architecture decisions | $ |
| Kimi K2 / K2.5 / K2 Thinking | Multi-step reasoning, long-horizon planning | Speed-critical interactive work | $$ |
| DeepSeek V3.2 / V4 | Cheap bulk work, file scans, background tasks | Anything requiring nuance | cents |
| Qwen 3.5/3.6 | Long context (1M+), Chinese-language work | Frontier coding | $ |

The Caveats That Belong in the Conversation

  • Vendor benchmarks lie politely. GLM-5.1's "94.6% of Opus" is self-reported using Claude Code as the testing harness — home-field advantage.
  • Claude Code is not actually model-agnostic. The same model performs better through OpenCode or Cline than through Claude Code.
  • The 16K system prompt cache trap. With a 16K-token system prompt resent on every request, providers that don't cache prompts multiply input costs by 4–5x.

Failure Modes That Bite

  • Don't break tool calling (MCP servers stop working with malformed JSON).
  • Don't switch models mid-conversation when context is dense.
  • Don't violate Z.AI's ToS (GLM Coding Plan restricted to supported tools).
  • Don't cheap out on architecture decisions — hard rule: design questions go to Opus.
  • Don't forget to set ANTHROPIC_DEFAULT_HAIKU_MODEL.
  • Don't bet a deadline on it before a month of low-stakes use.

Per developer: ~$60/month total

  • Anthropic Pro (~$20) for the hard 20%
  • GLM Coding Plan Lite ($18) for the daily 80%
  • OpenRouter prepaid ($20 buffer) for overflow

Compare that with $200 for Max alone. Quality on the bottom 80% of the work is essentially indistinguishable.
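The arithmetic behind the per-developer figure is worth checking once. The prices are the ones listed above; the variable names are just for illustration.

```python
MAX_PLAN = 200  # Anthropic Max, USD per month

budget = {
    "Anthropic Pro": 20,
    "GLM Coding Plan Lite": 18,
    "OpenRouter prepaid buffer": 20,
}

total = sum(budget.values())
savings = 1 - total / MAX_PLAN

print(total)             # 58
print(f"{savings:.0%}")  # 71%
```

So the "~$60" budget is $58, a saving of about 71% against Max. Dropping the OpenRouter buffer gives the $38 behind the "~$40/month" figure quoted for Pattern 3.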


The Single Sentence

Memory is what turns a collection of cheap models into a coherent team; without it, you're paying less to lose more.

Most teams don't have a memory problem — they have a discipline problem. Adding more memory tools to a team that doesn't use the ones they have just adds entropy.

But a solo developer with a serious memory stack and intelligent routing operates at the productivity of a 3-person team from 2024. The compounding is real, the discipline tax is real, and most teams haven't figured this out yet — which is the opportunity.