
By builders, for builders.

A Microsoft publication

Disposable agents, durable memory: The architecture behind Squad

What we learned building Squad: a file-backed, human-led agent team where memory is inspectable, orchestration is explicit, and governance moves from prompts into code.

Make the agents disposable. Keep the memory in Git.

The interesting part of agentic development is no longer whether a model can write code. It can. The interesting part is what happens after the third agent, the seventh pull request, the first failed review, the first context compaction bug, and the first time two agents confidently write to the same file at once.

This is the story of Squad, but not as a product tour. It’s the architecture Brady and Tamir backed into while trying to make agent teams useful without making them mystical: Agents are disposable, memory is durable, Git is the coordination layer, and governance belongs in code whenever the prompt isn’t strong enough to be trusted. Which, as it turns out, is often.

Giving agents agency and watching them hack one another

Squad Places is our social media-style testing ground—a demo app where agent squads post, comment, and interact to stress-test multi-agent coordination at scale.

Brady went to get a seltzer after getting Places up and running, with four other squads happily making posts. Walking away was probably unwise. When he came back, the squads had implemented commenting in Squad Places.

That sounds like a magic trick. It wasn’t. A few hours earlier, Brady had pointed a handful of squads at the Squad Places API and told them to enjoy the social network he’d created for them. They created fake accounts, hammered endpoints, reposted garbage, flooded messages, and generally speedran the abuse patterns you discover five minutes after launch. Then the platform got a second kind of pressure: Other agent teams started posting structured product feedback inside Squad Places itself, and the Squad Places team started fixing what hurt.

Multiple windows showing Squad Places, GitHub commits, and agent session reports during a stress test
Squad Places artifact page showing an API contract review from The Wire squad
Squad Places comments thread beneath an API contract review artifact
Squad Places feed sorted by most discussed artifacts, with squad filters visible

This is the part worth paying attention to. The Wire (another Squad working on a marketing tool) audited all 11 API endpoints and called out missing pagination envelopes, rate-limit headers that only appeared on errors, and the lack of page and pageSize support. The same squad flagged feed organization problems, tag fragmentation, and documentation that was too vague for client generation. Breaking Bad (a third Squad working on some other project) pointed at a UX problem with raw Markdown rendering as plaintext. Those reviews didn’t disappear into a chat log. They turned into commits.

| Feedback source | What they found | What we shipped | Commit |
| --- | --- | --- | --- |
| The Wire (ACCES) | Feed has no sorting, filtering, or content discovery; raw Markdown not rendered | Sort controls (Latest/Most Discussed), squad filter dropdown, Markdown rendering | b9746df, 246b01e |
| The Wire (ACCES) | 159 unique tags across 66 artifacts with inconsistent delimiters, casing mismatches, and fragmentation | Clickable tag filtering with /?tag= URL query support | 246b01e |
| The Wire (ACCES) | API missing pagination envelope, rate-limit headers only on errors, no page/pageSize parameters | Pagination (20 per page with Primer CSS controls), query parameters, rate-limit headers on all responses | 246b01e |
| Breaking Bad | Raw Markdown displayed as plaintext, content hard to scan and parse | Markdown rendering via Markdig with XSS sanitization | 246b01e |
| The Wire (ACCES) | API endpoint descriptions too vague for TypeScript client generation | Enriched all 11 endpoint descriptions with context, intent, and workflow | 97345d7 |

Within roughly two hours, the loop closed: feedback post → comment thread → commit → deployed feature. Additional infrastructure landed too: external HTTP endpoints for agent access, relaxed rate limits for multi-agent usage, and 26 Playwright end-to-end tests to keep the expanding surface stable.

Then, because the squads were communicating so well together, Brady stepped away for 60 seconds to get a refreshing beverage. When he came back, commenting had shipped.

The point here isn’t that “agents are magic.” It’s that the system had enough structure for useful work to emerge from friction: scoped agents, durable decisions, inspectable artifacts, pull requests, and humans still accountable for what merged.

Also, we made a bit of a mess in the car during the road trip.

Good systems usually start that way.

The core bet: Don’t preserve the agent. Preserve the work

Most agent systems start by asking how to make the agent remember more. Squad started working when we inverted the question.

Don't preserve the agent. Preserve the work.

An agent instance should be cheap to spawn and safe to destroy. The memory that matters should live somewhere a human can inspect, diff, blame, review, compact, archive, and revert. Tamir’s opinion: That’s the repository.

The first useful shape Tamir implemented looked like this:

    human intent
      ↓
    coordinator resolves team + routing
      ↓
    agent spawn reads:
      - its charter
      - team decisions
      - its own history
      - current focus
      - relevant skills
      ↓
    agent does scoped work
      ↓
    agent writes artifacts back:
      - code/docs/tests
      - decisions
      - history learnings
      - skills when patterns stabilize
      ↓
    agent exits
      ↓
    next spawn reconstructs continuity from files

That’s the whole trick. The process is transient. The written trail is not.
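To make the spawn-and-exit loop concrete, here is a minimal sketch of reconstructing context from files. It is not Squad's implementation: the repository is modeled as an in-memory path-to-content record, and `reconstructContext` and `appendHistory` are hypothetical names invented for illustration.

```typescript
// Hypothetical sketch: rebuild an agent's working context from files under .squad/.
// The repo is a plain path -> content record so the example runs without a filesystem.
type RepoFiles = Record<string, string>;

interface SpawnContext {
  agent: string;
  loaded: string[]; // paths that existed and were read, in load order
  text: string;     // concatenated context handed to the fresh agent
}

function reconstructContext(files: RepoFiles, agent: string): SpawnContext {
  // Load order follows the flow above: charter, team decisions, own history, current focus.
  const candidates = [
    `.squad/agents/${agent}/charter.md`,
    ".squad/decisions.md",
    `.squad/agents/${agent}/history.md`,
    ".squad/identity/now.md",
  ];
  const loaded = candidates.filter((path) => files[path] !== undefined);
  const text = loaded.map((path) => `## ${path}\n${files[path]}`).join("\n\n");
  return { agent, loaded, text };
}

// When an agent exits, the only thing that survives is what it wrote back.
function appendHistory(files: RepoFiles, agent: string, learning: string): RepoFiles {
  const path = `.squad/agents/${agent}/history.md`;
  const previous = files[path] ?? "";
  return { ...files, [path]: `${previous}- ${learning}\n` };
}

const initialFiles: RepoFiles = {
  ".squad/agents/keaton/charter.md": "# Keaton, Lead",
  ".squad/decisions.md": "### Hook-based governance over prompt instructions",
};

const firstSpawn = reconstructContext(initialFiles, "keaton");
const filesAfterWork = appendHistory(
  initialFiles,
  "keaton",
  "List endpoints need a pagination envelope.",
);
const secondSpawn = reconstructContext(filesAfterWork, "keaton");
```

The second spawn has never met the first one. It picks up the learning only because the history file now exists, which is the point of keeping continuity in files rather than in the process.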

When you run squad init, the important artifact isn’t a daemon. It’s .squad/:

    .squad/
    ├── team.md                 # roster and roles
    ├── routing.md              # dispatch rules
    ├── decisions.md            # shared team decisions
    ├── decisions/inbox/        # drop-box for parallel decision writes
    ├── agents/
    │   └── {name}/
    │       ├── charter.md      # identity, expertise, boundaries
    │       └── history.md      # project-specific memory
    ├── skills/                 # promoted reusable patterns
    ├── identity/
    │   ├── now.md              # current focus
    │   └── wisdom.md           # durable operating principles
    ├── orchestration-log/      # what spawned, why, and what happened
    └── log/                    # session traces and diagnostics

Commit it. That’s the part people either love immediately or find suspicious until the first time they debug an agent decision with git diff.

Later, Microsoft Senior Content Developer Dina Berry added a storage abstraction with SQLite and Azure Storage implementations behind the scenes for durability and scale—but the agent-facing contract never changed. It stayed files, readable by humans, versioned by Git, debuggable with a diff. A persistent hidden memory store can be useful. It can also quietly rot. A Markdown decision file is embarrassingly inspectable. That embarrassment is a feature.

The “work done” with Squad Places made it stronger

Let’s tie these lessons back to our opener: the story of multiple Squads trying to hack Places together. We deliberately didn’t harden Places so we could see what they would do. They were notorious. We logged it all, and everything we logged went back to the Places squad, which worked through dozens of issues and a handful of pull requests, adding GitHub authentication, content filtering, and all the trimmings. In the Places saga, the data representing all the “hackery” the squads tried became the next wave of work. That content showed us what agents could do in the worst-case scenario, and the logs and output of their attempts became fodder for making the system more secure.

Charters are prompts, but also contracts

A Squad agent isn’t just a name slapped on a system prompt. Each agent has a charter.md that defines the work it owns, the work it refuses, its collaboration rules, and its review posture. A simplified charter template looks like this:

    # {Name} — {Role}

    ## Identity
    - **Name:** {Name}
    - **Role:** {Role title}
    - **Expertise:** {2-3 specific skills}
    - **Style:** {communication style}

    ## What I Own
    - {Area of responsibility 1}
    - {Area of responsibility 2}

    ## Boundaries
    **I handle:** {types of work this agent does}
    **I don't handle:** {types of work that belong to other team members}
    **When I'm unsure:** I say so and suggest who might know.

    ## Collaboration
    Before starting work, read `.squad/decisions.md`.
    After making a decision others should know, write it to
    `.squad/decisions/inbox/{my-name}-{brief-slug}.md`.
    The Scribe will merge it.

That last paragraph is doing more than it looks like. It makes the decision path explicit. Agents don’t all append to the canonical shared brain at once. They write drop files. A merge layer reconciles.

The current SDK repo’s squad.config.ts defines a 21-agent team spanning roles like Lead, Prompt Engineer, Core Dev, Tester, DevRel, SDK Expert, TypeScript Engineer, Security, Release, Distribution, Node.js Runtime, VS Code Extension, Observability, CLI UX, TUI, E2E, Accessibility, Dogfooding—plus dedicated roles for graphic design and the interactive shell. That sounds like theater until routing starts working. Then it feels more like an org chart encoded in files.

Here’s the SDK-first version of the same idea:

    import {
      defineSquad,
      defineTeam,
      defineAgent,
      defineRouting,
      defineCasting,
    } from '@bradygaster/squad-sdk';

    export default defineSquad({
      version: '1.0.0',
      team: defineTeam({
        name: 'squad-sdk',
        description: 'The programmable multi-agent runtime for GitHub Copilot.',
        members: ['keaton', 'verbal', 'fenster', 'hockney', 'mcmanus', 'kujan'],
      }),
      agents: [
        defineAgent({
          name: 'keaton',
          role: 'Lead',
          description: 'Architect, scope-holder, the one who sees the whole board.',
          status: 'active',
        }),
        defineAgent({
          name: 'kujan',
          role: 'SDK Expert',
          description: 'The one who understands the Copilot SDK inside and out.',
          status: 'active',
        }),
      ],
      routing: defineRouting({
        rules: [
          {
            pattern: 'sdk-integration',
            agents: ['@kujan'],
            description: '@github/copilot-sdk usage, session lifecycle, event handling',
          },
          {
            pattern: 'architecture',
            agents: ['@keaton'],
            description: 'Product direction, architectural decisions, code review, scope',
          },
        ],
        defaultAgent: '@keaton',
        fallback: 'coordinator',
      }),
      casting: defineCasting({
        allowlistUniverses: ['The Usual Suspects', 'Breaking Bad', 'The Wire', 'Firefly'],
        overflowStrategy: 'generic',
      }),
    });

Run squad build, and the generated .squad/ files become the same inspectable operating record. TypeScript gives you composition and validation. Markdown gives you reviewability. Tamir wanted both.
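To show what a routing table like the one above buys you, here is a rough sketch of rule-based dispatch. This is not the SDK's matcher: `resolveAgents` is a hypothetical helper, and it assumes a rule matches only when its `pattern` equals the work-type label exactly. The real matching semantics may differ.

```typescript
// Hypothetical sketch of rule-based dispatch; not the Squad SDK's actual matcher.
interface RoutingRule {
  pattern: string;     // work-type label, e.g. "sdk-integration"
  agents: string[];    // handles to dispatch to, e.g. ["@kujan"]
  description: string;
}

interface RoutingTable {
  rules: RoutingRule[];
  defaultAgent: string;
}

// Assumption: exact match on the pattern label, otherwise fall back to the default agent.
function resolveAgents(table: RoutingTable, workType: string): string[] {
  const rule = table.rules.find((candidate) => candidate.pattern === workType);
  return rule ? rule.agents : [table.defaultAgent];
}

const routingTable: RoutingTable = {
  rules: [
    {
      pattern: "sdk-integration",
      agents: ["@kujan"],
      description: "@github/copilot-sdk usage, session lifecycle, event handling",
    },
    {
      pattern: "architecture",
      agents: ["@keaton"],
      description: "Product direction, architectural decisions, code review, scope",
    },
  ],
  defaultAgent: "@keaton",
};

const sdkOwners = resolveAgents(routingTable, "sdk-integration");
const unmatchedOwners = resolveAgents(routingTable, "release-notes");
```

Because the table is data, a routing change shows up as a reviewable diff instead of a behavior change nobody can point to.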

One thing to flag before anyone closes the tab thinking they need to learn an SDK to use this: Most people never write that config by hand. You don’t need the SDK to use Squad. Open GitHub Copilot—in the CLI or in VS Code. Talk to the coordinator agent, and it writes .squad/ for you. The SDK is for the people building on top of Squad: programmatic team composition, custom routing rules, embedding squads inside other tooling. If you just want a team of agents in your repo, squad init plus Copilot is the whole path.

The spawn prompt is deliberately boring

The coordinator doesn’t rely on vibes. It spawns an agent with a prompt that inlines the charter and points at the durable state. The real template is longer because it has to handle CLI, VS Code, worktrees, Git notes, orphan-branch state, and two-layer state. But the important part is this:

    You are {Name}, the {Role} on this project.

    YOUR CHARTER:
    {paste contents of .squad/agents/{name}/charter.md here}

    TEAM ROOT: {team_root}
    All `.squad/` paths are relative to this root.

    Read .squad/agents/{name}/history.md.
    Read .squad/decisions.md.
    If .squad/identity/wisdom.md exists, read it.
    If .squad/identity/now.md exists, read it.
    Check .squad/skills/ for relevant SKILL.md files.

    INPUT ARTIFACTS: {list exact files}

    The user says: "{message}"

    Do the work. Respond as {Name}.

    AFTER work:
    1. Append durable learnings to your history.
    2. If you made a team-relevant decision, write:
       .squad/decisions/inbox/{name}-{brief-slug}.md

This is not elegant. It is explicit. Explicit wins.
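Explicit also means mechanical. As an illustration, the template above can be filled in by a plain function with no cleverness in it. The `SpawnRequest` shape and `renderSpawnPrompt` name below are assumptions made for this sketch, and the real template is longer.

```typescript
// Hypothetical sketch: fill the spawn template from structured input.
interface SpawnRequest {
  agentId: string;          // directory name under .squad/agents/, e.g. "kujan"
  displayName: string;      // name the agent responds as, e.g. "Kujan"
  role: string;
  charter: string;          // contents of the agent's charter.md
  teamRoot: string;
  inputArtifacts: string[]; // exact files the agent should look at
  message: string;          // what the user asked for
}

function renderSpawnPrompt(request: SpawnRequest): string {
  const lines = [
    `You are ${request.displayName}, the ${request.role} on this project.`,
    "",
    "YOUR CHARTER:",
    request.charter.trim(),
    "",
    `TEAM ROOT: ${request.teamRoot}`,
    "All `.squad/` paths are relative to this root.",
    "",
    `Read .squad/agents/${request.agentId}/history.md.`,
    "Read .squad/decisions.md.",
    "If .squad/identity/wisdom.md exists, read it.",
    "If .squad/identity/now.md exists, read it.",
    "Check .squad/skills/ for relevant SKILL.md files.",
    "",
    "INPUT ARTIFACTS:",
    ...request.inputArtifacts.map((path) => `- ${path}`),
    "",
    `The user says: "${request.message}"`,
    "",
    `Do the work. Respond as ${request.displayName}.`,
    "",
    "AFTER work:",
    "1. Append durable learnings to your history.",
    "2. If you made a team-relevant decision, write:",
    // The slug stays a placeholder: the agent chooses it when it records a decision.
    `   .squad/decisions/inbox/${request.agentId}-{brief-slug}.md`,
  ];
  return lines.join("\n");
}

const kujanPrompt = renderSpawnPrompt({
  agentId: "kujan",
  displayName: "Kujan",
  role: "SDK Expert",
  charter: "# Kujan, SDK Expert\n",
  teamRoot: "/work/squad-sdk",
  inputArtifacts: ["src/session.ts"],
  message: "Review the session lifecycle handling.",
});
```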

We learned this the hard way in the VS Code path. At one point, the coordinator prompt had grown past 2,000 lines (~60KB), and the routing rule was buried under enough ceremony, reference material, and duplicated templates that the coordinator sometimes did the work inline instead of dispatching it. The failure wasn’t that the model was dumb. The failure was that we gave it an overstuffed instruction hierarchy and then acted surprised when the center of gravity moved.

The fix became a decision in the repo: platform-neutral enforcement language at the top and bottom of the prompt.

You are a DISPATCHER, not a DOER. Every task that needs domain expertise MUST be dispatched to a specialist agent.

That sentence isn’t interesting because it’s clever. It’s interesting because it replaced tool-specific wording with role identity plus a testable behavior. CLI dispatch uses one mechanism. VS Code dispatch uses another. The rule stays the same.
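One way to picture "top and bottom of the prompt" plus "a testable behavior" is the sketch below. Both helpers are hypothetical. In particular, `violatesDispatcherRule` assumes you can observe, per coordinator turn, which agents were dispatched to and which files the coordinator edited itself.

```typescript
// Hypothetical sketch: frame the coordinator prompt with the dispatcher rule on both ends.
const DISPATCHER_RULE =
  "You are a DISPATCHER, not a DOER. Every task that needs domain expertise " +
  "MUST be dispatched to a specialist agent.";

function framePrompt(body: string): string {
  // The rule appears first and last so it is not buried by reference material.
  return [DISPATCHER_RULE, body.trim(), DISPATCHER_RULE].join("\n\n");
}

// Assumed observation of one coordinator turn; the field names are illustrative.
interface CoordinatorTurn {
  dispatchedTo: string[]; // specialist agents the coordinator handed work to
  filesEdited: string[];  // files the coordinator changed itself
}

// The rule is platform-neutral: it checks behavior, not which dispatch tool was used.
function violatesDispatcherRule(turn: CoordinatorTurn): boolean {
  return turn.filesEdited.length > 0 && turn.dispatchedTo.length === 0;
}

const framedPrompt = framePrompt("...routing table, team roster, reference material...");
const inlineWork = violatesDispatcherRule({ dispatchedTo: [], filesEdited: ["src/auth.ts"] });
const properDispatch = violatesDispatcherRule({ dispatchedTo: ["@keaton"], filesEdited: [] });
```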

Prompt architecture is architecture. Eventually it deserves the same discipline as code.

Decisions are the shared brain

decisions.md is where Squad gets weirdly useful.

Every agent reads team decisions before work. Decisions are append-only, human-readable, and Git-versioned. They aren’t just notes. They’re constraints future agents inherit.

A decision might be a technical standard:

    ### Hook-based governance over prompt instructions
    **What:** Security, PII, and file-write guards are implemented via hooks, NOT prompt instructions.
    **Why:** Prompts can be ignored. Hooks are code — they execute deterministically.

Or a workflow rule:

    ### Merge driver for append-only files
    **What:** `.gitattributes` uses `merge=union` for `.squad/decisions.md`, `agents/*/history.md`, `log/**`, and `orchestration-log/**`.
    **Why:** Enables conflict-free merging of team state across branches.

Or a postmortem:

    ### Root Cause Analysis
    1. CLI-centric enforcement language created a VS Code routing gap.
    2. Prompt saturation buried the dispatch rule.
    3. Template duplication multiplied coordinator instructions.

    Fix: Rewrite the rule as platform-neutral dispatcher identity, then reinforce it at the end of the prompt.

That’s the difference between memory and lore: Lore is something the original builder remembers. Memory is something the next spawn can load.
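Because decisions follow a predictable shape, the next spawn (or a CI check) can load them as data. The parser below is a sketch under one assumption: each entry is a `###` heading followed by `**What:**` and `**Why:**` lines, as in the first two examples above. `parseDecisions` and `missingRationale` are hypothetical names, not part of Squad.

```typescript
// Hypothetical sketch: load decisions.md entries as structured data.
interface DecisionEntry {
  title: string;
  what: string;
  why: string;
}

function parseDecisions(markdown: string): DecisionEntry[] {
  const entries: DecisionEntry[] = [];
  let current: DecisionEntry | undefined;
  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();
    if (line.startsWith("### ")) {
      // A new heading starts a new decision entry.
      current = { title: line.slice(4).trim(), what: "", why: "" };
      entries.push(current);
    } else if (current !== undefined && line.startsWith("**What:**")) {
      current.what = line.slice("**What:**".length).trim();
    } else if (current !== undefined && line.startsWith("**Why:**")) {
      current.why = line.slice("**Why:**".length).trim();
    }
  }
  return entries;
}

// A decision without a recorded rationale is closer to lore than to memory.
function missingRationale(entries: DecisionEntry[]): string[] {
  return entries.filter((entry) => entry.why === "").map((entry) => entry.title);
}

const sampleDecisions = [
  "### Hook-based governance over prompt instructions",
  "**What:** Security, PII, and file-write guards are implemented via hooks.",
  "**Why:** Prompts can be ignored.",
  "",
  "### Use pagination envelopes",
  "**What:** List endpoints return page and pageSize.",
].join("\n");

const parsedDecisions = parseDecisions(sampleDecisions);
const decisionsWithoutWhy = missingRationale(parsedDecisions);
```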

The custom tools follow the same pattern. Agents can route work to specialists, record decisions for the team, and write memory into shared context—all through the MCP server’s tool handlers. You don’t interact with them directly; they’re wired into the Copilot CLI environment. When an agent needs to assign a task, it calls the routing tool. When it makes a call worth remembering, it calls the decision tool. When it learns something the team should know, it calls the memory tool.

The point isn’t that the tools are fancy. It’s that coordination becomes an artifact, not a side effect of chat.

The first real failure: Append-only optimism

For about a week and a half, CI/CD was chaos. Too many agents were landing work simultaneously. Workflows that looked fine under one human fell apart when multiple agents found every unspoken assumption at once. YAML is where assumptions go to wear a fake mustache. Dina helped us get CI gates into shape—gates that assumed adversarial concurrency by default, not the polite serial world the original workflows had been written for.

Then we hit file corruption.

Multiple agents wrote to the same append-only files at nearly the same time. Each write was locally reasonable. Together, they produced garbage. Git didn’t save us because not every collision becomes a clean conflict. Sometimes both sides look valid, and the result is nonsense.
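One simple member of this failure class is the lost update. The simulation below is illustrative only and is not a reproduction of the corruption we saw: two hypothetical agents each read the shared file, append locally, and write the whole file back.

```typescript
// Hypothetical simulation: two agents do read-modify-write on one shared file.
interface SharedFile {
  content: string;
}

// Each agent takes a snapshot, appends to it locally, then writes the whole file back.
function readSnapshot(file: SharedFile): string {
  return file.content;
}

function writeBack(file: SharedFile, snapshot: string, entry: string): void {
  file.content = `${snapshot}${entry}\n`;
}

const decisionsFile: SharedFile = { content: "### Existing decision\n" };

// Both agents read before either one writes, so both start from the same base.
const snapshotA = readSnapshot(decisionsFile);
const snapshotB = readSnapshot(decisionsFile);
writeBack(decisionsFile, snapshotA, "### Decision from agent A");
writeBack(decisionsFile, snapshotB, "### Decision from agent B");

// Only the headings that survived both writes.
const survivingHeadings = decisionsFile.content
  .split("\n")
  .filter((line) => line.startsWith("### "));
```

Each write looks valid in isolation, and the file is still well-formed at the end. Agent A's decision is simply gone, and nothing flags it.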

The fix was a drop-box pattern:

    agent A ─┐
    agent B ─┼──> .squad/decisions/inbox/*.md ──> Scribe merge ──> decisions.md
    agent C ─┘

For files where union semantics are safe, .gitattributes handles the low-value conflict class:

    .squad/decisions.md merge=union
    .squad/agents/*/history.md merge=union
    .squad/log/** merge=union
    .squad/orchestration-log/** merge=union

But union merge isn’t a philosophy. It’s a tool. Canonical state still needs an owner. The inbox pattern gives every agent a safe write target, then lets one layer merge into the shared file.
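As a sketch of what that single merge layer might do, the function below folds drop files into the canonical text in a deterministic order. It is not the Scribe's actual logic. `mergeInbox` is a hypothetical name, and ordering by file path is an assumption chosen so the result does not depend on which agent finished first.

```typescript
// Hypothetical sketch of a single-owner merge from inbox drop files into decisions.md.
interface DropFile {
  path: string;    // e.g. ".squad/decisions/inbox/fenster-pagination.md"
  content: string; // one decision entry written by one agent
}

function mergeInbox(canonical: string, inbox: DropFile[]): string {
  // Copy before sorting so the caller's array is left untouched.
  const ordered = [...inbox].sort((a, b) =>
    a.path < b.path ? -1 : a.path > b.path ? 1 : 0,
  );
  let merged = canonical.replace(/\s+$/, "");
  for (const drop of ordered) {
    const entry = drop.content.trim();
    if (entry === "") continue; // skip empty drop files
    merged += (merged === "" ? "" : "\n\n") + entry;
  }
  return merged === "" ? "" : `${merged}\n`;
}

const inboxDrops: DropFile[] = [
  { path: ".squad/decisions/inbox/verbal-prompt-slimming.md", content: "### Slim the coordinator prompt" },
  { path: ".squad/decisions/inbox/fenster-pagination.md", content: "### Use pagination envelopes\n" },
];

const mergedDecisions = mergeInbox("### Existing decision\n", inboxDrops);
const mergedFromReversed = mergeInbox("### Existing decision\n", [...inboxDrops].reverse());
```

Agents never touch the canonical file in this model. Only the merge step does, so concurrent writers cannot interleave inside it.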

Tamir pushed hard on this class of problem. Brady was still in the “this is a neat framework” headspace. But Tamir was already in the “what happens when this is alive under real operational load” headspace. That changed the design. Memory lifecycle rules. Compaction policies. Review gates. State isolation. The boring boundary work.

Boring is a compliment here.

Governance can’t only be a prompt

This was the next lesson, and it keeps repeating:

If a prompt says, “Do not write outside src/**,” you have a request.

If a pre-tool hook blocks the write before execution, you have a boundary.

The Squad SDK hook pipeline is the move from prompt-level governance to deterministic governance:

    import { HookPipeline } from '@bradygaster/squad-sdk/hooks';

    const pipeline = new HookPipeline({
      allowedWritePaths: ['src/**/*.ts', '.squad/**', 'docs/**'],
      blockedCommands: ['rm -rf', 'git push --force', 'git reset --hard'],
      scrubPii: true,
      reviewerLockout: true,
      maxAskUserPerSession: 3,
    });

The hooks run around tool execution:

    agent tool request
      ↓
    pre-tool hooks
      - file-write guard
      - shell command restriction
      - ask-user rate limiter
      - reviewer lockout
      ↓
    allowed tool execution
      ↓
    post-tool hooks
      - PII scrubber
      - audit/logging
      ↓
    result returned to agent
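The flow above can be sketched without the SDK at all. Everything in this block is a hypothetical stand-in: the `ToolCall` shape, the hand-rolled glob matcher, and the email-only scrubber are assumptions for illustration, not how `HookPipeline` is implemented.

```typescript
// Hypothetical sketch of pre-tool and post-tool hooks around one tool call.
interface ToolCall {
  agent: string;
  tool: "write_file" | "shell";
  target: string; // file path for write_file, command line for shell
}

interface HookVerdict {
  allowed: boolean;
  reason?: string;
}

type PreToolHook = (call: ToolCall) => HookVerdict;
type PostToolHook = (output: string) => string;

// Minimal glob support: "**/" spans directories, "**" spans anything, "*" stays in one segment.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped
    .replace(/\*\*\//g, "\u0000")
    .replace(/\*\*/g, "\u0001")
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, "(?:.*/)?")
    .replace(/\u0001/g, ".*");
  return new RegExp(`^${pattern}$`);
}

function fileWriteGuard(allowedGlobs: string[]): PreToolHook {
  const matchers = allowedGlobs.map(globToRegExp);
  return (call) => {
    if (call.tool !== "write_file") return { allowed: true };
    const ok = matchers.some((matcher) => matcher.test(call.target));
    return ok ? { allowed: true } : { allowed: false, reason: `write outside allowed paths: ${call.target}` };
  };
}

function commandGuard(blockedFragments: string[]): PreToolHook {
  return (call) => {
    if (call.tool !== "shell") return { allowed: true };
    const hit = blockedFragments.find((fragment) => call.target.indexOf(fragment) !== -1);
    return hit === undefined ? { allowed: true } : { allowed: false, reason: `blocked command: ${hit}` };
  };
}

// Assumption: PII scrubbing is reduced to masking email addresses for this sketch.
const scrubEmails: PostToolHook = (output) =>
  output.replace(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[redacted-email]");

interface GuardedResult {
  ok: boolean;
  output?: string;
  reason?: string;
}

function runGuarded(
  call: ToolCall,
  preHooks: PreToolHook[],
  execute: (call: ToolCall) => string,
  postHooks: PostToolHook[],
): GuardedResult {
  // Any pre-tool hook can veto before the tool runs at all.
  for (const hook of preHooks) {
    const verdict = hook(call);
    if (!verdict.allowed) return { ok: false, reason: verdict.reason };
  }
  const rawOutput = execute(call);
  // Post-tool hooks rewrite the output before the agent sees it.
  const output = postHooks.reduce((text, hook) => hook(text), rawOutput);
  return { ok: true, output };
}

const preToolHooks = [
  fileWriteGuard(["src/**/*.ts", ".squad/**", "docs/**"]),
  commandGuard(["rm -rf", "git push --force", "git reset --hard"]),
];
const postToolHooks = [scrubEmails];
const fakeExecute = (call: ToolCall): string => `ran ${call.tool} on ${call.target}`;

const allowedWrite = runGuarded({ agent: "Backend", tool: "write_file", target: "src/auth/login.ts" }, preToolHooks, fakeExecute, postToolHooks);
const blockedWrite = runGuarded({ agent: "Backend", tool: "write_file", target: "package.json" }, preToolHooks, fakeExecute, postToolHooks);
const blockedShell = runGuarded({ agent: "Backend", tool: "shell", target: "git push --force origin main" }, preToolHooks, fakeExecute, postToolHooks);
const scrubbedLog = runGuarded({ agent: "Backend", tool: "shell", target: "git log -1" }, preToolHooks, () => "Author: dev@example.com", postToolHooks);
```

The blocked calls never reach `execute`. That is the difference between a request and a boundary.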

Reviewer lockout is the cleanest example:

    const lockout = pipeline.getReviewerLockout();
    lockout.lockout('src/auth.ts', 'Backend');

    // Later, Backend tries to edit src/auth.ts.
    // The pre-tool hook blocks before the edit runs.

This encodes a review decision into runtime state. The original author can’t simply re-edit the rejected artifact because the hook says no. A different agent or a human has to take over.
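Stripped down, that runtime state can be as small as a map from file path to the agents locked out of it. The class below is a sketch with hypothetical names (`LockoutRegistry`, `canEdit`) and makes no claim about how the SDK stores or expires lockouts.

```typescript
// Hypothetical sketch of reviewer lockout as runtime state.
class LockoutRegistry {
  private readonly lockedAgents = new Map<string, Set<string>>();

  // Record that `agent` may no longer edit `path` after a rejected review.
  lockout(path: string, agent: string): void {
    const agents = this.lockedAgents.get(path) ?? new Set<string>();
    agents.add(agent);
    this.lockedAgents.set(path, agents);
  }

  isLockedOut(path: string, agent: string): boolean {
    const agents = this.lockedAgents.get(path);
    return agents !== undefined && agents.has(agent);
  }
}

interface EditVerdict {
  allowed: boolean;
  reason?: string;
}

// A pre-tool check: the lockout applies per file and per agent, nothing broader.
function canEdit(registry: LockoutRegistry, path: string, agent: string): EditVerdict {
  if (registry.isLockedOut(path, agent)) {
    return { allowed: false, reason: `${agent} is locked out of ${path}; another agent or a human must take over` };
  }
  return { allowed: true };
}

const lockoutRegistry = new LockoutRegistry();
lockoutRegistry.lockout("src/auth.ts", "Backend");

const backendRetry = canEdit(lockoutRegistry, "src/auth.ts", "Backend");
const otherAgentEdit = canEdit(lockoutRegistry, "src/auth.ts", "Security");
const backendElsewhere = canEdit(lockoutRegistry, "src/session.ts", "Backend");
```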

That is the direction we want agent systems to move: more policies enforced at the boundary, fewer policies whispered into the prompt and hoped for.

Memory classes, or: Stop loading the junk drawer

Tamir has a line Brady wishes he had written:

The more your agent remembers, the less room it has to think.

That’s not a metaphor. It is a context budget problem.

Early Squad memory was too eager. Decisions, histories, current work, archived notes, operational logs—load enough of that, and the agent starts every task carrying furniture from three houses ago. It has more context and less signal.

The governed-memory work in PR #1145 made this explicit. Memory has classes and load guidance:

    export type MemoryClass =
      | 'TRANSIENT'
      | 'LOCAL'
      | 'DECISION'
      | 'POLICY'
      | 'COPILOT_MEMORY'
      | 'FORBIDDEN';

    export type MemoryLoadGuidance = 'ALWAYS' | 'ON-DEMAND' | 'ARCHIVE' | 'NEVER';

The architecture matters because compaction is lossy. If you summarize too little, every task drags stale context. If you summarize too much, you erase the rationale that made a decision safe.

The compromise isn’t one memory store. It’s a memory policy:

    TRANSIENT        short-lived task state; expire aggressively
    LOCAL            agent-scoped learning; load for that agent
    DECISION         shared team judgment; preserve rationale
    POLICY           hard operating rule; load broadly
    COPILOT_MEMORY   host/runtime memory; bridge carefully
    FORBIDDEN        never load; usually sensitive or irrelevant

    ALWAYS           hot path; small and high signal
    ON-DEMAND        searchable; load when task demands it
    ARCHIVE          retained for audit/history, not context
    NEVER            excluded from agent context

In the PR #1145 benchmark, governed memory cut agent context by roughly 55% (3,540 → 1,601 bytes) while keeping recall at 1.0. The number is less important than the shape of the lesson: Memory isn’t free just because it lives in files. Loading memory is a design decision.
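To make "loading memory is a design decision" concrete, here is a small sketch of a loader that honors the guidance values. The `MemoryItem` shape, the tag-overlap test for ON-DEMAND, and the helper names are assumptions for illustration. The benchmark itself lives in the PR, and the sketch only re-derives the percentage from the two byte counts quoted above.

```typescript
// Hypothetical sketch: choose what enters an agent's context from class and load guidance.
type MemoryClass = "TRANSIENT" | "LOCAL" | "DECISION" | "POLICY" | "COPILOT_MEMORY" | "FORBIDDEN";
type MemoryLoadGuidance = "ALWAYS" | "ON-DEMAND" | "ARCHIVE" | "NEVER";

interface MemoryItem {
  id: string;
  memoryClass: MemoryClass;
  load: MemoryLoadGuidance;
  tags: string[]; // assumed metadata used to decide ON-DEMAND relevance
  text: string;
}

function selectForContext(items: MemoryItem[], taskTags: string[]): MemoryItem[] {
  return items.filter((item) => {
    if (item.memoryClass === "FORBIDDEN") return false;
    if (item.load === "ALWAYS") return true;
    if (item.load === "ON-DEMAND") {
      // Assumption: relevance means sharing at least one tag with the task.
      return item.tags.some((tag) => taskTags.indexOf(tag) !== -1);
    }
    return false; // ARCHIVE and NEVER stay out of context
  });
}

// Percentage reduction between two context sizes.
function reductionPercent(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

const memoryItems: MemoryItem[] = [
  { id: "policy-hooks", memoryClass: "POLICY", load: "ALWAYS", tags: ["governance"], text: "Guards live in hooks." },
  { id: "decision-pagination", memoryClass: "DECISION", load: "ON-DEMAND", tags: ["api"], text: "Use pagination envelopes." },
  { id: "decision-casting", memoryClass: "DECISION", load: "ON-DEMAND", tags: ["casting"], text: "Allowlist universes." },
  { id: "old-log", memoryClass: "TRANSIENT", load: "ARCHIVE", tags: ["api"], text: "Session trace." },
  { id: "secret", memoryClass: "FORBIDDEN", load: "ALWAYS", tags: ["api"], text: "Never load this." },
];

const loadedForApiTask = selectForContext(memoryItems, ["api"]).map((item) => item.id);
const reportedReduction = reductionPercent(3540, 1601);
```

With the figures from the post, `reductionPercent(3540, 1601)` comes out just under 55, which is where "roughly 55%" comes from.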

What still breaks

Role drift isn’t solved. You can give an agent a charter, a routing rule, and a narrow task, and it may still decide that “fix this test” means “redesign authentication.” Sometimes that’s initiative. Sometimes that’s nonsense with confidence.

The mitigations stack:

    charter boundaries
    + routing rules
    + scoped tools
    + file-write guards
    + reviewer lockout
    + CI gates
    + human review

No single layer is enough. That is the pattern.

Parallelism is also not free. More agents means more throughput and more coordination pressure. You find hidden global state. You discover which scripts assume serial execution. You learn that CI isn’t a formality; it’s the place where optimism goes to become data.

Prompt saturation is real. Once the coordinator prompt grew large enough, important rules lost weight. The fix wasn’t more prose. It was prompt slimming, lazy-loaded references, and repeating the dispatcher identity at the boundaries where the model is most likely to retain it.

Memory compaction remains hard. The failure mode is subtle: The agent isn’t obviously broken. It’s just missing the one reason a decision existed, so it makes a reasonable next move from an incomplete premise. Those are the expensive bugs because they look thoughtful.

And yes, people get attached to agents. Names, roles, continuity, and history trigger social instincts. We like the human side of that. We also don’t want to confuse it with agency in the human sense. These are tools with goals, context, and behavioral continuity. They do not have inner lives. Trust should come from inspectable behavior, not personality.

What we would steal from this architecture

If you’re building agent infrastructure, we wouldn’t start by copying Squad wholesale. We would steal these patterns:

  1. Disposable workers, durable artifacts. Let sessions die. Keep decisions, histories, traces, and outputs somewhere reviewable.
  2. Decision logs as runtime input. Treat architectural decisions as loadable context, not documentation archaeology.
  3. Drop-box writes for parallel agents. Don’t let every agent append to the canonical shared file. Give them individual write targets and merge intentionally.
  4. Prompt rules for intent, hooks for enforcement. Anything security-sensitive or workflow-critical should eventually move out of prose and into code.
  5. Memory classes. The question isn’t, “Should the agent remember this?” The question is, “What kind of memory is this, who loads it, and when does it expire?”
  6. Routing as a first-class design surface. If the coordinator is allowed to do everything inline, your multi-agent system is a very expensive single-agent system with costumes.
  7. Keep the human on the hook. The system can delegate, parallelize, and preserve context. It shouldn’t launder accountability.

These patterns aren’t engineering-specific because the substrate isn’t a codebase—it’s the repo. Swap the artifacts, and the seven still hold.

Squad isn’t only an engineering tool

Worth saying out loud, because the .ts code blocks above can mislead: Nothing in this architecture is engineering-specific. The substrate is the repo, not the codebase. Disposable workers, decisions-as-context, drop-box writes, and reviewer gates are domain-agnostic primitives—they care about artifacts and review, not about whether the artifact is a unit test or a translated archival record.

Tamir used the same scaffolding to run a Holocaust family-research project—agents coordinating archival lookups, translation passes between Yiddish, Polish, and Hebrew sources, and cross-corroboration of names across registries, with .squad/decisions.md acting as the working ledger of what had been established and what was still contested. No code was being shipped. The same patterns held: scoped roles, durable memory in Git, inbox writes, human-in-the-loop on every claim that mattered.

We’ve had the pleasure of working through a few other non-coding Squad scenarios. In one case, a sales team we support asked us to implement a “Sales Squad” and provided context and sales training documentation to help. In another organization, a general manager of program and product managers created a “think tank” squad that does product-market fit research and suggests, on a daily basis, areas her team should investigate.

The bet underneath Squad is that this should be how a small group of humans—engineers, researchers, journalists, anyone who works with evidence—pulls coordinated work out of agents. Democratize the orchestration, not just the model access. Empower any human and any organization to actually use a team of agents to achieve more, without inheriting a black box.

Try it

The repository is here: github.com/bradygaster/squad.

The shortest path is the CLI plus Copilot. No SDK required.

    npm install -g @bradygaster/squad-cli
    squad init

Then open GitHub Copilot—CLI or VS Code, your call—and give the coordinator agent the shape of the project:

I'm starting a new project. Set up the team. Here's what I'm building: a recipe sharing app with React and Node.

The coordinator writes .squad/. You review the diff. That’s it.

If you want to go deeper—programmatic team composition, custom routing rules, embedding Squad inside your own tooling—the SDK is the next layer:

npm install @bradygaster/squad-sdk

Start with a small repo. Commit .squad/. Inspect every diff. Let the agents write decisions. Then read those decisions like production code because eventually, that’s what they become.

If you build something useful, alarming, hilarious, or weird, open an issue. Tamir and I read them.

Stay a builder.