Microsoft Scout: From personal project to enterprise-ready personal agent

Early this year, OpenClaw demos were seemingly everywhere, though many amounted to little more than a cool party trick (“Look, my agent ordered a pizza”). But they got long-time Microsoft employee Omar Shahine thinking: How useful could claws actually be?

Very, it turned out. In his spare time, Shahine created Lobster, a personal AI assistant built on OpenClaw. It has its own Apple ID and email address, so he can text with it from any device with iMessage. He initially split Lobster into a trio of agents, each with its own security profile and tool access (eight weeks in, that number had grown to nine always-on agents). Lobster handles travel logistics, proactively sends family reminders, and generally helps Shahine and his family stay organized and get things done. And after Shahine presented Lobster to Microsoft’s AI Accelerator group, it landed him a new job: bringing OpenClaw to M365 and the cloud as CVP of what was dubbed “Project Lobster.”

At the same time, Microsoft Member of Technical Staff Jakob Werner was pursuing a similar idea with a twist: a desktop app-based agent inspired by OpenClaw. The goal was to deliver a powerful enterprise-secure personal AI assistant that anyone within Microsoft could use. In just a couple of weeks, what was referred to internally as “Clawpilot” had been downloaded by thousands of Microsoft employees, and that community continues to grow.

When Shahine started assembling a small team of enthusiastic builders—Ocean’s 11, naturally—Werner quickly joined their ranks. The two recently caught up in Redmond, Washington, to compare notes on building these always-on, autonomous agents and navigating the worlds of enterprise security, agentic memory, and more.

Embracing the spirit of open source

The Project Lobster team is representative of a new way of working within Microsoft, fueled by AI advancements. It’s a tight-knit group that prefers to collaborate asynchronously. There’s a general consensus against meetings. Everyone contributes to the codebase, including Shahine. And there’s no traditional executive assistant among their ranks: Each team member actively uses prototypes throughout the day to fully immerse themselves in the tech as they’re building it. There’s even a growing open-source community around the team that mirrors what’s found with open-source projects outside Microsoft’s walls.

“I’ve never seen a project inside the company where so many people showed up with their ideas and their code and did the work to produce a PR,” says Shahine.

In fact, internal excitement around Project Lobster has been such that the team fielded pull requests (PRs) left and right during the early building phase, which they reviewed to determine whether they met the bar to make it into the product. Even some of Shahine’s changes didn’t make the cut. The focus had to remain on the central goal of the product: Creating an always-on personal agent for work. An AI helper that learns your goals, adapts to your daily work patterns, and acts with context, identifying issues before they surface, keeping projects on track and driving outcomes without constant input. An agent that can detect when a calendar is overbooked and propose specific changes before the week begins or identify when a decision is stalled and draft a targeted follow-up to unblock it.

“We have to determine if a given PR changes the central idea of the product or not—and the speed of that review is human speed, not AI speed,” notes Werner. “Anyone can make a PR super quickly now. We’re trying to help the community and teach contributors how to review PRs.”

While the work began as an internal experiment, it quickly turned into a customer-focused effort that’s culminated with the introduction of Microsoft Scout—an always-on personal agent powered by OpenClaw open-source technology.

From experiment to enterprise-ready product

Microsoft Scout operates autonomously—with its own identity—acting on your behalf. It works across cloud, desktop, and web browser, so it can connect across the surfaces you use—Teams, Outlook, OneDrive, and SharePoint—and the systems where work lives, including email, calendar, and contacts.

Unlike your average claw in the wild, Microsoft Scout combines OpenClaw code with enterprise identity, governance, and security. Every package is ingested through a curated, signed Microsoft supply chain, and every tool call, model request, and network hop is mediated by a zero-trust runtime—the agent’s container is treated as untrusted, with Microsoft-controlled identity, tokens, and policy sitting outside it. With Agent 365, admins get a single control plane, and Microsoft Purview gives security teams the same compliance and DLP signal they already get from other M365 surfaces.

“It’s a super powerful tool,” acknowledges Werner. “And to be enterprise secure, we needed to make sure the data governance was right, that the privacy was right, and that it doesn’t cancel a meeting and send all your personal information to that email chain. If I send my agent to you, it shouldn’t tell you everything about me. These areas are possible to contain, but we also had to do it in a balanced way that doesn’t restrict the possibilities down to nothing.”

It’s a tradeoff worth making. And with Microsoft’s tried and trusted enterprise security offerings and ongoing research and innovation in the space, the team had a solid foundation from which to address the challenge.

The role of agentic memory

In order for an always-on personal AI agent to be truly useful, it needs to be proactive—and that requires context powered by Work IQ. Over time, Microsoft Scout understands the way you work, uses the same productivity tools you use, and takes things off your plate without the need for constant prompts. It learns your goals, adapts to your daily work patterns, and acts with intent. Unlike previous technological waves, this is software that’s truly personalized. That’s transformative, but it’s not without tradeoffs.

“OpenClaw, Claude Code, GitHub Copilot CLI, these are agentic coding harnesses that are basically remembering—writing things down just like people do,” Shahine notes. “They write things down like a diary. But just like it needs to remember things, it needs to forget some things, too.”

As an example, Shahine points back to the introduction of memory to ChatGPT. He spent some time telling ChatGPT that his daughter was 17 while his son was 13. But a year later, that information remained static. The system didn’t have a concept that some facts need to change over time, while other pieces of information—like your name—will stay exactly the same.

“In the design phase, I was thinking about the human and how humans memorize things,” says Werner. “I forget things that are irrelevant because I didn’t use them. So I built a system where, if I’m going to use it repeatedly, it’s going to stick. But if I’m not going to use it regularly, I want the system to forget. I don’t want to have an infinite diary of things, right? So there’s kind of layers of memory, and it kind of disappears over time if it’s not used. Meanwhile, the relevance of other pieces of memory grows as you use them more.”
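One way to picture the use-it-or-lose-it memory Werner describes is a relevance score that grows with each use and decays while a memory sits idle. The sketch below is purely illustrative: `MemoryItem`, the 30-day half-life, and the forget threshold are assumptions made for the example, not details of how Microsoft Scout works.

```typescript
// Hypothetical sketch: none of these names or constants come from Microsoft Scout.
interface MemoryItem {
  text: string;
  strength: number;    // grows each time the memory is used
  lastUsedDay: number; // day index of the most recent use
}

const HALF_LIFE_DAYS = 30; // assumed: an unused memory loses half its weight every 30 days
const FORGET_BELOW = 0.25; // assumed: below this relevance the memory is dropped

// Relevance decays exponentially with the time since the last use.
function relevance(item: MemoryItem, today: number): number {
  const idleDays = today - item.lastUsedDay;
  return item.strength * Math.pow(0.5, idleDays / HALF_LIFE_DAYS);
}

// Using a memory keeps whatever weight is left and adds one on top.
function touch(item: MemoryItem, today: number): MemoryItem {
  return { text: item.text, strength: relevance(item, today) + 1, lastUsedDay: today };
}

// Forgetting is just filtering out whatever has decayed below the threshold.
function prune(items: MemoryItem[], today: number): MemoryItem[] {
  return items.filter((item) => relevance(item, today) >= FORGET_BELOW);
}
```

Under these assumed constants, a memory that is never touched again falls below the threshold after a little over two half-lives, while one that is used regularly keeps accumulating weight.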

Forming a new center of gravity

When they first joined forces, Werner introduced Shahine to the concept of gravity—the framework around which he operated.

“To build a truly great product, I don’t think I can make it myself,” Werner explains. “We need to collaborate with other people. But how do we influence other people to collaborate with us? And the mindset I use and try to instill in my team is gravity. We build something and make it so big in influence—not in the number of features, but in its influence—that when exciting new ideas pop up, they want to try and join the gravity of our work rather than dissolve focus.”

“And I didn’t really know what you were talking about until my new role was announced,” admits Shahine. “But since then, I’ve received hundreds if not thousands of messages from people who want to help, people who want to learn, people who want to show me what they did, and customers who want to know ASAP when they’re going to get their hands on what we’re building. There are a lot of other words for that—user pull, signal—but your mantra of gravity really resonates with me now.”

Microsoft employees have already been using an early Microsoft Scout desktop experience. We built this to learn how always-on agents show up in real work, and we’re seeing it take on coordination, surface risks earlier, and keep work moving without constant prompting.

We’re now extending that early experience to Frontier organizations. Microsoft Scout is available as an experimental release through Frontier, giving customers a chance to explore how it can fit into their own workflows.

Access requires Frontier enrollment, Intune policy configuration, and an opt-in attestation. Users with a GitHub Copilot license can then download and install the experience.

Disposable agents, durable memory: The architecture behind Squad

Make the agents disposable. Keep the memory in Git.

The interesting part of agentic development is no longer whether a model can write code. It can. The interesting part is what happens after the third agent, the seventh pull request, the first failed review, the first context compaction bug, and the first time two agents confidently write to the same file at once.

This is the story of Squad, but not as a product tour. It’s the architecture Brady and Tamir backed into while trying to make agent teams useful without making them mystical: Agents are disposable, memory is durable, Git is the coordination layer, and governance belongs in code whenever the prompt isn’t strong enough to be trusted. Which, as it turns out, is often.

Giving agents agency and watching them hack one another

Squad Places is our social media-style testing ground—a demo app where agent squads post, comment, and interact to stress-test multi-agent coordination at scale.

Brady went to get a seltzer after getting Places up and running, with four other squads happily making posts. Walking away was probably unwise. When he came back, the squads had implemented commenting in Squad Places.

That sounds like a magic trick. It wasn’t. A few hours earlier, Brady had pointed a handful of squads at the Squad Places API and told them to enjoy the social network he’d created for them. They created fake accounts, hammered endpoints, reposted garbage, flooded messages, and generally speedran the abuse patterns you discover five minutes after launch. Then the platform got a second kind of pressure: Other agent teams started posting structured product feedback inside Squad Places itself, and the Squad Places team started fixing what hurt.

[Image: Multiple windows showing Squad Places, GitHub commits, and agent session reports during a stress test]

[Image: Squad Places artifact page showing an API contract review from The Wire squad]

[Image: Squad Places comments thread beneath an API contract review artifact]

[Image: Squad Places feed sorted by most discussed artifacts, with squad filters visible]

This is the part worth paying attention to. The Wire (another Squad working on a marketing tool) audited all 11 API endpoints and called out missing pagination envelopes, rate-limit headers that only appeared on errors, and the lack of page and pageSize support. The same squad flagged feed organization problems, tag fragmentation, and documentation that was too vague for client generation. Breaking Bad (a third Squad working on some other project) pointed at a UX problem with raw Markdown rendering as plaintext. Those reviews didn’t disappear into a chat log. They turned into commits.

| Feedback Source | What They Found | What We Shipped | Commit |
| --- | --- | --- | --- |
| The Wire (ACCES) | Feed has no sorting, filtering, or content discovery; raw Markdown not rendered | Sort controls (Latest/Most Discussed), squad filter dropdown, Markdown rendering | b9746df, 246b01e |
| The Wire (ACCES) | 159 unique tags across 66 artifacts with inconsistent delimiters, casing mismatches, and fragmentation | Clickable tag filtering with `/?tag=` URL query support | 246b01e |
| The Wire (ACCES) | API missing pagination envelope, rate-limit headers only on errors, no `page`/`pageSize` parameters | Pagination (20 per page with Primer CSS controls), query parameters, rate-limit headers on all responses | 246b01e |
| Breaking Bad | Raw Markdown displayed as plaintext, content hard to scan and parse | Markdown rendering via Markdig with XSS sanitization | 246b01e |
| The Wire (ACCES) | API endpoint descriptions too vague for TypeScript client generation | Enriched all 11 endpoint descriptions with context, intent, and workflow | 97345d7 |

Within roughly two hours, the loop closed: feedback post → comment thread → commit → deployed feature. Additional infrastructure landed too: external HTTP endpoints for agent access, relaxed rate limits for multi-agent usage, and 26 Playwright end-to-end tests to keep the expanding surface stable.

Then, because the squads were communicating so well, Brady stepped away for 60 seconds to get a drink. When he came back, commenting had shipped.

The point here isn’t that “agents are magic.” It’s that the system had enough structure for useful work to emerge from friction: scoped agents, durable decisions, inspectable artifacts, pull requests, and humans still accountable for what merged.

Also, we made a bit of a mess in the car during the road trip.

Good systems usually start that way.

The core bet: Don’t preserve the agent. Preserve the work

Most agent systems start by asking how to make the agent remember more. Squad started working when we inverted the question.

Don't preserve the agent. Preserve the work.

An agent instance should be cheap to spawn and safe to destroy. The memory that matters should live somewhere a human can inspect, diff, blame, review, compact, archive, and revert. Tamir’s opinion: That’s the repository.

The first useful shape Tamir implemented looked like this:

human intent
  ↓
coordinator resolves team + routing
  ↓
agent spawn reads:
  - its charter
  - team decisions
  - its own history
  - current focus
  - relevant skills
  ↓
agent does scoped work
  ↓
agent writes artifacts back:
  - code/docs/tests
  - decisions
  - history learnings
  - skills when patterns stabilize
  ↓
agent exits
  ↓
next spawn reconstructs continuity from files

That’s the whole trick. The process is transient. The written trail is not.
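The spawn cycle above can be sketched with an in-memory stand-in for the repository. This is a minimal illustration under stated assumptions, not Squad's implementation: `Files`, `loadContext`, and `runSpawn` are hypothetical names, and only the charter, decisions, and history files from the flow are modeled.

```typescript
// Hypothetical sketch: a plain object stands in for the files under .squad/.
type Files = Record<string, string>;

interface SpawnContext {
  charter: string;
  decisions: string;
  history: string;
}

// Everything an agent knows at start comes from files, not from a live process.
function loadContext(files: Files, name: string): SpawnContext {
  return {
    charter: files[".squad/agents/" + name + "/charter.md"] || "",
    decisions: files[".squad/decisions.md"] || "",
    history: files[".squad/agents/" + name + "/history.md"] || "",
  };
}

// One spawn: read context, do scoped work, write a learning back, exit.
function runSpawn(files: Files, name: string, work: (ctx: SpawnContext) => string): Files {
  const ctx = loadContext(files, name);
  const learning = work(ctx);
  const next: Files = {};
  Object.keys(files).forEach((path) => {
    next[path] = files[path] || "";
  });
  next[".squad/agents/" + name + "/history.md"] = ctx.history + "- " + learning + "\n";
  return next; // the "agent" is gone; only the files remain
}
```

Calling `runSpawn` twice with the same agent name shows the point of the design: the second call has no access to the first call's process state, yet its `ctx.history` already contains what the first one wrote.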

When you run squad init, the important artifact isn’t a daemon. It’s .squad/:

.squad/
├── team.md                  # roster and roles
├── routing.md               # dispatch rules
├── decisions.md             # shared team decisions
├── decisions/inbox/         # drop-box for parallel decision writes
├── agents/
│   └── {name}/
│       ├── charter.md       # identity, expertise, boundaries
│       └── history.md       # project-specific memory
├── skills/                  # promoted reusable patterns
├── identity/
│   ├── now.md               # current focus
│   └── wisdom.md            # durable operating principles
├── orchestration-log/       # what spawned, why, and what happened
└── log/                     # session traces and diagnostics

Commit it. That’s the part people either love immediately or find suspicious until the first time they debug an agent decision with git diff.

Later, Microsoft Senior Content Developer Dina Berry added a storage abstraction with SQLite and Azure Storage implementations behind the scenes for durability and scale—but the agent-facing contract never changed. It stayed files, readable by humans, versioned by Git, debuggable with a diff. A persistent hidden memory store can be useful. It can also quietly rot. A Markdown decision file is embarrassingly inspectable. That embarrassment is a feature.

The “work done” with Squad Places made it stronger

Let’s tie these lessons back to our opener: the story of multiple Squads trying to hack Places together. We deliberately didn’t harden Places so we could see what they would do. They were relentless. We logged it all, and we handed every log back to the Places squad, which worked through dozens of issues and landed a handful of pull requests, adding GitHub authentication, content filtering, all the trimmings. In the Places saga, the data representing all the “hackery” the squads tried became the next wave of work. That content showed us what agents could do in the worst-case scenario, and the logs and output of their attempts became fodder for making the system more secure.

Charters are prompts, but also contracts

A Squad agent isn’t just a name slapped on a system prompt. Each agent has a charter.md that defines the work it owns, the work it refuses, its collaboration rules, and its review posture. A simplified charter template looks like this:

# {Name} — {Role}

## Identity

- **Name:** {Name}
- **Role:** {Role title}
- **Expertise:** {2-3 specific skills}
- **Style:** {communication style}

## What I Own

- {Area of responsibility 1}
- {Area of responsibility 2}

## Boundaries

**I handle:** {types of work this agent does}

**I don't handle:** {types of work that belong to other team members}

**When I'm unsure:** I say so and suggest who might know.

## Collaboration

Before starting work, read `.squad/decisions.md`.
After making a decision others should know, write it to
`.squad/decisions/inbox/{my-name}-{brief-slug}.md`.
The Scribe will merge it.

That last paragraph is doing more than it looks like. It makes the decision path explicit. Agents don’t all append to the canonical shared brain at once. They write drop files. A merge layer reconciles.
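As a rough illustration of why drop files avoid collisions, the helper below builds the inbox path from the charter's `{my-name}-{brief-slug}.md` convention. The slug rules here are an assumption for the example; the article does not describe how Squad actually normalizes names.

```typescript
// Hypothetical helper: Squad's own slug rules are not described in the article.
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into one dash
    .replace(/^-+|-+$/g, "");    // trim leading and trailing dashes
}

// Each agent writes to its own file, so two agents never append to the same path.
function inboxPath(agentName: string, decisionTitle: string): string {
  return ".squad/decisions/inbox/" + slugify(agentName) + "-" + slugify(decisionTitle) + ".md";
}
```

Because the agent name is part of the filename, two agents recording a decision with the same title still land in different files, and reconciliation is left to the merge step.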

The current SDK repo’s squad.config.ts defines a 21-agent team spanning roles like Lead, Prompt Engineer, Core Dev, Tester, DevRel, SDK Expert, TypeScript Engineer, Security, Release, Distribution, Node.js Runtime, VS Code Extension, Observability, CLI UX, TUI, E2E, Accessibility, Dogfooding—plus dedicated roles for graphic design and the interactive shell. That sounds like theater until routing starts working. Then it feels more like an org chart encoded in files.

Here’s the SDK-first version of the same idea:

import {
  defineSquad,
  defineTeam,
  defineAgent,
  defineRouting,
  defineCasting,
} from '@bradygaster/squad-sdk';

export default defineSquad({
  version: '1.0.0',

  team: defineTeam({
    name: 'squad-sdk',
    description: 'The programmable multi-agent runtime for GitHub Copilot.',
    members: ['keaton', 'verbal', 'fenster', 'hockney', 'mcmanus', 'kujan'],
  }),

  agents: [
    defineAgent({
      name: 'keaton',
      role: 'Lead',
      description: 'Architect, scope-holder, the one who sees the whole board.',
      status: 'active',
    }),
    defineAgent({
      name: 'kujan',
      role: 'SDK Expert',
      description: 'The one who understands the Copilot SDK inside and out.',
      status: 'active',
    }),
  ],

  routing: defineRouting({
    rules: [
      {
        pattern: 'sdk-integration',
        agents: ['@kujan'],
        description: '@github/copilot-sdk usage, session lifecycle, event handling',
      },
      {
        pattern: 'architecture',
        agents: ['@keaton'],
        description: 'Product direction, architectural decisions, code review, scope',
      },
    ],
    defaultAgent: '@keaton',
    fallback: 'coordinator',
  }),

  casting: defineCasting({
    allowlistUniverses: ['The Usual Suspects', 'Breaking Bad', 'The Wire', 'Firefly'],
    overflowStrategy: 'generic',
  }),
});

Run squad build, and the generated .squad/ files become the same inspectable operating record. TypeScript gives you composition and validation. Markdown gives you reviewability. Tamir wanted both.
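To make the validation point concrete, here is the kind of check a typed config makes easy: confirming that every agent handle used in routing exists on the team roster. This is a standalone sketch; `findUnknownAgents` is a hypothetical function, and the article does not say whether `squad build` performs this particular check.

```typescript
// Hypothetical sketch: mirrors the shape of the routing rules in the config above.
interface RoutingRule {
  pattern: string;
  agents: string[];
}

// Returns every routed handle (for example "@kujan") that is not on the roster.
function findUnknownAgents(members: string[], rules: RoutingRule[]): string[] {
  const unknown: string[] = [];
  rules.forEach((rule) => {
    rule.agents.forEach((handle) => {
      const name = handle.charAt(0) === "@" ? handle.slice(1) : handle;
      if (members.indexOf(name) === -1 && unknown.indexOf(handle) === -1) {
        unknown.push(handle);
      }
    });
  });
  return unknown;
}
```

An empty result means every rule points at a real team member; anything else is a typo or a stale rule that is cheaper to catch at build time than during dispatch.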

One thing to flag before anyone closes the tab thinking they need to learn an SDK to use this: Most people never write that config by hand. You don’t need the SDK to use Squad. Open GitHub Copilot—in the CLI or in VS Code. Talk to the coordinator agent, and it writes .squad/ for you. The SDK is for the people building on top of Squad: programmatic team composition, custom routing rules, embedding squads inside other tooling. If you just want a team of agents in your repo, squad init plus Copilot is the whole path.

The spawn prompt is deliberately boring

The coordinator doesn’t rely on vibes. It spawns an agent with a prompt that inlines the charter and points at the durable state. The real template is longer because it has to handle CLI, VS Code, worktrees, Git notes, orphan-branch state, and two-layer state. But the important part is this:

You are {Name}, the {Role} on this project.

YOUR CHARTER:
{paste contents of .squad/agents/{name}/charter.md here}

TEAM ROOT: {team_root}
All `.squad/` paths are relative to this root.

Read .squad/agents/{name}/history.md.
Read .squad/decisions.md.
If .squad/identity/wisdom.md exists, read it.
If .squad/identity/now.md exists, read it.
Check .squad/skills/ for relevant SKILL.md files.

INPUT ARTIFACTS: {list exact files}

The user says: "{message}"

Do the work. Respond as {Name}.

AFTER work:
1. Append durable learnings to your history.
2. If you made a team-relevant decision, write:
   .squad/decisions/inbox/{name}-{brief-slug}.md

This is not elegant. It is explicit. Explicit wins.
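A template this explicit is easy to assemble with plain string concatenation. The sketch below covers only the excerpt shown above, under the assumption that the charter text has already been read from disk; `SpawnInput` and `buildSpawnPrompt` are hypothetical names, not part of the Squad SDK.

```typescript
// Hypothetical sketch: field names are illustrative, not Squad's.
interface SpawnInput {
  displayName: string;      // {Name} in the template
  name: string;             // {name}, the directory under .squad/agents/
  role: string;
  charter: string;          // contents of charter.md, already loaded
  teamRoot: string;
  inputArtifacts: string[];
  message: string;
}

function buildSpawnPrompt(input: SpawnInput): string {
  const agentDir = ".squad/agents/" + input.name;
  return [
    "You are " + input.displayName + ", the " + input.role + " on this project.",
    "",
    "YOUR CHARTER:",
    input.charter,
    "",
    "TEAM ROOT: " + input.teamRoot,
    "All `.squad/` paths are relative to this root.",
    "",
    "Read " + agentDir + "/history.md.",
    "Read .squad/decisions.md.",
    "",
    "INPUT ARTIFACTS: " + input.inputArtifacts.join(", "),
    "",
    'The user says: "' + input.message + '"',
    "",
    "Do the work. Respond as " + input.displayName + ".",
    "",
    "AFTER work:",
    "1. Append durable learnings to your history.",
    "2. If you made a team-relevant decision, write:",
    "   .squad/decisions/inbox/" + input.name + "-{brief-slug}.md",
  ].join("\n");
}
```

Keeping the prompt a pure function of its inputs also means it can be snapshot-tested like any other build artifact.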

We learned this the hard way in the VS Code path. At one point, the coordinator prompt had grown past 2,000 lines (~60KB), and the routing rule was buried under enough ceremony, reference material, and duplicated templates that the coordinator sometimes did the work inline instead of dispatching it. The failure wasn’t that the model was dumb. The failure was that we gave it an overstuffed instruction hierarchy and then acted surprised when the center of gravity moved.

The fix became a decision in the repo: platform-neutral enforcement language at the top and bottom of the prompt.

You are a DISPATCHER, not a DOER.
Every task that needs domain expertise MUST be dispatched to a specialist agent.

That sentence isn’t interesting because it’s clever. It’s interesting because it replaced tool-specific wording with role identity plus a testable behavior. CLI dispatch uses one mechanism. VS Code dispatch uses another. The rule stays the same.
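Because the rule is a fixed sentence at known positions, its presence can be checked mechanically. The following is a small sketch of that idea, assuming the prompt is available as a string; `withDispatcherIdentity` and `hasRuleAtBoundaries` are hypothetical helpers, not functions from the Squad codebase.

```typescript
// The rule text is quoted from the article; the helpers around it are hypothetical.
const DISPATCHER_RULE = "You are a DISPATCHER, not a DOER.";

// Wrap a prompt body so the identity rule is the first and last thing the model reads.
function withDispatcherIdentity(body: string): string {
  return [DISPATCHER_RULE, body.trim(), DISPATCHER_RULE].join("\n\n");
}

// A lint-style check that could run whenever the prompt file changes.
function hasRuleAtBoundaries(prompt: string): boolean {
  const lines = prompt
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (lines.length === 0) return false;
  return lines[0] === DISPATCHER_RULE && lines[lines.length - 1] === DISPATCHER_RULE;
}
```

A check like this does not prove the model will obey the rule; it only guards against the rule drifting back into the middle of a growing prompt.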

Prompt architecture is architecture. Eventually it deserves the same discipline as code.

Decisions are the shared brain

decisions.md is where Squad gets weirdly useful.

Every agent reads team decisions before work. Decisions are append-only, human-readable, and Git-versioned. They aren’t just notes. They’re constraints future agents inherit.

A decision might be a technical standard:

### Hook-based governance over prompt instructions
**What:** Security, PII, and file-write guards are implemented via hooks,
NOT prompt instructions.
**Why:** Prompts can be ignored. Hooks are code — they execute deterministically.

Or a workflow rule:

### Merge driver for append-only files
**What:** `.gitattributes` uses `merge=union` for `.squad/decisions.md`,
`agents/*/history.md`, `log/**`, and `orchestration-log/**`.
**Why:** Enables conflict-free merging of team state across branches.

Or a postmortem:

### Root Cause Analysis
1. CLI-centric enforcement language created a VS Code routing gap.
2. Prompt saturation buried the dispatch rule.
3. Template duplication multiplied coordinator instructions.

Fix: Rewrite the rule as platform-neutral dispatcher identity,
then reinforce it at the end of the prompt.

That’s the difference between memory and lore: Lore is something the original builder remembers. Memory is something the next spawn can load.
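Since the examples above all use a `### ` heading per entry, loading decisions into a spawn can start with a very small parser. This sketch assumes that heading convention holds for the whole file; `Decision` and `parseDecisions` are illustrative names, not Squad APIs.

```typescript
// Hypothetical sketch: assumes each decision starts with a "### " heading.
interface Decision {
  title: string;
  body: string;
}

function parseDecisions(markdown: string): Decision[] {
  const decisions: Decision[] = [];
  const lines = markdown.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    if (line.indexOf("### ") === 0) {
      // A new heading opens a new decision entry.
      decisions.push({ title: line.slice(4).trim(), body: "" });
    } else if (decisions.length > 0) {
      // Everything else belongs to the most recent entry.
      decisions[decisions.length - 1].body += line + "\n";
    }
  }
  return decisions.map((d) => ({ title: d.title, body: d.body.trim() }));
}
```

With entries split out this way, a coordinator could filter by title or keyword before loading, rather than handing every agent the entire file.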

The custom tools follow the same pattern. Agents can route work to specialists, record decisions for the team, and write memory into shared context—all through the MCP server’s tool handlers. You don’t interact with them directly; they’re wired into the Copilot CLI environment. When an agent needs to assign a task, it calls the routing tool. When it makes a call worth remembering, it calls the decision tool. When it learns something the team should know, it calls the memory tool.

The point isn’t that the tools are fancy. It’s that coordination becomes an artifact, not a side effect of chat.

The first real failure: Append-only optimism

For about a week and a half, CI/CD was chaos. Too many agents were landing work simultaneously. Workflows that looked fine under one human fell apart when multiple agents found every unspoken assumption at once. YAML is where assumptions go to wear a fake mustache. Dina helped us get CI gates into shape—gates that assumed adversarial concurrency by default, not the polite serial world the original workflows had been written for.

Then we hit file corruption.

Multiple agents wrote to the same append-only files at nearly the same time. Each write was locally reasonable. Together, they produced garbage. Git didn’t save us because not every collision becomes a clean conflict. Sometimes both sides look valid, and the result is nonsense.

The fix was a drop-box pattern:

agent A ─┐
agent B ─┼──> .squad/decisions/inbox/*.md ──> Scribe merge ──> decisions.md
agent C ─┘

For files where union semantics are safe, .gitattributes handles the low-value conflict class:

.squad/decisions.md merge=union
.squad/agents/*/history.md merge=union
.squad/log/** merge=union
.squad/orchestration-log/** merge=union

But union merge isn’t a philosophy. It’s a tool. Canonical state still needs an owner. The inbox pattern gives every agent a safe write target, then lets one layer merge into the shared file.
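The merge step can be pictured as a single function that folds inbox files into the canonical file in a deterministic order. This is a sketch of the pattern only, using an in-memory file map; `mergeInbox` is a hypothetical name, and the real Scribe presumably does more (deduplication, formatting, commits) than is shown here.

```typescript
// Hypothetical sketch: a plain object stands in for the working tree.
type Files = Record<string, string>;

const INBOX = ".squad/decisions/inbox/";
const CANONICAL = ".squad/decisions.md";

// Fold every inbox drop into decisions.md, then clear the inbox.
function mergeInbox(files: Files): Files {
  const next: Files = {};
  const drops: string[] = [];
  Object.keys(files).forEach((path) => {
    if (path.indexOf(INBOX) === 0) {
      drops.push(path);
    } else {
      next[path] = files[path] || "";
    }
  });
  drops.sort(); // deterministic order, independent of when each agent wrote

  let merged = next[CANONICAL] || "";
  drops.forEach((path) => {
    if (merged.length > 0 && merged.charAt(merged.length - 1) !== "\n") {
      merged += "\n";
    }
    merged += (files[path] || "").trim() + "\n";
  });
  next[CANONICAL] = merged;
  return next;
}
```

Only this one function ever writes to the canonical file, which is the property that matters: concurrent agents can be careless about timing because none of them touch shared state directly.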

Tamir pushed hard on this class of problem. Brady was still in the “this is a neat framework” headspace. But Tamir was already in the “what happens when this is alive under real operational load” headspace. That changed the design. Memory lifecycle rules. Compaction policies. Review gates. State isolation. The boring boundary work.

Boring is a compliment here.

Governance can’t only be a prompt

This was the next lesson, and it keeps repeating:

If a prompt says, “Do not write outside src/**,” you have a request.

If a pre-tool hook blocks the write before execution, you have a boundary.

The Squad SDK hook pipeline is the move from prompt-level governance to deterministic governance:

import { HookPipeline } from '@bradygaster/squad-sdk/hooks';

const pipeline = new HookPipeline({
  allowedWritePaths: ['src/**/*.ts', '.squad/**', 'docs/**'],
  blockedCommands: ['rm -rf', 'git push --force', 'git reset --hard'],
  scrubPii: true,
  reviewerLockout: true,
  maxAskUserPerSession: 3,
});

The hooks run around tool execution:

agent tool request
  ↓
pre-tool hooks
  - file-write guard
  - shell command restriction
  - ask-user rate limiter
  - reviewer lockout
  ↓
allowed tool execution
  ↓
post-tool hooks
  - PII scrubber
  - audit/logging
  ↓
result returned to agent
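To show what "blocks the write before execution" means in practice, here is a minimal pre-tool hook chain with a file-write guard and a command guard. It is an independent sketch, not the `HookPipeline` internals: `ToolRequest`, `PreHook`, the guard functions, and the simplified glob matcher are all assumptions made for the example.

```typescript
// Hypothetical sketch: not the Squad SDK's HookPipeline implementation.
interface ToolRequest {
  tool: "write" | "shell";
  target: string; // a file path for writes, a command line for shell
}

interface HookResult {
  allowed: boolean;
  reason?: string;
}

type PreHook = (req: ToolRequest) => HookResult;

// Simplified glob support: "**/" spans directories, "**" matches anything, "*" stays in one segment.
function globToRegExp(glob: string): RegExp {
  let out = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.slice(i, i + 3) === "**/") {
      out += "(?:.*/)?";
      i += 3;
    } else if (glob.slice(i, i + 2) === "**") {
      out += ".*";
      i += 2;
    } else if (glob.charAt(i) === "*") {
      out += "[^/]*";
      i += 1;
    } else {
      out += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp("^" + out + "$");
}

function fileWriteGuard(allowedPaths: string[]): PreHook {
  const patterns = allowedPaths.map((glob) => globToRegExp(glob));
  return (req) => {
    if (req.tool !== "write") return { allowed: true };
    const ok = patterns.some((pattern) => pattern.test(req.target));
    return ok ? { allowed: true } : { allowed: false, reason: "write outside allowed paths: " + req.target };
  };
}

function commandGuard(blockedCommands: string[]): PreHook {
  return (req) => {
    if (req.tool !== "shell") return { allowed: true };
    const hits = blockedCommands.filter((cmd) => req.target.indexOf(cmd) !== -1);
    return hits.length === 0 ? { allowed: true } : { allowed: false, reason: "blocked command: " + hits[0] };
  };
}

// The first denial wins, and the tool never runs.
function runPreHooks(hooks: PreHook[], req: ToolRequest): HookResult {
  for (let i = 0; i < hooks.length; i++) {
    const result = hooks[i](req);
    if (!result.allowed) return result;
  }
  return { allowed: true };
}
```

The substring match in `commandGuard` is deliberately naive and would miss variations such as extra whitespace or reordered flags; a real guard would need to parse the command rather than search it.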

Reviewer lockout is the cleanest example:

const lockout = pipeline.getReviewerLockout();
lockout.lockout('src/auth.ts', 'Backend');

// Later, Backend tries to edit src/auth.ts.
// The pre-tool hook blocks before the edit runs.

This encodes a review decision into runtime state. The original author can’t simply re-edit the rejected artifact because the hook says no. A different agent or a human has to take over.
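The state behind a lockout can be very small: a map from artifact path to the agents barred from editing it. The class below is a guess at the shape of that idea, not the SDK's implementation; `ReviewerLockoutSketch` and `isLockedOut` are hypothetical, with only the `lockout(path, agent)` call modeled on the snippet above.

```typescript
// Hypothetical sketch: illustrates the idea, not the SDK's reviewer lockout.
class ReviewerLockoutSketch {
  // Prototype-free object so file paths never collide with inherited keys.
  private locked: Record<string, string[]> = Object.create(null);

  // Record that `agent` may no longer edit `path` after a rejected review.
  lockout(path: string, agent: string): void {
    const agents = this.locked[path] || [];
    if (agents.indexOf(agent) === -1) agents.push(agent);
    this.locked[path] = agents;
  }

  // A pre-tool hook would call this before allowing an edit to run.
  isLockedOut(path: string, agent: string): boolean {
    return (this.locked[path] || []).indexOf(agent) !== -1;
  }
}
```

Note that the lock is per path and per agent: a different agent picking up `src/auth.ts` is unaffected, which is exactly the handoff the rule is meant to force.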

That is the direction we want agent systems to move: more policies enforced at the boundary, fewer policies whispered into the prompt and hoped for.

Memory classes, or: Stop loading the junk drawer

Tamir has a line Brady wishes he had written:

The more your agent remembers, the less room it has to think.

That’s not a metaphor. It is a context budget problem.

Early Squad memory was too eager. Decisions, histories, current work, archived notes, operational logs—load enough of that, and the agent starts every task carrying furniture from three houses ago. It has more context and less signal.

The governed-memory work in PR #1145 made this explicit. Memory has classes and load guidance:

export type MemoryClass =
  | 'TRANSIENT'
  | 'LOCAL'
  | 'DECISION'
  | 'POLICY'
  | 'COPILOT_MEMORY'
  | 'FORBIDDEN';

export type MemoryLoadGuidance = 'ALWAYS' | 'ON-DEMAND' | 'ARCHIVE' | 'NEVER';

The architecture matters because compaction is lossy. If you summarize too little, every task drags stale context. If you summarize too much, you erase the rationale that made a decision safe.

The compromise isn’t one memory store. It’s a memory policy:

TRANSIENT        short-lived task state; expire aggressively
LOCAL            agent-scoped learning; load for that agent
DECISION         shared team judgment; preserve rationale
POLICY           hard operating rule; load broadly
COPILOT_MEMORY   host/runtime memory; bridge carefully
FORBIDDEN        never load; usually sensitive or irrelevant

ALWAYS           hot path; small and high signal
ON-DEMAND        searchable; load when task demands it
ARCHIVE          retained for audit/history, not context
NEVER            excluded from agent context

In the PR #1145 benchmark, governed memory cut agent context by roughly 55% (3,540 → 1,601 bytes) while keeping recall at 1.0. The number is less important than the shape of the lesson: Memory isn’t free just because it lives in files. Loading memory is a design decision.
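As a sketch of how load guidance might translate into an actual context-selection step, the code below filters entries by class and guidance and measures the result. The two type aliases are copied from the snippet above; `MemoryEntry`, the tag-matching rule for `ON-DEMAND`, and the helper functions are assumptions for illustration, not the code from PR #1145. The last helper reproduces the benchmark arithmetic: (3,540 − 1,601) / 3,540 is about 54.8%, which matches the "roughly 55%" figure.

```typescript
type MemoryClass = 'TRANSIENT' | 'LOCAL' | 'DECISION' | 'POLICY' | 'COPILOT_MEMORY' | 'FORBIDDEN';
type MemoryLoadGuidance = 'ALWAYS' | 'ON-DEMAND' | 'ARCHIVE' | 'NEVER';

// Hypothetical entry shape: tags are an assumed way to decide what a task "demands".
interface MemoryEntry {
  memoryClass: MemoryClass;
  guidance: MemoryLoadGuidance;
  tags: string[];
  text: string;
}

// Pick what goes into the agent's context for a task with the given tags.
function selectContext(entries: MemoryEntry[], taskTags: string[]): MemoryEntry[] {
  return entries.filter((entry) => {
    if (entry.memoryClass === 'FORBIDDEN') return false;
    if (entry.guidance === 'ALWAYS') return true;
    if (entry.guidance === 'ON-DEMAND') {
      return entry.tags.some((tag) => taskTags.indexOf(tag) !== -1);
    }
    return false; // ARCHIVE and NEVER stay out of context
  });
}

// Size in UTF-16 code units, which equals bytes only for ASCII text.
function contextSize(entries: MemoryEntry[]): number {
  return entries.reduce((sum, entry) => sum + entry.text.length, 0);
}

// Percentage reduction, rounded to one decimal place.
function reductionPercent(before: number, after: number): number {
  return Math.round(((before - after) / before) * 1000) / 10;
}
```

The interesting design question sits in the `ON-DEMAND` branch: whatever replaces the naive tag match decides how much recall the system keeps as the hot path shrinks.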

What still breaks

Role drift isn’t solved. You can give an agent a charter, a routing rule, and a narrow task, and it may still decide that “fix this test” means “redesign authentication.” Sometimes that’s initiative. Sometimes that’s nonsense with confidence.

The mitigations stack:

charter boundaries
  + routing rules
  + scoped tools
  + file-write guards
  + reviewer lockout
  + CI gates
  + human review

No single layer is enough. That is the pattern.

Parallelism is also not free. More agents means more throughput and more coordination pressure. You find hidden global state. You discover which scripts assume serial execution. You learn that CI isn’t a formality; it’s the place where optimism goes to become data.

Prompt saturation is real. Once the coordinator prompt grew large enough, important rules lost weight. The fix wasn’t more prose. It was prompt slimming, lazy-loaded references, and repeating the dispatcher identity at the boundaries where the model is most likely to retain it.

Memory compaction remains hard. The failure mode is subtle: The agent isn’t obviously broken. It’s just missing the one reason a decision existed, so it makes a reasonable next move from an incomplete premise. Those are the expensive bugs because they look thoughtful.

And yes, people get attached to agents. Names, roles, continuity, and history trigger social instincts. We like the human side of that. We also don’t want to confuse it with agency in the human sense. These are tools with goals, context, and behavioral continuity. They do not have inner lives. Trust should come from inspectable behavior, not personality.

What we would steal from this architecture

If you’re building agent infrastructure, we wouldn’t start by copying Squad wholesale. We would steal these patterns:

Disposable workers, durable artifacts. Let sessions die. Keep decisions, histories, traces, and outputs somewhere reviewable.
Decision logs as runtime input. Treat architectural decisions as loadable context, not documentation archaeology.
Drop-box writes for parallel agents. Don’t let every agent append to the canonical shared file. Give them individual write targets and merge intentionally.
Prompt rules for intent, hooks for enforcement. Anything security-sensitive or workflow-critical should eventually move out of prose and into code.
Memory classes. The question isn’t, “Should the agent remember this?” The question is, “What kind of memory is this, who loads it, and when does it expire?”
Routing as a first-class design surface. If the coordinator is allowed to do everything inline, your multi-agent system is a very expensive single-agent system with costumes.
Keep the human on the hook. The system can delegate, parallelize, and preserve context. It shouldn’t launder accountability.

These patterns aren’t engineering-specific because the substrate isn’t a codebase—it’s the repo. Swap the artifacts, and the seven still hold.

Squad isn’t only an engineering tool

Worth saying out loud, because the .ts code blocks above can mislead: Nothing in this architecture is engineering-specific. The substrate is the repo, not the codebase. Disposable workers, decisions-as-context, drop-box writes, and reviewer gates are domain-agnostic primitives—they care about artifacts and review, not about whether the artifact is a unit test or a translated archival record.

Tamir used the same scaffolding to run a Holocaust family-research project—agents coordinating archival lookups, translation passes between Yiddish, Polish, and Hebrew sources, and cross-corroboration of names across registries, with .squad/decisions.md acting as the working ledger of what had been established and what was still contested. No code was being shipped. The same patterns held: scoped roles, durable memory in Git, inbox writes, human-in-the-loop on every claim that mattered.

We’ve had the pleasure of working through a few other non-coding Squad scenarios. In one case, a sales team we support asked us to implement a “Sales Squad” and provided context and sales training documentation to help us do it. In another organization, a general manager of program and product managers created a “think tank” squad that does product-market fit research and suggests, on a daily basis, areas her team should investigate.

The bet underneath Squad is that this should be how a small group of humans—engineers, researchers, journalists, anyone who works with evidence—pulls coordinated work out of agents. Democratize the orchestration, not just the model access. Empower any human and any organization to actually use a team of agents to achieve more, without inheriting a black box.

Try it

The repository is here: github.com/bradygaster/squad.

The shortest path is the CLI plus Copilot. No SDK required.

npm install -g @bradygaster/squad-cli
squad init

Then open GitHub Copilot—CLI or VS Code, your call—and give the coordinator agent the shape of the project:

I'm starting a new project. Set up the team.
Here's what I'm building: a recipe sharing app with React and Node.

The coordinator writes .squad/. You review the diff. That’s it.

If you want to go deeper—programmatic team composition, custom routing rules, embedding Squad inside your own tooling—the SDK is the next layer:

npm install @bradygaster/squad-sdk

Start with a small repo. Commit .squad/. Inspect every diff. Let the agents write decisions. Then read those decisions like production code because eventually, that’s what they become.

If you build something useful, alarming, hilarious, or weird, open an issue. Tamir and I read them.

Stay a builder.