Where the agent decides, and where the tools actually run

“The price of reliability is the pursuit of the utmost simplicity.”

– C. A. R. Hoare, Turing Award lecture, 1980

The agent demos all look beautiful. You ask the friendly chatbot a question, it thinks for a moment, it gives you an answer. Sometimes the answer is even right.

Then someone says the thing that ruins the demo:

“Can we get it to actually do things—and trust it to do them?”

The moment “do things” is in scope, the architecture problem changes. Now the agent needs to run code. It needs files. It needs network access. It needs a credential to call the next model. It needs to remember what it did yesterday. The friendly little chat suddenly has a workspace, a shell, a token, and a very real ability to break something expensive.

This post outlines the architecture I want around that agent before I let it loose: a LangGraph factory that talks to a Squad coordinator for judgment, then dispatches the dangerous parts to two different Azure Container Apps primitives: one for one-shot work, one for stateful work. The whole thing proved itself end to end last week, which is why I’m writing it down now.

And this isn’t a hypothetical I built to have something to write about. The triggering event was a real team that turned up wanting to use Squad in production—and, interestingly, they were a Node.js shop. They had TypeScript. They had LangGraph. They had a package-lock.json that had clearly earned the right to be respected. What they did not have was the Microsoft Agent Framework, which is inconvenient for my C# heart. They wanted Squad inside their existing application, and the answer couldn’t be, “Please rewrite your product in C# first.”

So the question stopped being whether Squad is nice and became a harder one: Where exactly does a judgment step go inside an app that already has a deterministic state machine, a tool surface, a CI pipeline, and a product manager who would prefer the demo not catch fire? That’s a different question from the last piece Brady Gaster and I wrote for Command Line, which was about what survives an agent session—make the agents disposable, keep the memory in Git. This post is about where the agents go and where their tools actually run.

The shape I backed into has three layers: a brain that decides, two different pairs of hands that do the work, and a memory that carries the evidence from one step to the next. That is the first-order picture. The second-order detail—and it’s the one that actually matters—is that one of those two pairs of hands can hold a brain of its own.

The three-layer shape: a LangGraph orchestrating brain that decides the flow, an ACA tool plane that runs the dangerous and stateful work, and graph state that carries the evidence between them. The orchestrating brain stays deterministic and never holds a shell; when a model needs to think with one, that thinking is sealed inside a sandbox.

The three problems agents create the moment they get hands

A chat agent has none of these problems, which is why chat agents are easy while agents that do things are hard.

Problem one is non-determinism. A model is great when you want it to weigh tradeoffs in a design document. It’s terrible when you want it to decide whether step three of a workflow should happen before or after step four. Workflows are product decisions. The order of operations doesn’t need a creative reinterpretation on every run.

Problem two is dangerous code. The moment the agent can run a shell, it can also run the wrong shell. It can wipe the wrong directory. It can pip-install a package it found on a sketchy index. It can pull a token from an environment variable and quietly post it somewhere that isn’t yours. None of this is malice. It is what happens when a probabilistic process gets a deterministic side effect.

Problem three is state across steps. A useful agent for non-trivial work needs a workspace. It checks out a repo, installs a toolchain, opens files, runs an analysis. The result of step one is the input to step three. If the workspace dies with the call, nothing accumulates. If it survives across calls, you have a different set of problems—but at least the right shape for the work.

Three problems. They don’t all want the same solution. The trick is to give each one its own.

The brain: A deterministic graph with one judgment node

LangGraph is the brain in this design because it is deterministic where I want determinism. It decides what runs, in what order, with what state, and what happens if a node fails. It does not invent steps. It does not improvise the workflow on each run. It is boring in exactly the right way.

Everything below runs on a sample I keep calling the factory, so it’s worth 30 seconds on what it models. The use case is an internal software factory: the shared platform team a large enterprise stands up so its business groups don’t each invent their own stack from scratch. A group shows up with an idea and a rough set of requirements. The factory reviews them against the organization’s approved technologies and best practices, rewrites the parts that don’t comply, folds in the operational signals the team supplied, and hands back a tech design the group can actually build from. It is small enough to read end to end and real enough to exercise every layer in this post.

The factory sample has seven nodes in a straight line: intake normalization, a standalone reviewer agent, deterministic stack fixes, the Squad design step, the Dynamic Sessions signal-analysis step, the ACA Sandbox workspace step, and a final assembler that produces one markdown design document.

One run through the seven-node factory graph, from intake to the final design document. Six nodes are plain TypeScript or a single bounded AI call; the one in the middle—squadTechDesign—is where judgment is allowed to drive.

Six of those nodes are uncontroversial—plain TypeScript or a single bounded SDK call. The interesting one is the design step.

Wiring the seven nodes is the boring half. LangGraph’s StateGraph takes a typed annotation and a list of addNode / addEdge calls and gives you back a compiled, synchronous-looking graph the rest of the app can invoke. The graph below is the whole orchestration layer; there is no other dispatcher anywhere in the codebase.

// src/graph.ts  (squad-langgraph-aca:wip) 
return new StateGraph(FactoryStateAnnotation) 
  .addNode("intakeNormalize", intakeNormalize) 
  .addNode("reviewerAgent", reviewerAgent) 
  .addNode("applyApprovedStackFixes", applyApprovedStackFixes) 
  .addNode("squadTechDesign", squadTechDesign) 
  .addNode("runDynamicSession", runDynamicSessionNode) 
  .addNode("runSandboxWorkspace", runSandboxWorkspaceNode) 
  .addNode("assembleDesign", assembleDesign) 
  .addEdge(START, "intakeNormalize") 
  // ... linear edges in the same order ... 
  .addEdge("runSandboxWorkspace", "assembleDesign") 
  .addEdge("assembleDesign", END) 
  .compile();

The state every node reads and writes is one typed shape. Each node’s return value is a partial update—LangGraph merges it into the running state for the next node to read. No globals. No shared mutable singletons. The whole “memory between steps” story lives in one annotation declaration:

// src/graph.ts  (squad-langgraph-aca:wip) 
const FactoryStateAnnotation = Annotation.Root({ 
  request:           Annotation<FactoryRequest>(), 
  normalizedRequest: Annotation<NormalizedRequest | undefined>(), 
  dispatches:        Annotation<DispatchRecord[]>(), 
  findings:          Annotation<Finding[]>(), 
  designSections:    Annotation<DesignSection[]>(), 
  signalArtifact:    Annotation<DynamicSessionArtifact | undefined>(), 
  reviewArtifact:    Annotation<SandboxReviewArtifact | undefined>(), 
  finalDesign:       Annotation<string | undefined>() 
});
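To make the "partial update" idea concrete, here is a dependency-free sketch of the merge semantics. It is not LangGraph's implementation; it only mirrors the last-write-wins behaviour a channel has when no reducer is supplied. The trimmed state shape, the `runLinear` helper, and the node bodies are hypothetical, written for illustration.

```typescript
// Dependency-free sketch of last-write-wins state merging in a linear graph.
// NOT LangGraph's implementation; it only mirrors the semantics described above.

// Hypothetical, trimmed-down state shape (the sample's real state has more keys).
type FactoryState = {
  request: string;
  normalizedRequest?: string;
  findings: string[];
  finalDesign?: string;
};

// A node reads the whole state and returns only the keys it wants to change.
type GraphNode = (state: FactoryState) => Partial<FactoryState>;

function runLinear(initial: FactoryState, nodes: GraphNode[]): FactoryState {
  let state = initial;
  for (const node of nodes) {
    // Shallow merge: keys in the update replace old values; other keys carry over.
    state = { ...state, ...node(state) };
  }
  return state;
}

// Hypothetical node bodies, named after the sample's nodes but not copied from it.
const intakeNormalize: GraphNode = (s) => ({ normalizedRequest: s.request.trim() });

const reviewerAgent: GraphNode = (s) => ({
  // With last-write-wins, "appending" means returning the full new array.
  findings: [...s.findings, "Audit requested but no retention period given."],
});

const assembleDesign: GraphNode = (s) => ({
  finalDesign: `# Design for ${s.normalizedRequest}\n- ${s.findings.join("\n- ")}`,
});

const finalState = runLinear(
  { request: "  regional intake app  ", findings: [] },
  [intakeNormalize, reviewerAgent, assembleDesign],
);
```

The design consequence worth noticing: a node that wants to add to an array has to return the whole new array, because nothing in this model accumulates for it.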

The design step, squadTechDesign, is the node where the brain hands the current graph state to a Copilot SDK session, registers a custom agent named squad, lets that agent use the repo-local team context as its working memory, and waits for a typed dispatch record. The internal Squad members never appear in the LangGraph state. The brain sees one public custom agent and one structured result. The complexity stays behind that one door.

The Copilot SDK call that opens that one door looks like this. The load-bearing lines are customAgents, agent: squadAgentName, and the structured sendAndWait return—the SDK gives you a way to register a named custom agent, pre-select it for the turn, and read back its assistant message as typed data. The brain never sees the agent’s internal reasoning, only its declared output:

// src/squad/copilotSdkCustomAgents.ts  (squad-langgraph-factory) 
const client = options.clientFactory?.(clientOptions) ?? new CopilotClient(clientOptions); 
await client.start(); 
 
const customAgents = await createCustomAgents(repoRoot, input); 
const tools        = createFactoryTools(captured); 
 
session = await client.createSession({ 
  clientName: "squad-langgraph-factory", 
  model: options.model ?? "claude-sonnet-4.5", 
  workingDirectory: repoRoot, 
  tools, 
  availableTools: createAvailableTools(), 
  customAgents, 
  agent: squadAgentName,           // HERE: "squad" is the only public-facing agent 
  onPermissionRequest: approveAll, // tools are deterministic, so blanket-approve is safe 
  // ...systemMessage, skipCustomInstructions, includeSubAgentStreamingEvents... 
}); 
 
if (session.rpc?.agent?.select) { 
  await session.rpc.agent.select({ name: squadAgentName });   // belt + braces: pre-select 
} 
 
finalResponse = await session.sendAndWait( 
  { prompt: createNodePrompt(input) }, 
  options.timeoutMs ?? 120_000 
); 
return buildDispatchRecord(input, captured, finalResponse);

The custom agent itself is one entry—a name, a prompt loaded from .github/agents/squad.agent.md, and a hard-coded tool allowlist. Squad’s internal members and routing files (.squad/team.md, .squad/routing.md, .squad/agents/*) are passed in as internal context to that one agent; they are never registered as SDK agents the graph could select on its own:

// src/squad/copilotSdkCustomAgents.ts  (squad-langgraph-factory) — createCustomAgents 
return [{ 
  name: squadAgentName,                       // "squad" 
  displayName: "Squad Coordinator", 
  description: "...returns public-safe typed technical design output.", 
  tools: squadCoordinatorToolAllowlist,       // readApprovedStack, validateTechnology, recordFinding, draftSection 
  infer: true, 
  prompt: [coordinatorPrompt.trim(), "...", "Current graph-node context:", stateContext].join("\n") 
}];

And the thing the brain ultimately reads back from that judgment node is a typed DispatchRecord—a contract Squad can fill but can’t widen. Findings and sections come back as plain data the next node can iterate over:

// src/types.ts  (squad-langgraph-factory) 
export type DispatchRecord = { 
  member: SquadMemberName;           // "squad" for the LangGraph-facing seam 
  objective: string; 
  allowedTools: MemberToolName[]; 
  findings: Finding[]; 
  sections: DesignSection[]; 
};
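A TypeScript type only constrains code at compile time, so "can fill but can't widen" also needs a runtime check on whatever the model returns. The sketch below shows one way that check might look. The `Finding` and `DesignSection` shapes, the `parseDispatchRecord` name, and the choice to pin `member` to "squad" are assumptions for illustration; only the four tool names come from the allowlist shown earlier.

```typescript
// Sketch of a runtime gate for the dispatch contract. Shapes marked "hypothetical"
// are assumptions; the sample's real types are not shown in this post.
type MemberToolName = "readApprovedStack" | "validateTechnology" | "recordFinding" | "draftSection";
type Finding = { severity: string; message: string };   // hypothetical shape
type DesignSection = { title: string; body: string };   // hypothetical shape
type DispatchRecord = {
  member: "squad";                                       // assumption: only the public seam is accepted
  objective: string;
  allowedTools: MemberToolName[];
  findings: Finding[];
  sections: DesignSection[];
};

const toolAllowlist: readonly string[] = ["readApprovedStack", "validateTechnology", "recordFinding", "draftSection"];
const recordKeys: readonly string[] = ["member", "objective", "allowedTools", "findings", "sections"];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}
function isFinding(value: unknown): value is Finding {
  return isRecord(value) && typeof value.severity === "string" && typeof value.message === "string";
}
function isSection(value: unknown): value is DesignSection {
  return isRecord(value) && typeof value.title === "string" && typeof value.body === "string";
}
function isAllowedTool(value: unknown): value is MemberToolName {
  return typeof value === "string" && toolAllowlist.includes(value);
}

function parseDispatchRecord(raw: unknown): DispatchRecord {
  if (!isRecord(raw)) throw new Error("Dispatch record must be an object.");
  // Reject widening: any key outside the contract fails the whole record.
  const extraKeys = Object.keys(raw).filter((key) => !recordKeys.includes(key));
  if (extraKeys.length > 0) throw new Error(`Unexpected keys: ${extraKeys.join(", ")}`);

  const { member, objective, allowedTools, findings, sections } = raw;
  if (member !== "squad") throw new Error("member must be 'squad'.");
  if (typeof objective !== "string") throw new Error("objective must be a string.");
  if (!Array.isArray(allowedTools) || !allowedTools.every(isAllowedTool)) {
    throw new Error("allowedTools contains a tool outside the allowlist.");
  }
  if (!Array.isArray(findings) || !findings.every(isFinding)) throw new Error("findings are malformed.");
  if (!Array.isArray(sections) || !sections.every(isSection)) throw new Error("sections are malformed.");
  return { member: "squad", objective, allowedTools, findings, sections };
}
```

A record carrying an extra key such as `shellCommand`, or a tool named `runShell`, is rejected outright rather than silently trimmed, which keeps the failure visible in the graph.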

That is the first architectural rule: Judgment is the only thing the brain delegates to a model. Everything else is code.

The reviewer step is a single-purpose SDK custom agent that reads messy requirements and writes findings into state. The deterministic substitution step (MySQL becomes Azure SQL, Auth0 becomes Microsoft Entra ID) is not an agent at all. It’s a lookup table, the moral equivalent of a switch statement. We don’t need a probabilistic process to pretend to be a switch statement.

// src/graph.ts  (squad-langgraph-aca:wip) — applyApprovedStackFixes 
const fixes: ApprovedStackFix[] = []; 
const requestedTechnologies = source.requestedTechnologies.map((technology) => { 
  const replacement = replacementMap[technology.toLowerCase()]; 
  if (!replacement) return technology; 
  fixes.push({ from: technology, to: replacement, reason: `${technology} is outside the approved sample stack.` }); 
  return replacement; 
});
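For readers who want to run the substitution step in isolation, here is a self-contained sketch of the same idea. The two map entries are the substitutions named in the text; the function signature and return shape are assumptions. It also swaps the plain-object lookup for a `Map`, so a requested technology literally named "constructor" cannot resolve to an inherited object property.

```typescript
// Self-contained sketch of the deterministic stack-fix step.
type ApprovedStackFix = { from: string; to: string; reason: string };

// Assumed contents: only the two substitutions the post names.
const replacementMap = new Map<string, string>([
  ["mysql", "Azure SQL"],
  ["auth0", "Microsoft Entra ID"],
]);

function applyApprovedStackFixes(
  requestedTechnologies: string[],
): { requestedTechnologies: string[]; fixes: ApprovedStackFix[] } {
  const fixes: ApprovedStackFix[] = [];
  const rewritten = requestedTechnologies.map((technology) => {
    // Case-insensitive lookup; anything not in the map passes through untouched.
    const replacement = replacementMap.get(technology.trim().toLowerCase());
    if (replacement === undefined) return technology;
    fixes.push({
      from: technology,
      to: replacement,
      reason: `${technology} is outside the approved sample stack.`,
    });
    return replacement;
  });
  return { requestedTechnologies: rewritten, fixes };
}

const stackResult = applyApprovedStackFixes(["Power Apps", "MySQL", "Auth0", "Power Automate"]);
// stackResult.requestedTechnologies -> ["Power Apps", "Azure SQL", "Microsoft Entra ID", "Power Automate"]
```

The `fixes` array is what lets the final document show its work: every substitution carries the original name and a reason.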

This is the brain: a state machine with a known shape, one judgment node, and typed outputs the next nodes can read.

The first pair of hands: ACA Dynamic Sessions

The Squad node produces design sections. The next thing the design needs is a signal analysis on the operational signals the team supplied. That work is small, stateless, and isolated by definition.

ACA Dynamic Sessions is the right primitive for that lane.

Think of Dynamic Sessions as a pool of pre-warmed containers the platform spins up on demand. You call a run endpoint, attach an opaque high-entropy identifier, and the platform routes the call to a fresh container. When the call finishes, the container is torn down. There is no “yesterday” in this lane, and no shared file system across runs.

The whole client is a single POST with an Entra-issued bearer token and an opaque identifier the platform uses as the routing key. The identifier is generated per run and has no meaningful content—the pool uses it to map calls to containers; the app never reuses it:

// src/dynamicSessions/AzureDynamicSessionsClient.ts  (squad-langgraph-aca:wip) — Run() 
const token = await this.Credential.getToken(DynamicSessionsScope); 
const url   = this.BuildSessionUrl("/run");     // appends ?identifier=<opaque> 
const response = await this.FetchFn(url, { 
  method: "POST", 
  headers: { 
    authorization: `Bearer ${token.token}`, 
    "content-type": "application/json", 
  }, 
  body: JSON.stringify({ 
    sessionId: request.SessionId, 
    task: { kind: "analyzeSignals", input: { /* tenantId, product, region, signals */ } }, 
  }), 
});
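The request itself can be expressed as a pure function, which makes it easy to test without a network. This is a sketch under stated assumptions: `buildRunRequest` is a hypothetical helper, the pool endpoint is a placeholder, and the sketch reuses the routing identifier as `sessionId`, which the sample may or may not do. A real caller would generate the identifier with a cryptographic source such as `randomUUID()` from `node:crypto` and would obtain the token from its credential object.

```typescript
// Sketch: build the Dynamic Sessions run request as plain data (no network call).
type SignalTask = {
  kind: "analyzeSignals";
  input: { tenantId: string; product: string; region: string; signals: string[] };
};

type RunRequest = {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
};

function buildRunRequest(
  poolEndpoint: string,   // placeholder for the pool's management endpoint
  identifier: string,     // opaque, per-run, never reused
  bearerToken: string,    // Entra-issued token, acquired elsewhere
  task: SignalTask,
): RunRequest {
  const base = poolEndpoint.replace(/\/+$/, ""); // tolerate a trailing slash
  return {
    // The identifier travels as a query parameter and acts as the routing key.
    url: `${base}/run?identifier=${encodeURIComponent(identifier)}`,
    init: {
      method: "POST",
      headers: {
        authorization: `Bearer ${bearerToken}`,
        "content-type": "application/json",
      },
      // Assumption: the same opaque value is sent as sessionId in the body.
      body: JSON.stringify({ sessionId: identifier, task }),
    },
  };
}
```

Keeping the builder free of I/O means the only impure line left in the client is the `fetch` call that consumes `url` and `init`.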

The sample wires Dynamic Sessions to its own custom container, not the generic Python sandbox. The container exposes two routes: a health check and a run endpoint. The body cap is 64 KB. The handler doesn’t call a shell. It doesn’t call a model. It runs one deterministic function that scores the input against a small keyword map and returns a typed artifact. The pool runs with egress disabled at the platform level—even if the worker decided to phone home, the call would not resolve.

// worker/src/server.ts  (squad-aca-dynamic-sessions) 
const maxRequestBytes = 64 * 1024; 
 
const server = createServer(async (req, res) => { 
  if (req.method === "GET" && req.url === "/health") { 
    return sendJson(res, 200, { ok: true, service: "squad-aca-dynamic-sessions-worker" }); 
  } 
  if (req.method === "POST" && req.url?.startsWith("/run")) { 
    const body = await readJson<SandboxRunRequest>(req);              // 64 KB cap enforced in readJson 
    return sendJson(res, 200, executeDeterministicTask(body, "worker")); // no shell, no model 
  } 
  sendJson(res, 404, { ok: false, error: "Not found. Use GET /health or POST /run." }); 
});
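The deterministic function behind `/run` is not shown here, so the following is a sketch of what keyword-map scoring could look like rather than the worker's actual code. The categories, keywords, weights, thresholds, and the `analyzeSignals` name are all hypothetical; the artifact fields (risk score, findings by category, recommended next step) follow the description of the signal artifact later in this post.

```typescript
// Sketch of a deterministic keyword-map scorer. Same input always yields the same artifact.
type SignalCategory = "performance" | "reliability" | "security";

type SignalArtifact = {
  riskScore: number;
  findingsByCategory: Record<SignalCategory, string[]>;
  recommendedNextStep: string;
};

// Hypothetical keyword map and weights, chosen only for illustration.
const keywordMap: Record<SignalCategory, { keywords: string[]; weight: number }> = {
  performance: { keywords: ["timeout", "slow", "latency"], weight: 2 },
  reliability: { keywords: ["outage", "crash", "retry"], weight: 3 },
  security: { keywords: ["password", "token", "unauthorized"], weight: 5 },
};

function analyzeSignals(signals: string[]): SignalArtifact {
  const findingsByCategory: Record<SignalCategory, string[]> = {
    performance: [],
    reliability: [],
    security: [],
  };
  let riskScore = 0;
  for (const signal of signals) {
    const text = signal.toLowerCase();
    for (const category of Object.keys(keywordMap) as SignalCategory[]) {
      const { keywords, weight } = keywordMap[category];
      // A signal counts at most once per category, however many keywords it hits.
      if (keywords.some((keyword) => text.includes(keyword))) {
        findingsByCategory[category].push(signal);
        riskScore += weight;
      }
    }
  }
  // Hypothetical thresholds for the recommendation.
  const recommendedNextStep =
    riskScore >= 5 ? "Escalate to the platform team."
    : riskScore > 0 ? "Investigate before design sign-off."
    : "No action needed.";
  return { riskScore, findingsByCategory, recommendedNextStep };
}
```

No clock, no randomness, no I/O: that is what makes it safe to run the same function in-process for local development and in the pool for Azure.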

The egress-disabled story is one flag on the pool itself. The whole “even if the worker decided to phone home” guarantee comes from that one line of provisioning—the worker container has no route to anywhere outside the pool:

# infra/create-session-pool.sh  (squad-langgraph-aca:wip)
# The load-bearing flag is --network-status EgressDisabled: no outbound network from the pool.
az containerapp sessionpool create \
  --name "<SESSION_POOL_NAME>" \
  --container-type CustomContainer \
  --image "<ACR_LOGIN_SERVER>/squad-aca-dynamic-sessions-worker:<TAG>" \
  --cpu 0.25 --memory 0.5Gi --target-port 8080 \
  --network-status EgressDisabled \
  --max-sessions 10 --ready-sessions 1

That’s more constrained than people expect when they hear “Dynamic Sessions.” The pool can run a model-driven shell if you want one. The sample explicitly does not, because the model-driven judgment already happens in the Squad node. The Dynamic Sessions lane is the place I want code, not judgment.

The local development mode runs the same function in-process—same code, different transport—so the local demo produces an identical artifact to the Azure demo. The pool is not a stub. It’s the same logic running far from your laptop.

Dynamic Sessions, in one line: stateless, one-shot, deterministic. Perfect for work that should never come back.

The second pair of hands—that can hold a brain: ACA Sandboxes

The other ACA primitive is doing a different job, and it took me a minute to internalize that it isn’t just “Dynamic Sessions, but bigger.”

ACA Sandboxes give you a persistent microVM. Real file system. Real process tree. Whatever toolchain the disk image includes—the GitHub CLI, npm, the Copilot CLI, whatever you bake in. You can suspend it. You can resume it. The state survives across calls, which is exactly what an agent that wants a workspace needs.

The sample treats that microVM as a workspace and exposes a TypeScript wrapper with only three verbs. Execute one named command. Capture a snapshot. Suspend. There is no run-shell. There is no write-file. There is no fetch-URL. The interface is deliberately too narrow to let the graph invent a new shell command at runtime.

// src/sandboxes/AzureSandboxWorkspace.ts  (squad-langgraph-aca:wip) 
export class AzureSandboxWorkspace implements SandboxWorkspace { 
  ExecCommand(commandId: string): Promise<SandboxExecResult>;  // pick a catalog id; never build shell 
  CaptureSnapshot(): Promise<string | null>; 
  Suspend(): Promise<void>; 
}

The body of ExecCommand is the boundary. It resolves the id through GetCommand, shells out via the aca CLI with the catalog’s pre-built shell string, and runs the captured stdout/stderr through a redactor before anything leaves the wrapper. The graph never sees raw subprocess output:

// src/sandboxes/AzureSandboxWorkspace.ts  (squad-langgraph-aca:wip) 
async ExecCommand(commandId: string): Promise<SandboxExecResult> { 
  const command = GetCommand(commandId);          // throws on unknown id (allowlist) 
  const argv = [ 
    "sandbox", "exec", 
    "--group", this.SandboxGroup, "--id", this.SandboxId, 
    "--command", command.Shell,                   // pre-built shell from the catalog 
  ]; 
  const result = await this.RunAca(argv); 
  return { 
    CommandId: commandId, 
    Stdout: RedactText(result.Stdout),            // redact Bearer / gh_* / JWT-like before egress 
    Stderr: RedactText(result.Stderr), 
    ExitCode: result.ExitCode, 
  }; 
}
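`RedactText` is referenced above but not shown, so here is a minimal sketch of what such a redactor might contain. The rule list, the replacement strings, and the `redactText` name are assumptions; the patterns cover the three families the code comment mentions (Bearer headers, GitHub token prefixes, JWT-like strings). Pattern-based redaction is best-effort by nature and should be treated as a backstop, not as the boundary.

```typescript
// Sketch of a pattern-based redactor. Best-effort only: it catches known token
// shapes and will miss secrets that do not match any rule.
const redactionRules: Array<{ pattern: RegExp; replacement: string }> = [
  // "Bearer <token>" in headers or log lines.
  { pattern: /Bearer\s+[A-Za-z0-9._~+\/=-]+/gi, replacement: "Bearer [REDACTED]" },
  // GitHub tokens: ghp_, gho_, ghu_, ghs_, ghr_ prefixes and fine-grained github_pat_ tokens.
  {
    pattern: /\b(?:gh[pousr]_[A-Za-z0-9]{36,}|github_pat_[A-Za-z0-9_]{22,})\b/g,
    replacement: "[REDACTED_GITHUB_TOKEN]",
  },
  // JWT-like: three base64url segments, the first starting with "eyJ".
  { pattern: /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/g, replacement: "[REDACTED_JWT]" },
];

function redactText(text: string): string {
  // Apply every rule in order; later rules see the output of earlier ones.
  return redactionRules.reduce(
    (current, rule) => current.replace(rule.pattern, rule.replacement),
    text,
  );
}
```

Running both stdout and stderr through the same function matters, because CLIs tend to print credentials in error messages more often than in normal output.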

The “named command” part is where the safety story lives. The sandbox lane ships with a command catalog of five entries: prepare workspace, analyze workspace, read artifact, inspect toolchain, and one I will come back to in a minute called the Copilot prompt proof. Each entry has a stable id and a fixed shell string. The graph picks an id; the wrapper resolves it and shells out. If the id isn’t in the catalog, the lookup throws with the allowlist in the error message. No fuzzy match. No fallback shell. No “execute anyway.”

// src/sandboxes/commandCatalog.ts  (squad-langgraph-aca:wip) 
export const Commands: Record<string, SandboxCommand> = { 
  prepare_workspace:    { CommandId: "prepare_workspace",    Description: "Create deterministic workspace input files.",         Shell: BuildPrepareWorkspaceShell() }, 
  analyze_workspace:    { CommandId: "analyze_workspace",    Description: "Run a deterministic analysis; write review artifact.", Shell: BuildAnalyzeWorkspaceShell() }, 
  read_artifact:        { CommandId: "read_artifact",        Description: "Read the generated review artifact.",                  Shell: BuildReadArtifactShell() }, 
  inspect_toolchain:    { CommandId: "inspect_toolchain",    Description: "Inspect copilot/gh/node/npm/squad versions.",          Shell: BuildInspectToolchainShell() }, 
  copilot_prompt_proof: { CommandId: "copilot_prompt_proof", Description: "Phase 3 acceptance: real Copilot prompt inside the sandbox using the sandbox-group credential.", Shell: BuildCopilotPromptProofShell() }, 
}; 
 
export function GetCommand(commandId: string): SandboxCommand { 
  const command = Commands[commandId]; 
  if (!command) { 
    const allowed = Object.keys(Commands).sort().join(", "); 
    throw new Error(`Unsupported sandbox command '${commandId}'. Allowed: ${allowed}`); 
  } 
  return command; 
}

That is the most important rule of the sandbox lane: The orchestrating brain picks command ids; it never builds shell. Every shell string that runs inside the sandbox was authored by a human, lives in source control, and can be diffed in a pull request. The probabilistic process selects from a menu. The menu doesn’t change at runtime.

Adding a capability means adding a new entry, not teaching an existing one a new trick. In return, you get an audit log that is actually useful: which ids ran, in what order, with what exit codes. The driver stops on the first non-zero exit (inspect_toolchain is the one step the loop lets fail), runs the suspend in a try/finally so a thrown exception still releases the billable compute, and pushes captured stdout and stderr through a redactor before they ever leave the wrapper. Nothing leaks across the boundary through process state or scratch files.

// src/sandboxes/runSandboxWorkspaceNode.ts  (squad-langgraph-aca:wip) 
const results: SandboxExecResult[] = []; 
let reviewBody = ""; 
try { 
  for (const commandId of plan) {                  // plan from DefaultCommandPlan() 
    const result = await workspace.ExecCommand(commandId); 
    results.push(result); 
    if (commandId === "read_artifact" && result.ExitCode === 0) reviewBody = result.Stdout; 
    if (result.ExitCode !== 0 && commandId !== "inspect_toolchain") break;   // stop-on-nonzero 
  } 
} finally { 
  if (options.SuspendAtEnd ?? mode === "azure") await workspace.Suspend();   // always release compute 
}
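The two guarantees of that loop, stop on a failing step and always suspend, are easy to exercise against an in-memory fake. The sketch below is a simplified stand-in, not the sample's driver: `FakeWorkspace` and `runPlan` are hypothetical names, the interface is trimmed to two verbs, and the loop stops on any non-zero exit without the `inspect_toolchain` exception.

```typescript
// Sketch: stop-on-nonzero plus always-suspend, exercised against an in-memory fake.
type SandboxExecResult = { CommandId: string; Stdout: string; Stderr: string; ExitCode: number };

interface SandboxWorkspace {
  ExecCommand(commandId: string): Promise<SandboxExecResult>;
  Suspend(): Promise<void>;
}

// Hypothetical fake: returns configured exit codes and can throw on one command id.
class FakeWorkspace implements SandboxWorkspace {
  Executed: string[] = [];
  SuspendCount = 0;
  private ExitCodes: Map<string, number>;
  private ThrowOn: string | null;

  constructor(exitCodes: Map<string, number>, throwOn: string | null = null) {
    this.ExitCodes = exitCodes;
    this.ThrowOn = throwOn;
  }

  async ExecCommand(commandId: string): Promise<SandboxExecResult> {
    this.Executed.push(commandId);
    if (commandId === this.ThrowOn) throw new Error(`simulated failure in ${commandId}`);
    return { CommandId: commandId, Stdout: "", Stderr: "", ExitCode: this.ExitCodes.get(commandId) ?? 0 };
  }

  async Suspend(): Promise<void> {
    this.SuspendCount += 1;
  }
}

async function runPlan(workspace: SandboxWorkspace, plan: string[]): Promise<SandboxExecResult[]> {
  const results: SandboxExecResult[] = [];
  try {
    for (const commandId of plan) {
      const result = await workspace.ExecCommand(commandId);
      results.push(result);
      if (result.ExitCode !== 0) break; // later steps never run after a failure
    }
  } finally {
    // Runs on normal completion, on break, and when ExecCommand throws.
    await workspace.Suspend();
  }
  return results;
}
```

With a fake where `analyze_workspace` exits 1, a three-step plan executes two commands and suspends once; with a fake that throws on the first command, the promise rejects and the suspend still happens.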

And here is where the brain-and-hands metaphor would mislead you if you took it too literally. A sandbox is not a mindless pair of hands. One of those catalog commands can start a real Copilot session inside the microVM—a Squad agent that reads, reasons, and decides, holding a shell and a credential and a workspace, the exact things I spent the first half of this post keeping away from the orchestrator. That is not a contradiction. It is the whole point. The orchestrating brain stays deterministic and shell-free; and when you genuinely need a model to think with dangerous capabilities, you don’t hand them to the control plane—you seal that thinking inside a sandbox, where a second brain can think and act in a room locked from the outside. The hands can hold a brain. It just has to be a contained one.

Two pairs of hands, different shapes, different jobs

Both of these are isolated compute environments. Both are managed by ACA. Both have egress controls. They are not interchangeable.

	Dynamic Sessions	ACA Sandboxes
Lifecycle	One-shot. Torn down after the call.	Persistent microVM. Suspend, resume.
State across calls	None. Each call is fresh.	File system, processes, toolchain survive.
What the orchestrator sends	A typed JSON body.	A catalog command id.
What runs inside	One deterministic function.	A pre-built shell from the catalog—sometimes a whole agent.
Can a brain think inside?	No. Pure execution.	Yes—a sealed, contained one.
Right job	Stateless deterministic work.	Long-lived workspace. Installed CLIs. Agent prompts that need credentials.

Picking the right one for each lane is most of the architecture. The factory sample uses both because the workflow has both kinds of work in it. The signal analysis is one-shot. The workspace review is multi-step. Forcing them into the same primitive would either drag stateless work into a persistent sandbox (and pay for compute you don’t need) or drag stateful work into a one-shot container (and lose the artifacts the next call needs to see).

Use the right pair of hands for the job.

What a single run actually looks like

A user gives the factory a request. The team is Contoso Field Apps. The goal is a regional intake app with an approval workflow. The proposed stack is Power Apps, MySQL, Auth0, Power Automate. There are constraints—must use approved identity, must produce an auditable design—and operational signals that read like things an actual ops team would say. “Save button timeouts spike on Friday afternoons” is in there, because somebody’s Friday afternoons are always like that.

The graph runs.

Intake normalization tidies up whitespace. The reviewer agent reads the request and writes findings into state—it notices “audit” was requested but no retention period was specified. A deterministic step rewrites the stack list: MySQL becomes Azure SQL, Auth0 becomes Microsoft Entra ID. The fixes get appended to state so the design can show its work later.

Now the brain hits the judgment node. The Squad design step opens a Copilot SDK session, registers the squad custom agent with its bounded tool allowlist, pre-selects it for the turn, and sends the current state as context. Squad reads its team context, decides which sections to draft, calls the deterministic tools, and returns a typed dispatch record.

Now the tool plane. The Dynamic Sessions node acquires an Entra token, calls the worker pool, and the worker scores the signals against the keyword map. The pool returns a typed signal artifact: risk score, findings by category, recommended next step.

The Sandbox node does the longer story. It resolves the sandbox group and id, then walks the catalog: prepare the workspace, analyze it, read the artifact back. Each step shells out with a pre-built shell. The captured stdout is redacted on the way out. The driver suspends the sandbox in a finally.

Finally the assembler runs. It reads everything from state and produces a five-section markdown document with the stack substitutions, the Squad-drafted sections, and both ACA artifacts as evidence the work was actually done. Every claim has a node that produced it and a typed value behind it.

That’s a normal run. The boring kind. The only kind I want from anything that talks to production.

The punchline: Where the credential actually lives

The hardest problem in agent execution isn’t, “Where do we run shell?” The container model solves that. The hardest problem is, “How does the agent inside the sandbox prove it has an identity to call out to a model with, without that identity ending up somewhere your application can read it?”

There are three approaches that look reasonable and are wrong.

You can bake the token into the disk image. Build the image with a Copilot credential directory already populated, push it to your registry, point the sandbox at it. It works. Until the token rotates. Until someone pulls the image from the registry cache and grabs the layer with the credential in it. The token’s blast radius is now whoever has read access to the registry, which is almost always a much bigger set than whoever should hold the credential.

You can copy the Copilot credential from the host at provisioning. Cleaner. The image stays neutral. But now the token sits on the sandbox’s disk, indistinguishable from any other workspace file. It’s in snapshots. It’s in memory dumps. A misconfigured catalog command that lists files puts the path in stdout, which the redactor catches sometimes and misses other times. The credential lives in user-visible state, and user-visible state has a hundred ways to leak.

You can pass it via an environment variable. Lowest effort. Highest risk. Environment variables are inherited by every child process. They show up in process introspection. They survive in coredumps. The moment a tool dumps the environment for debugging, the credential lands in the artifact that goes back to graph state.

All three share the same flaw: The credential is in the application’s data plane. Whatever read access the application has, the credential effectively has. That is the model ACA Sandboxes is designed to break.

The right answer is to lift the credential one layer up—onto the sandbox group itself, which is its own Azure resource with its own role assignments and its own audit trail.

The provisioning recipe is three commands. Create the sandbox group. Attach a GitHub Copilot credential to it—the platform takes the secret material and gives you back an opaque credential id. Create a sandbox in the group, bind it to that credential id, and set egress to default-deny with three explicit allowances: github.com, api.github.com, and the Copilot API wildcard. The token never lands in the disk image. It never lands on the host filesystem. It never lands in an environment variable. The platform injects it at provisioning time through a path the application cannot enumerate. Rotation is a control-plane operation against the group; existing sandboxes pick up the new credential without redeployment.

# infra/create-sandbox-group.sh  (squad-langgraph-aca:wip) 
# 1. Create the sandbox group. 
aca sandboxgroup create --name "${GROUP}" --resource-group "${RG}" --region "${REGION}" 
 
# 2. Attach a credential. It lives on the group, not the disk image — so it is rotatable. 
aca sandboxgroup credential create --group "${GROUP}" --resource-group "${RG}" --type github-copilot 
 
# 3. Create a sandbox bound to the credential, default-deny egress, allowlist only what the model needs. 
aca sandbox create \ 
  --group "${GROUP}" --resource-group "${RG}" --disk copilot \ 
  --credential "${GITHUB_COPILOT_CREDENTIAL_ID}" \ 
  --egress-default Deny \ 
  --egress-rule "github.com:Allow" \ 
  --egress-rule "api.github.com:Allow" \ 
  --egress-rule "*.githubcopilot.com:Allow"

The default-deny egress is the other half. Without it, anything running inside the sandbox that got hold of the credential could still send it anywhere. With it, the sandbox can only reach the three hosts the agent needs to talk to a Copilot model. The token is real, the prompt works, and there is nowhere else for it to go.
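To reason about what that rule set admits, it can help to model it in a few lines. This is an illustration of the intended policy, not how the platform evaluates rules. The `isEgressAllowed` helper, the first-match-wins ordering, and the choice that `*.example.com` matches subdomains but not the apex are all assumptions.

```typescript
// Sketch: a toy model of default-deny egress with an allowlist. Illustrative only;
// the platform's real matching semantics may differ.
type EgressAction = "Allow" | "Deny";
type EgressRule = { hostPattern: string; action: EgressAction };

// The three allowances from the provisioning script above.
const copilotEgressRules: EgressRule[] = [
  { hostPattern: "github.com", action: "Allow" },
  { hostPattern: "api.github.com", action: "Allow" },
  { hostPattern: "*.githubcopilot.com", action: "Allow" },
];

function matchesHost(pattern: string, host: string): boolean {
  const p = pattern.toLowerCase();
  const h = host.toLowerCase();
  if (p.startsWith("*.")) {
    // Assumption: a wildcard covers subdomains only, so the host must end with
    // ".domain" and have at least one character before that dot.
    const suffix = p.slice(1); // ".githubcopilot.com"
    return h.endsWith(suffix) && h.length > suffix.length;
  }
  return h === p; // exact match, no implicit subdomains
}

function isEgressAllowed(
  host: string,
  rules: EgressRule[] = copilotEgressRules,
  defaultAction: EgressAction = "Deny",
): boolean {
  // Assumption: the first matching rule wins; no match falls through to the default.
  const rule = rules.find((candidate) => matchesHost(candidate.hostPattern, host));
  return (rule?.action ?? defaultAction) === "Allow";
}
```

Under these assumptions `api.githubcopilot.com` is reachable, while `github.com.evil.example` and `evilgithubcopilot.com` are not, which is the property the allowlist is there to provide.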

The proof that this composes is a small catalog command—the Copilot prompt proof I mentioned earlier. Print the CLI version. Run a single non-interactive prompt asking the model, “What is 2+2?” with instructions to reply with one integer and no prose. Capture the answer. Exit. The deterministic prompt is the smallest possible signal that the credential resolved, the egress allowed the call, and the model returned.

// src/sandboxes/commandCatalog.ts  (squad-langgraph-aca:wip) — BuildCopilotPromptProofShell 
return ( 
  "set -eu\n" + 
  "echo '== copilot CLI version =='\n" + 
  "copilot --version\n" + 
  "echo '== authenticated prompt (deterministic answer expected) =='\n" + 
  "timeout 90 copilot -p \"What is 2+2? Reply with a single integer and no prose.\" 2>&1 | tr -d '\\r' | tail -10\n" + 
  "echo '== proof complete =='\n" 
);
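On the graph side, something has to decide whether that output counts as a pass. The sketch below shows one way to check it, assuming the redacted stdout still contains the two echo markers from the shell above and that the model's reply appears as a line containing only digits. The `checkPromptProof` name and the result shape are hypothetical, and the real CLI output may include extra lines this parser ignores.

```typescript
// Sketch: decide whether the prompt-proof output demonstrates a working credential.
type ProofResult = { ok: boolean; answer: string | null; reason: string };

// Markers copied from the echo lines in the catalog command.
const promptMarker = "== authenticated prompt (deterministic answer expected) ==";
const completeMarker = "== proof complete ==";

function checkPromptProof(stdout: string, exitCode: number): ProofResult {
  if (exitCode !== 0) {
    return { ok: false, answer: null, reason: `command exited with code ${exitCode}` };
  }
  const lines = stdout
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  const start = lines.indexOf(promptMarker);
  const end = lines.indexOf(completeMarker);
  if (start === -1 || end === -1 || end <= start) {
    return { ok: false, answer: null, reason: "expected markers are missing or out of order" };
  }

  // Assumption: the model's reply is the first digits-only line between the markers.
  const answer = lines.slice(start + 1, end).find((line) => /^\d+$/.test(line)) ?? null;
  if (answer !== "4") {
    return { ok: false, answer, reason: "model did not return the expected integer" };
  }
  return { ok: true, answer, reason: "credential resolved, egress allowed the call, model replied" };
}
```

Treating a missing marker as a failure matters because the pipeline in the shell string ends in `tail`, so the command's exit code alone does not prove the prompt succeeded.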

That command ran end to end against a real Azure subscription last week. The prompt resolved through the sandbox-group credential. Default-deny egress was active. The model returned the expected answer in eight seconds, at a cost of 8.33 AI credits. The token was not in the disk image, not copied from the host, and not in any environment variable.

That is the punchline. The architecture is publishable now because the credential is out of reach of the data plane. Without that, you can have all the catalogs and all the egress rules and all the suspend lifecycles you want, and you’re still one log paste away from a leaked token.

Put the credential where the application can’t find it. Let the platform inject it. Default-deny the network. Run a tiny prompt to prove it works. Then sleep.

What’s shipped, what’s coming

A short status note, because I would rather you go look at the code than take my word for it.

Three of the four repos are public. The LangGraph + Squad baseline is squad-langgraph-factory. The Dynamic Sessions sibling is squad-aca-dynamic-sessions—the custom-container worker, the pool provisioning script, the TypeScript client. The ACA Sandboxes sibling is squad-aca-sandboxes-workspace, written in Python, with a sister catalog and a documented safety-gate model.

The unifier—the one repo where both ACA primitives wire into the same LangGraph graph and the seven-node flow runs end to end—is still private. It lives at squad-langgraph-aca; the integration is on a work-in-progress branch. Main is the imported baseline; wip is the real thing. The repo flips to public once the last roadmap phase lands—the boring one with the full README, the architecture diagrams, the run screenshots. The interesting work is done.

The pattern is portable. The brain doesn’t have to be LangGraph. The judgment node doesn’t have to be Squad. The lanes don’t have to be these two ACA primitives. What you need are four shapes: a deterministic state machine, a narrow judgment seam, two execution lanes for two kinds of work, and a credential model outside your application’s data plane.

Get those four right, and your agent can have hands without becoming the reason you carry a pager.

The durable asset is the loop you own. OpenEnv is its protocol.

Last year, agents finally got a standard way to use tools. MCP caught on fast because it solved something tedious and real: Every tool spoke its own dialect, and nobody wants to maintain the same integration 10 times over. Learning never got that treatment. An agent can call a tool, but there’s still no shared way for it to practice and get better at the actual job. OpenEnv goes after that gap, which is at least as big as the one MCP closed.

Jay Parikh made the case that what moves your business is the system around the model, not the model by itself. Satya Nadella put a finer point on it: The asset you keep isn’t the model you rent; it is the learning loop you own. That can land like a slogan, so here is the concrete version: The loop is an environment where your agent does the real work, a rubric that scores the outcome you actually care about instead of some proxy, rollouts you can repeat, and a way to turn those scores into a better agent. That part compounds. The model in the middle is the easy thing to swap.

“The winners won’t be those with the most demos, but those that turn AI into a governed, continuously improving system for running real work.”

– Jay Parikh, EVP of CoreAI, Microsoft

What you can’t just swap out is the rest of the loop, and its hard part is turning a score into a better agent. There are two ways to do that. One leaves the weights alone and reworks everything around the model: the prompt, the tools, the skills. The other retrains the model itself. Make a change either way, keep it only if it wins on tasks the agent hasn’t seen, and send that version back in. Each lap starts higher than the last. That is the hill-climbing loop, and the diagram below puts it on one page.

The hill-climbing loop. The full walkthrough, including the non-parametric vs. parametric split, is in the companion post on the Microsoft Foundry blog.
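
The accept-only-if-it-wins rule is easy to sketch. What follows is a toy illustration, not the implementation from the companion post: the "agent" is a single number, each held-out task is a target value, and `toy_score` and `toy_propose` are hypothetical stand-ins for a real rubric and a real change (a prompt edit, a tool change, or a retrain).

```python
import random
from typing import Callable, List, TypeVar

Agent = TypeVar("Agent")


def hill_climb(
    agent: Agent,
    propose: Callable[[Agent, random.Random], Agent],
    score: Callable[[Agent, List[float]], float],
    held_out_tasks: List[float],
    laps: int,
    seed: int = 0,
) -> Agent:
    """Accept a proposed change only if it scores higher on held-out tasks."""
    rng = random.Random(seed)
    best, best_score = agent, score(agent, held_out_tasks)
    for _ in range(laps):
        candidate = propose(best, rng)  # stand-in for a prompt/tool edit or a retrain
        candidate_score = score(candidate, held_out_tasks)
        if candidate_score > best_score:  # keep the change only if it wins
            best, best_score = candidate, candidate_score
    return best


# Hypothetical toy rubric: negative mean squared distance to each task's target.
def toy_score(agent: float, tasks: List[float]) -> float:
    return -sum((agent - t) ** 2 for t in tasks) / len(tasks)


# Hypothetical toy change: nudge the agent by a small random amount.
def toy_propose(agent: float, rng: random.Random) -> float:
    return agent + rng.gauss(0.0, 0.5)


held_out = [2.0, 3.0, 4.0]
start = 0.0
final = hill_climb(start, toy_propose, toy_score, held_out, laps=200)
```

Because a candidate replaces the incumbent only on a strict improvement, the held-out score can never go down from one lap to the next, which is the property the loop depends on.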

An environment is not a test harness

Most teams treat an environment as a test: Run the agent, read a score, move on. That undersells it. Codify the outcome you actually want, as a rubric, along with the workflow, the tools, and the constraints, and the environment stops being a test and becomes a learning system: The agent practices in it, gets scored against that outcome, and gets better with every run. What stands in the way is rarely the idea. It is the plumbing. Every trainer, runtime, and model expects the environment in a different shape, so every pairing becomes its own integration. OpenEnv removes that tax. One small contract (reset, step, state) gives the whole stack three properties it never had: open, because the standard is community-built; interoperable, because any model, trainer, or runtime can speak it; and modular, because you can swap any one of them without rebuilding the environment.
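
To make the three-verb contract concrete, here is a minimal sketch of an environment that exposes reset, step, and state. It is only an illustration of the shape: `StepResult`, `GuessNumberEnv`, and `BinarySearchPolicy` are hypothetical names invented for this example, not classes from the OpenEnv package, and the real protocol's types and transport may differ.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class StepResult:
    observation: str
    reward: float
    done: bool


class Environment(Protocol):
    """The three-verb contract, written as a structural type (illustrative)."""

    def reset(self) -> str: ...
    def step(self, action: str) -> StepResult: ...
    def state(self) -> dict: ...


class GuessNumberEnv:
    """Toy environment: the rubric pays 1.0 for guessing the hidden number."""

    def __init__(self, target: int) -> None:
        self._target = target
        self._steps = 0

    def reset(self) -> str:
        self._steps = 0
        return "guess a number"

    def step(self, action: str) -> StepResult:
        self._steps += 1
        guess = int(action)
        if guess == self._target:
            return StepResult("correct", 1.0, True)
        hint = "higher" if guess < self._target else "lower"
        return StepResult(hint, 0.0, False)

    def state(self) -> dict:
        return {"step_count": self._steps}


class BinarySearchPolicy:
    """Toy agent that narrows its range using the environment's hints."""

    def __init__(self, lo: int, hi: int) -> None:
        self.lo, self.hi, self.last = lo, hi, lo

    def act(self, observation: str) -> str:
        if observation == "higher":
            self.lo = self.last + 1
        elif observation == "lower":
            self.hi = self.last - 1
        self.last = (self.lo + self.hi) // 2
        return str(self.last)


def rollout(env: Environment, policy: BinarySearchPolicy, max_steps: int = 16) -> float:
    """Run one episode using only reset/step, and return the total reward."""
    observation = env.reset()
    total = 0.0
    for _ in range(max_steps):
        result = env.step(policy.act(observation))
        total += result.reward
        observation = result.observation
        if result.done:
            break
    return total


env = GuessNumberEnv(target=11)
total_reward = rollout(env, BinarySearchPolicy(0, 15))
```

The point of the sketch is that `rollout` knows nothing about the environment beyond the contract, so any other environment with the same three methods could be dropped in without changing it.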

So here is the claim, stated plainly: OpenEnv can become for agent learning what MCP became for tools and context. That’s a strong claim. It is also the right one, because it makes the environment, not the vendor, the unit of reuse. That’s why Microsoft joined OpenEnv alongside Hugging Face, Meta’s PyTorch team, NVIDIA, Prime Intellect, Unsloth, Modal, and others. OpenEnv isn’t a framework. It’s a protocol.

What it unlocks is ownership: private environments, private evals, repeatable rollouts, secure sandboxes, and optimization that isn’t married to one model or trainer. You stop calling a frontier model and hoping. You start owning the loop that makes an agent better at your work, and you keep that loop when the model underneath it changes.

The protocol only stays relevant if it absorbs the frontier

An open standard earns its place by pulling in research, not by sitting still. The clearest example we have shipped is a PR: ECHO world-modeling, landed as RFC 010, which brings a Microsoft Research result, “Terminal Agents Learn World Models for Free,” into OpenEnv where any team can use it (microsoft/echo-rl). A lab technique becomes a shared capability. That is how the loop gets democratized.

Here’s what it does: An agent transcript interleaves actions (what the model writes) with observations (what the environment writes back). Standard agent-RL trains on the actions and throws the observations away. ECHO keeps them: it adds a small cross-entropy term that makes the policy predict the environment’s own tokens, in effect a world model, from logits it already computed in the same forward pass. No extra rollouts, no teacher, no labels.

L = L_GRPO(action tokens) + λ · CrossEntropy(observation tokens)

ECHO in one step. One rollout, split by per-token role: Actions get the RL loss, observations get a λ-weighted cross-entropy loss, summed into a single optimizer step. λ = 0 is vanilla RL, so it is safe to adopt incrementally.
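
As a rough sketch of that split, the snippet below sums a loss over one rollout by per-token role. It makes several simplifying assumptions that are not from the ECHO code: the RL term is a plain advantage-weighted log-probability surrogate standing in for the full GRPO objective, both terms are summed rather than averaged, and `echo_loss` and its arguments are hypothetical names.

```python
import math
from typing import List, Tuple


def echo_loss(
    logprobs: List[float],   # log-prob the policy assigned to each realized token
    roles: List[str],        # "action" or "observation", one per token
    action_advantage: float, # stand-in for the sequence's GRPO advantage
    lam: float,              # weight on the observation cross-entropy term
) -> Tuple[float, float, float]:
    """Return (rl_term, ce_term, total) for a single rollout."""
    rl_term = 0.0
    ce_term = 0.0
    for logprob, role in zip(logprobs, roles):
        if role == "action":
            rl_term += -action_advantage * logprob  # simplified RL surrogate
        else:
            ce_term += -logprob                     # cross-entropy on env tokens
    return rl_term, ce_term, rl_term + lam * ce_term


roles = ["action", "observation", "observation", "action"]
logprobs = [-0.5, -1.0, -2.0, -0.25]

rl_only = echo_loss(logprobs, roles, action_advantage=2.0, lam=0.0)
with_echo = echo_loss(logprobs, roles, action_advantage=2.0, lam=0.1)
```

With `lam=0.0` the total equals the RL term alone, which is the incremental-adoption property the caption describes.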

The discarded signal isn’t a rounding error. On a captured agent episode, 4,659 of 5,247 learnable tokens, 89%, are environment observations, 7.9 times the action tokens. Prime Intellect reaches the same place in “True Agents Model the World,” restating supervised learning on tool-response tokens as RL with a constant positive advantage, foldable in at no extra cost. Two groups, one direction: World-modeling belongs inside the RL loop, not bolted on afterward.
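
The percentages follow directly from the two counts quoted above. A quick check, using only those figures:

```python
learnable_tokens = 5247
observation_tokens = 4659

# Whatever is not an observation token is an action token.
action_tokens = learnable_tokens - observation_tokens

observation_share = observation_tokens / learnable_tokens
observation_to_action_ratio = observation_tokens / action_tokens

print(f"action tokens: {action_tokens}")
print(f"observation share: {observation_share:.1%}")
print(f"observations per action token: {observation_to_action_ratio:.1f}")
```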

The honest version of the result is about generalization, not a magic number. With λ on versus off, training reward barely moves; held-out performance is where ECHO pulls ahead. Its published results: held-out pass@1 roughly doubles on TerminalBench-2.0, RL reaches its target about 2.3× faster, and it recovers 50% to 104% of expert-SFT with no teacher. Keep λ small and sweep it; the dense signal overfits if you push it.

What the weight update buys. Same training reward; held-out pass@1 roughly doubles. ECHO also reports about 2.3× faster RL and 50% to 104% of expert-SFT recovered with no teacher (arXiv 2605.24517, microsoft/echo-rl on SkyRL; corroborated by Prime Intellect).

You can watch it on a laptop in about 40 seconds. A small model on a deterministic toy terminal env drives held-out env-token cross-entropy toward zero. It reaches zero only because that toy world is fully predictable; a real environment keeps its irreducible entropy (near 4.4 nats), so ECHO sharpens predictions rather than perfecting them. The repo is open: OpenEnv, examples/echo_world_model, python train_echo.py --steps 60 --seed 0.

Reproduce it on CPU. A toy, fully deterministic terminal env, so cross-entropy can approach zero; a real env keeps its irreducible entropy instead. The held-out line bottoms near step 40 and then mildly overfits, which is why λ stays small.

And it survives the jump from a laptop to real training. Because supervised learning on the observation tokens is just RL with a constant positive advantage, there is no second loss function: You reuse the same forward_backward and add a small positive advantage on the environment tokens. One vector changes, and the same one-line config runs on the open SkyRL reference, on Tinker, and on managed post-training unchanged. We ran it live on a small Qwen model; the backend metrics came back namespaced skyrl.ai, the open reference stack running underneath.
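
The "one vector changes" idea can be sketched in a few lines. This is an illustration under assumptions, not the SkyRL or Tinker API: `token_advantages` and `weighted_logprob_loss` are hypothetical helpers, and the loss is a simple advantage-weighted sum of log-probabilities rather than the full training objective.

```python
import math
from typing import List


def token_advantages(roles: List[str], action_advantage: float, lam: float) -> List[float]:
    """Action tokens keep their RL advantage; observation tokens get a constant lam."""
    return [action_advantage if role == "action" else lam for role in roles]


def weighted_logprob_loss(logprobs: List[float], advantages: List[float]) -> float:
    """One loss function for every token: minus advantage times log-probability."""
    return -sum(adv * lp for adv, lp in zip(advantages, logprobs))


roles = ["action", "observation", "observation", "action"]
logprobs = [-0.5, -1.0, -2.0, -0.25]
lam = 0.1

advantages = token_advantages(roles, action_advantage=2.0, lam=lam)
single_pass_loss = weighted_logprob_loss(logprobs, advantages)

# The same quantity, computed as two separate terms for comparison.
rl_term = -2.0 * (-0.5 + -0.25)  # advantage-weighted action tokens
ce_term = -(-1.0 + -2.0)         # cross-entropy on observation tokens
two_term_loss = rl_term + lam * ce_term
```

The two numbers agree, which is why no second loss function is needed: the observation term is absorbed into the advantage vector.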

The interesting part is what happens next

Once your workflow, tools, and rubric live in an OpenEnv environment, the same trace data that post-trains the model can improve the environment itself: curricula that generate harder tasks as the agent gets better, harness optimizers, new environments built from captured production traces. That is recursive self-improvement, and it is on the roadmap, not in a paper. The system writes its own next set of exercises, and each cycle sharpens the next. The learning stops living only in the weights and starts accruing in the gym, which is the part you own.

Take one real workflow, turn it into an OpenEnv-compatible environment with a clear outcome rubric, and start hill-climbing. The model should be swappable. The loop should be yours.

For the full walkthrough of the loop, the product details, and the non-parametric vs. parametric breakdown, see the companion post on the Microsoft Foundry blog.

Information-flow control: Moving toward secure, autonomous agents

When agents can take high-stakes actions like sending an email, sharing a business document, or opening a pull request, a single misstep has the potential to leak confidential data or hand control to an attacker that may then invoke tools that break security or cause damage. Today, we often manage that risk by putting a human in the loop to approve consequential actions. This scales poorly, erodes vigilance, and takes away the very autonomy that makes agents useful.

We lean on humans as a safeguard because the models driving agents behave stochastically, make mistakes, and could be steered by malicious content smuggled in through prompt injection. Despite progress in model alignment, contextual awareness, and content safety classifiers, security can’t depend solely on probabilistic mitigations. A good rule of thumb to keep in mind when designing an agentic system is that anything that an agent can do in response to a user prompt can also be accomplished by a model’s mistake or by an attacker with a prompt injection.

A promising path towards secure and autonomous agents is through information-flow control (IFC), a deterministic security system built on three simple steps:

Label data. Every piece of data that an agent ingests carries labels for integrity (for example, trusted or untrusted) and confidentiality (for example, public, confidential, or a read-access list such as {Alice, Bob, Charlie}).
Propagate labels. As data flows into the agent loop and derivative results are produced, labels travel with them. Derived data is labelled conservatively with the least upper bound of its sources: a result influenced by an untrusted input stays untrusted, and a result based on two documents is readable only by principals who could read both source documents.
Check before acting. Before each tool call, a policy engine inspects the relevant labels and decides whether to allow the action, block it, or ask a human to review it.
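
The three steps can be sketched in a few lines. Everything here is an assumption made for illustration: the `Label` type, the `join` helper, and the `check_send` policy are hypothetical and are not the labels or policies used by the prototypes described later in this post.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Label:
    trusted: bool                      # integrity
    readers: Optional[FrozenSet[str]]  # confidentiality; None means public


def join(a: Label, b: Label) -> Label:
    """Least upper bound: untrusted if either source is, readable only by common readers."""
    if a.readers is None:
        readers = b.readers
    elif b.readers is None:
        readers = a.readers
    else:
        readers = a.readers & b.readers
    return Label(trusted=a.trusted and b.trusted, readers=readers)


def check_send(context: Label, recipients: FrozenSet[str]) -> str:
    """Illustrative policy for a tool that sends data to the given recipients."""
    reveals_more = context.readers is not None and not recipients <= context.readers
    if not reveals_more:
        return "allow"  # nobody learns anything they could not already read
    if context.trusted:
        return "ask"    # a human decides whether to widen access
    return "deny"       # untrusted input must not drive a disclosure


# Step 1: label data as it is ingested.
doc_a = Label(trusted=True, readers=frozenset({"Alice", "Bob"}))
doc_b = Label(trusted=False, readers=frozenset({"Bob", "Charlie"}))

# Step 2: propagate labels to the derived result.
summary = join(doc_a, doc_b)

# Step 3: check before acting.
to_bob = check_send(summary, frozenset({"Bob"}))
to_charlie = check_send(summary, frozenset({"Charlie"}))
```

The decision depends only on labels, never on what the model says about the content, which is what makes it auditable.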

This turns a probabilistic system into one with guarantees you can audit. Because the policy engine relies on labels that an attacker can’t manipulate and is independent of the model’s judgement, it can enforce policies deterministically. The policy “untrusted data can never influence a consequential action” closes off prompt injection. The policy “data can only egress to destinations compatible with its confidentiality label” closes off data exfiltration. The user is consulted only when it genuinely matters—for example, when an action risks revealing information to someone who didn’t previously have access to it. The UI dialogs shown to the user can also be made more effective, highlighting the origin of untrusted data or what data is being shared more broadly and with whom.

In our past research, we showed how IFC can reduce the need for human intervention, increasing autonomy while offering deterministic security guarantees. In this post, we focus on how IFC can be integrated into real agentic systems based on GitHub Copilot CLI, the Microsoft Agent Framework, and the Model Context Protocol (MCP). We begin with two representative scenarios and then walk through the mechanisms and prototypes that can realize them securely.

Coding assistant

About a year ago, researchers showcased a prompt injection attack that can occur in coding assistants connected to the GitHub MCP server. In this attack, a malicious user (in the image above: sofiagarcia) opens an issue in a public repository asking for information from the private repository (here: contoso/core) to be added as a comment. When this issue is handled by an agent that acts on behalf of a user (here: alexmurphy_contoso) with access to the private repository, data from the private repository is exfiltrated to the public.

IFC prevents this attack: The issue in the public repository is labeled “untrusted,” and content from the private repository is labeled “private.” A policy prevents an agent with context labeled (untrusted, private) from posting to a public channel (which would complete the lethal trifecta), preventing the exfiltration of data. In contrast, when working only on public or only on private repositories, IFC lets the task complete autonomously.

Business assistant

IFC can also prevent unintended leakage in benign contexts. Consider a user (Alex) who asks an agent connected to the Work IQ Mail MCP server to handle unanswered emails in their inbox. The inbox has an email from Priya with a preview of the quarterly sales.

The inbox also has an email from Marco, who is curious but isn’t authorized to learn the sales numbers ahead of time. When run fully autonomously, we risk the agent sending this information to Marco. IFC catches this leak because once the agent has read both emails, the generated response has confidentiality label {Alex, Priya} ∩ {Alex, Marco} = {Alex} and thus must not be sent to {Marco} autonomously.

In contrast, if Marco had been copied on Priya’s email, the response would be labeled {Alex, Marco}. This guarantees that Marco can’t learn information he isn’t privy to from the summary, and the agent can send the email autonomously.
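
Both outcomes are plain set arithmetic on the reader lists. A small worked example, using Python sets as a stand-in for the confidentiality labels (the variable names are illustrative):

```python
# Readers of each email: the sender plus its recipients.
priya_email = frozenset({"Alex", "Priya"})
marco_email = frozenset({"Alex", "Marco"})
recipient = frozenset({"Marco"})

# The reply is derived from both emails, so only their common readers may see it.
reply_readers = priya_email & marco_email
can_send_autonomously = recipient <= reply_readers

# Variant: Marco was copied on Priya's email.
priya_email_with_cc = frozenset({"Alex", "Priya", "Marco"})
reply_readers_with_cc = priya_email_with_cc & marco_email
can_send_autonomously_with_cc = recipient <= reply_readers_with_cc
```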

Note that emails are just one example of resources shared between users. The same kinds of labels also help prevent data leakage across files, documents, chats, and caches. Likewise, common exfiltration vectors such as rendered links to hosts not explicitly allow-listed can be modeled as public channels.

Integrating IFC into agentic orchestrators and tools

Information-flow control requires security labels for data ingested by an agent and security policies for tools. Tools propagate labels from call arguments to results, the orchestrator propagates labels from results to subsequent tool calls, and a policy engine mediates tool execution based on applicable policies. This logic applies both to local tools, such as those that execute shell commands and filesystem operations, and to tools in remote MCP servers. In the remainder of this post, we focus on MCP tools to explain how we leverage the protocol’s metadata fields to communicate labels and policies to enlightened clients while maintaining compatibility with clients unaware of these mechanisms.

Figure 1. A client running an agent loop like GitHub Copilot CLI uses tools to accomplish users’ tasks. Tools return labeled results, which the client propagates to subsequent tool calls. A policy engine analyzes labeled tool calls to enforce information-flow control policies.

Communicating labels

MCP supports general metadata fields in selected places to allow clients and servers to attach additional metadata to their interactions. We include labels in tool call requests and tool results in the _meta field on MCP’s CallToolRequestParams and CallToolResult interfaces, respectively. This permits label-aware tools to propagate labels from arguments to results taking into consideration runtime behavior, including any external sources consulted.

We communicate labels as a JSON object, with keys specifying the node a label applies to using the JSONPath standard. Labels need only be specified explicitly for selected nodes, with the label of a node propagating top-down to all nested nodes and bottom-up to all container nodes not explicitly labelled.

{ 
  "name": "SendMessageToChannel", 
  "arguments": { 
    "teamId": "ef7e2cda-b319-8915-b9ad-766e3cab529b", 
    "channelId": "19:[email protected]", 
    "content": "FYI, we will announce the new model this Friday", 
    "contentType": "text" 
  }, 
  "_meta": { 
    "com.github.ifc/labels": { 
      "$": { "integrity": "untrusted", "confidentiality": "public" }, 
      "$.arguments.content": { 
        "integrity": "trusted", 
        "confidentiality": [ 
          "3d37Fda2-a982-43be-a7e1-8bc0ef3297a6", 
          "a7f652fc-27eb-a1c7-9f58-fe7cfca6d1c2" 
        ] 
      } 
    } 
  } 
}

Example 1: An MCP tool call request with explicit labels specified using JSONPath at the top-level and one argument.
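
To show how a client might resolve the label that applies to an argument, here is a minimal sketch of the top-down half of the propagation rule only: a node without an explicit label takes the label of its nearest explicitly labelled ancestor. It assumes simple dotted paths such as `$.arguments.content`, ignores bracket and wildcard JSONPath syntax, and does not implement bottom-up propagation to containers. `effective_label` is a hypothetical helper, not part of any of the prototypes.

```python
from typing import Dict, Optional

Label = Dict[str, object]


def effective_label(labels: Dict[str, Label], path: str) -> Optional[Label]:
    """Walk from the node towards the root until an explicit label is found."""
    current = path
    while True:
        if current in labels:
            return labels[current]
        if "." not in current:
            return None  # reached the root without finding a label
        current = current.rsplit(".", 1)[0]  # drop the last path segment


# The explicit labels from the request in Example 1.
labels = {
    "$": {"integrity": "untrusted", "confidentiality": "public"},
    "$.arguments.content": {
        "integrity": "trusted",
        "confidentiality": [
            "3d37Fda2-a982-43be-a7e1-8bc0ef3297a6",
            "a7f652fc-27eb-a1c7-9f58-fe7cfca6d1c2",
        ],
    },
}

content_label = effective_label(labels, "$.arguments.content")
team_id_label = effective_label(labels, "$.arguments.teamId")
```

Here `content` keeps its own explicit label, while `teamId` has none and falls back to the label on the root.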

Communicating policies

Servers can advertise policies in the _meta field of MCP’s Tool interface when listing tools. This can be a literal string representing the policy in a chosen language or a reference to a well-known policy. In our prototype, we use the OPA Rego policy language. Policies are evaluated on a CallToolRequestParams JSON object and produce a decision, indicating if the call should be allowed, denied, or reviewed by a human. We add two Rego extensions:

Calling read-only, closed-world MCP tools to fetch additional information from the server (e.g., calling upstream.ListChannelMembers to list the members of a Teams channel that an agent wants to send a message to using the Work IQ MCP Teams server).
Resolving the effective label of a JSONPath node from the labels included in CallToolRequestParams._meta, using ifc.label.

default decision := {"decision": "deny", "message": ""} 

allow(msg) := {"decision": "allow", "message": msg} 
deny(msg)  := {"decision": "deny",  "message": msg} 
ask(msg)   := {"decision": "ask",   "message": msg} 

context_trusted := ifc.label("$").integrity == "trusted" 
content_readers := ifc.label("$.arguments.content").confidentiality 

members := upstream.ListChannelMembers({ 
   "teamId": input.arguments.teamId, "channelId": input.arguments.channelId 
}) 

target_user_ids := {m.userId | some m in members.members} 
allowed_user_ids := {m | some m in content_readers} 
missing := target_user_ids - allowed_user_ids 

msg := sprintf("Sending the message would declassify it to users with IDs %s.",  
               [concat(",", sort(missing))]) 

decision := allow("The tool call was generated in a trusted context.") if { 
  context_trusted == true 
} else := allow("All channel members are authorized to read the content.") if { 
  count(missing) == 0 
} else := ask(msg) if { 
  count(missing) > 0 
} else := deny("Denied")

Example 2: A Rego policy for the SendMessageToChannel tool in the Work IQ Teams MCP server enforcing robust declassification (declassifying is only allowed in trusted contexts and can’t be triggered by a prompt injection).

Extending existing MCP servers

We collaborated with GitHub to extend both local and remote versions of the GitHub MCP server to include top-level labels in tool results. For example, we label files and issues retrieved from public repositories as “public” and “untrusted” and from private repositories as “private” and “trusted.” GitHub agentic workflows makes similar choices to enforce information-flow control. To enable this feature, include the header X-MCP-Features: ifc_labels in the server configuration.

While we hope that more servers adopt these or similar labeling mechanisms over time, we open-source an MCP gateway to experiment with more expressive labels and workflows including different MCP servers. The gateway operates middleware to propagate labels in tool calls and advertise policies for off-the-shelf servers. It also exposes an eval_policy tool for clients to evaluate Rego policies using Regorus. We implemented support for selected tools in the Work IQ MCP servers in the gateway. Configuring a new MCP server requires, for each tool, (1) specifying an outputSchema for structured content in results, (2) writing a Python function to propagate labels from arguments to results, and (3) writing a Rego policy for the tool or assigning to it one of the built-in policies.

Figure 2: A label-aware agent orchestrator like GitHub Copilot CLI can communicate with label-aware servers such as the GitHub MCP server and with off-the-shelf servers through a labeling gateway.

MCP tool annotations offer another path to integrate IFC into existing servers without having to write labeling functions or policies. For instance, tools annotated as readOnlyHint == true and openWorldHint == false can be unconditionally allowed, tools annotated as readOnlyHint == true and openWorldHint == true can be allowed only when all arguments are “public,” while tools with a destructiveHint == true annotation may always warrant user review. We can also infer safe labels by assuming that all arguments in a tool call may flow into tool results, labeling results of open-world tools as “untrusted” and of tools requiring authentication as “private.”
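
Those annotation rules translate into a small decision function. The sketch below is one possible reading, not the gateway's implementation: `decide_from_annotations` is a hypothetical name, the annotations are assumed to arrive as a plain dictionary, and falling back to human review for every case the rules above don't cover is an assumption.

```python
from typing import Dict, List


def decide_from_annotations(annotations: Dict[str, bool], argument_labels: List[str]) -> str:
    """Map MCP tool annotations to "allow" or "ask" without a per-tool policy."""
    read_only = annotations.get("readOnlyHint") is True
    open_world = annotations.get("openWorldHint")

    if read_only and open_world is False:
        return "allow"  # reads from a closed world: nothing can leak out
    if read_only and open_world is True:
        # Arguments may reach the open world, so they must all be public.
        if all(label == "public" for label in argument_labels):
            return "allow"
        return "ask"    # assumption: non-public arguments go to review
    if annotations.get("destructiveHint") is True:
        return "ask"    # destructive tools always warrant user review
    return "ask"        # assumption: anything not covered above goes to review


closed_read = decide_from_annotations(
    {"readOnlyHint": True, "openWorldHint": False}, ["private"]
)
open_read_public = decide_from_annotations(
    {"readOnlyHint": True, "openWorldHint": True}, ["public", "public"]
)
open_read_private = decide_from_annotations(
    {"readOnlyHint": True, "openWorldHint": True}, ["public", "private"]
)
destructive = decide_from_annotations({"destructiveHint": True}, ["public"])
```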

Extending clients

To integrate information-flow control, orchestrators need to include labels in tool calls they make, propagate labels in results throughout the execution of an agent, and evaluate policies before executing tool calls. We describe next how we did this for GitHub Copilot CLI and Microsoft Agent Framework.

GitHub Copilot CLI

We worked with GitHub to implement experimental support for IFC in GitHub Copilot CLI, available under the FIDES_IFC feature flag. When enabling this feature (e.g., in bash, running FIDES_IFC=true copilot), GitHub Copilot CLI maintains a context label that it updates every time it receives a tool result and that it attaches as the top-level label in tool calls. The orchestrator natively enforces sensible policies for selected tools from the GitHub MCP server. It does not yet have full tool coverage or support for other MCP servers.

Figure 3: Sample UI dialog shown when a tool call does not meet information-flow policies.

Microsoft Agent Framework

We also integrated IFC support into the security module that ships with the Microsoft Agent Framework Python core package. The module allows developers to build agents that incorporate information-flow control with a simple configuration change using the SecureAgentConfig context provider. Agents configured in this way support the Dual LLM pattern, providing the orchestrator with tools to extract information from untrusted data by querying a Quarantined LLM or to explicitly reveal the data, tainting the agent’s context.

config = SecureAgentConfig( 
    enable_policy_enforcement=True, 
    auto_hide_untrusted=True, 
    approval_on_violation=True, 
    allow_untrusted_tools={"read_issue"}, 
    quarantine_chat_client=FoundryChatClient(model="gpt-4o-mini", ...) 
) 

agent = Agent( 
    client=FoundryChatClient(...), 
    instructions="You are a GitHub issue triage assistant.", 
    tools=[read_issue, post_comment, read_file, write_file], 
    context_providers=[config],
)

Example 4: A GitHub issue triage agent leveraging the Dual LLM pattern in Microsoft Agent Framework.

We implement the overall flow as middleware invoked before and after every tool call. Post-tool call middleware examines labels in tool results, placing untrusted content inside variables and updating the global context label. Pre-tool call middleware enforces information-flow policies on tool calls. Policy violations result either in a request for human review or a blocked call, depending on the agent’s configuration. IFC-enabled agents can run in Agent Framework’s CLI or DevUI modes. See this blog post for an in-depth description of the new security capabilities integrated into Agent Framework and this PR for the gateway integration.

Where we’re going

We’ve only scratched the surface of the security and autonomy gains unlocked by IFC. For instance, the full flexibility and power of the Dual LLM pattern becomes even more evident with finer-grained labels, because structured tool results often include a mix of data from diverse sources, which can be labeled and treated differently. Untrusted and confidential data in results can be placed in variables and made available to the orchestrator only through Quarantined LLM queries, with the structure and the rest of the data revealed in the clear. Constrained decoding can be used to extract sanitized information from untrusted or confidential data, giving attackers little elbow room for manipulating actions and exfiltrating data. Finally, making orchestrators aware of data labels and the security policies enforced allows them to plan their actions to avoid hitting policy blocks and unnecessarily prompting users.

We will work with the MCP community to collect input, refine, and reach consensus on a proposal to enhance the protocol with support for IFC labels and policies. By making available the prototypes described in this post, we invite others to experiment with these ideas, build on them, and bring secure and autonomous agents closer to reality.

Acknowledgements

Project leads & contact: Boris Köpf, Santiago Zanella-Béguelin

Contributors: Gokhan Arkan, Amaury Chamayou, Manuel Costa, Aashish Kolluri, Joanna Krzek-Lubowiecka, Mark Russinovich, Rishi Sharma, Shruti Tople

Composing a new platform for agent-first devices

Abstract

What changes when agents become both a new unit of programming and an emerging new unit of human-to-machine interaction? The mission of Project Solara, a new software platform coupled with tailored hardware solutions, is to pioneer agent-first experiences that are shaped around you: your agents, your tasks, your environment, under your control. So, what’s different this time from previous generations of computers? Agents and AI accelerate the creation of even more specialized computers without incurring the full cost and tradeoffs that in the past limited the creation, diversity, and specialization of those new forms. We imagine a diverse ecosystem of agent-first devices, from small to large, from fixed to hypermobile, from personal to professional. We’re starting this journey with two concepts designed for the enterprise—and we’re excited to navigate this transformation with you all.

I manage the Applied Sciences Group, an interdisciplinary team that brings together product engineering, research, and the sciences to explore what comes next in computing. The rise of agents is changing not only how software is built, but how people interact with computers—and ultimately, what new kinds of computers may become possible. We are excited to give you an early look at where we believe computing is headed, and what the next computer may look like.

The next computer

When we think of a computer, we tend to picture something familiar: a laptop, a phone, maybe a tablet. But computing has never really stood still. It keeps moving closer to us, closer to the work, closer to the moment where it can provide the most value.

Mainframes did not disappear when PCs arrived. PCs did not disappear when phones arrived. Phones did not disappear when watches arrived. Each new form became more specialized, closer to you, closer to the solution you need. Each one found a new place in our lives because it was better suited to a specific context, a specific task, or a specific moment. So, what’s next?

Agents as the new interaction technology

At Build 2023, I shared my perspective on three emerging AI application structures, shaped by how AI functions relative to your application: Is the AI beside your app, inside it, or outside it?

In the first application structure, the AI is beside your application, like a helper. It keeps the original app architecture and is minimally disruptive to what our customers already know.

In the second application structure, the AI is inside, as part of the main scaffolding; it becomes the main input loop. Here, AI is used to redefine the application’s interaction model and even its purpose. The experience becomes less dependent on point-and-click commands and becomes more automatic. This is where we are seeing the emergence of agents (for example, Researcher and Agent Mode in Office) and AI-first applications.

The third AI application structure is where AI moves from operating within the application frame to operating outside it, globally. Here, AI orchestrates across multiple apps and services, allowing the agent to connect, coordinate, and maintain context across entire workflows, across devices, and even across very different timescales. Current examples include the recent emergence of various claws (like OpenClaw and Lobster), coworker-like agents, and similar systems.

And so here we are today where agents are a new unit of programming and the new unit of human-to-machine interfaces, changing the way people interact and use their computers. And as we have seen many times in the past, new interaction technologies enable new types of computers.

New interaction technology enables new types of computers

Every new computer form factor follows the pattern shown above. A jump in processing power, both in the cloud and at the edge, has enabled us to create hyper-complex software (AI), making agents possible. Through these agents, human language and dialog are the new interaction technology. For the first time in our history, we can program, direct, and initiate action with computers the way we talk with each other. This higher mode of interaction makes both the computer and us less dependent on the traditional ways we have interacted with computers: keyboards, screens, or even premediated apps. Because of these trends, we are seeing a major opportunity for new types of form factors.

As AI streamlines the traditional development stack, these emerging form factors make it possible to bring agents into places, workflows, and moments that previously were difficult or cumbersome. A more specific and better tool for more specific tasks.

That is the opportunity in front of us: agent-first devices.

Agent-first devices accelerate specialization

Historically, specialization has been expensive. If you wanted to create a new type of computer, you had to build almost everything: hardware, software, services, developer tools, UI patterns, management systems, security models, and an ecosystem. This custom stack has been both a hurdle and a moat for new computer form factors.

Take a look at the diagram above, which illustrates the typical technology stack for a computer. Not just for laptops, but phones, watches, wearables, industrial devices, and so forth. Each layer in that stack represents a major company or even an entire industry. Bringing a new type of computer to market has historically required building out or modifying nearly every layer. This is expensive, difficult, and takes time. But what if it didn’t have to be that way?

AI and the new agent interaction model reduce this burden. AI introduces new UI and app model flexibility into those layers. With just-in-time UI (see below), fewer apps need to be written for specific hardware implementations. With agentic coding, less effort needs to be spent refining a developer SDK for human consumption. As agent-only experiences grow to cover more of users’ needs, fewer of the traditional UX surfaces (like app frameworks or even browsers) need to be implemented for the specific hardware. The boundaries between those layers will blur and, in some cases, disappear.

Therefore, agents enable us to create new types of computers that are more specific, more contextual, and closer to where they add value, without rebuilding the entire stack every time. This is the mission of Project Solara.

Introducing Project Solara

To enable this new era, we are introducing a chip-to-cloud platform, codenamed Project Solara, designed from the ground up for agent-first experiences and the new device form factors they enable. Chip-to-cloud sounds funny, I know, but what it really means is that the “operating system” is liminal, transcending the device and the cloud. The system brings a lightweight window to the edge, where the agent manifests and where the state, via Azure, can encompass a constellation of specialized devices.

This is not just about bringing intelligence to the PC, the browser, or the phone. It is about bringing intelligence into the places where people need it most: in the flow of work, in the environment, and closer to the task at hand.

We are building this platform on a simple premise: The next platform shift is from apps to agents—from software you open to intelligence you invoke; from graphical interfaces of buttons to expressing intent through agents; and from AI operating inside your applications to agents working outside and across your apps, workflows, and devices.

This is not just about asking an agent questions. It is about giving people a more direct way to reason over their work, context, tools, and workflows—without navigating every app, notification, or interface layer.

And because we believe the future will not be defined by one agent, Project Solara is designed for an open, multiple-agent world. Organizations will use Microsoft agents where they add value. They will also source or build their own agents for their specific workflows and requirements.

The platform must bring these agents together coherently, while respecting boundaries between data, domains, identities, and organizations. That is why enterprise manageability, identity, security, privacy, and user control are not afterthoughts. They are part of Project Solara’s foundation.

We are also investing in just-in-time UI: the ability for an agent experience to adapt across devices and modalities without requiring developers to redesign everything for every new form factor. Today, that means semi-structured approaches like adaptive cards and known content types. Over time, it moves toward more dynamic and generative interfaces. This is what makes specialized form factors viable.

We are previewing concepts that explore two very broad categories: stationary and portable. Both are multimodal: glanceable access, voice, vision, and getting to the right agent at the right moment. We are also investigating several verticals across healthcare, retail, the financial industry, and more.

Every place where compute can add value becomes an opportunity to help users achieve more. Every workflow, every environment, every role can have a more specific tool. Not devices built around apps, but devices built around agents—that is the promise of Project Solara. It’s a new way to bring intelligence into the moments and places where people need it most.

We are still early. I don’t want to over-promise. But I also don’t want to understate the significance of the shift. When the cost of specialization drops, innovation accelerates.

More details…

Project Solara is specifically designed for the new era of agent-first devices. It establishes hardware and software requirements that will meet enterprise needs for manageability, security, and privacy, while ensuring critical user experiences are delivered.

The cloud is not the only place intelligence lives. The agent sits between user intent and distributed execution. The UI becomes more like an adaptive access layer. The device becomes a window into long-running intelligence and action. A human-scale interface layer between the person and a larger intelligent environment.

Three pillars to the platform:

Enterprise-readiness, with privacy, security, control, and trust
Agent-driven interaction model with just-in-time UI
Extensibility to bring your own agents

Enterprise-readiness, with privacy, security, control, and trust

Seamless access to your agents must be balanced with transparency and control, so enterprise customers, device users, and the people around them can understand and control how these devices are used.

We are building the Project Solara platform to support enterprise-level hardware and software manageability, security, and privacy protections to securely access services such as WorkIQ. Project Solara includes reference designs that are easy to modify, which accelerates building and customization.

Device-side attributes of Project Solara:

Microsoft Device Ecosystem Platform (MDEP) is an enterprise-grade operating system built on AOSP, designed to meet the highest standards of security, reliability, ease of deployment, and innovation—enabling device makers to build and deploy at scale.
Agent Shell that can dynamically load and tailor multiple cloud-based agents. 
Microsoft Intune allows IT administrators to manage and secure these devices just like PC and mobile devices today.
Entra ID so users can use their existing Microsoft accounts.
Hello for Business with at least one biometric authentication method, like facial recognition or fingerprint, allowing seamless access to the device.
Easy privacy controls like a physical mic mute button, and clear indicators when listening or recording.
Approved chipsets accompanied with applicable reference designs.

These attributes represent our current thinking and will continue to evolve as we continue to build out the platform.

Agent-driven interaction model with just-in-time UI

These new devices are not meant to run traditional apps. They are designed for agents. That shift gives us more flexibility in the user interface, because the experience can adapt to the device, the screen size, the content, and even the mode of interaction—whether visual, voice, touch, or multimodal.

Every new device form factor has traditionally required its own application model, UI patterns, and optimization work for screen size, resolution, runtime, and input method. That is one reason new device categories are so expensive to build, and why they can struggle without a strong app ecosystem behind them.

AI changes that equation. We are already seeing models generate content, images, and layouts tailored to different contexts. If those capabilities become part of the agent loop, an agent can adapt its visual, voice, or multimodal interface to the device it is running on, without forcing developers to redesign the experience for every form factor. We call this broader capability just-in-time UI.

Just-in-time UI exists on a spectrum defined by how much structure is required to render an experience. On one end is responsive UI: highly structured interfaces that reflow predictably across screen sizes. On the other end is fully generative UI: a future state in which AI can create the interface frame by frame with minimal predefined structure. That future is not here yet, but we can already see early signs of it.

Today, Project Solara is intentionally building for the middle of that spectrum—beyond traditional responsive design, but not dependent on unconstrained generation. That gives agents enough flexibility to adapt their presentation across very different devices while preserving consistency and usability. In practical terms, the same agent can render a custom experience on multiple screen sizes and modalities with little or no additional work from the developer. For us, that is the first proof point: a path to specialized devices without requiring developers to rebuild the experience from scratch each time.
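To make the "middle of the spectrum" idea concrete, here is a minimal sketch of one semi-structured payload rendered differently depending on the device. Everything in it is hypothetical and for illustration only: the `Card` and `DeviceProfile` types, the `render` function, and the character-capacity heuristic are assumptions, not part of Project Solara or the Adaptive Cards schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Card:
    """Hypothetical semi-structured content an agent might emit."""
    title: str
    body: str
    actions: List[str]


@dataclass(frozen=True)
class DeviceProfile:
    """Hypothetical description of what a device can present."""
    has_screen: bool
    max_chars: int  # rough text capacity of the display


def render(card: Card, device: DeviceProfile) -> str:
    """Render the same card for voice, a glanceable screen, or a larger screen."""
    if not device.has_screen:
        # Voice-only: turn the structure into a spoken sentence.
        spoken = f"{card.title}. {card.body}."
        if card.actions:
            spoken += " You can say: " + ", ".join(card.actions) + "."
        return spoken
    text = f"{card.title}\n{card.body}"
    if len(text) > device.max_chars:
        # Glanceable: only the title fits, truncated to the display capacity.
        return card.title[: device.max_chars]
    # Larger screen: full text plus one line per action.
    return "\n".join([text] + [f"[{action}]" for action in card.actions])


card = Card("Standup at 10:00", "Room 4, 15 minutes", ["Join", "Snooze"])
small_screen = render(card, DeviceProfile(has_screen=True, max_chars=20))
large_screen = render(card, DeviceProfile(has_screen=True, max_chars=200))
voice_only = render(card, DeviceProfile(has_screen=False, max_chars=0))
```

The agent produces the card once; the per-device decisions live in the renderer, which is the property the paragraph above describes.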

Extensibility to bring your own agents

One of the most important realities of this new era is that there will not be a single dominant agent.

Instead, we are entering a world of many specialized agents, each optimized for different skills (coding, communication, analysis, etc.), datasets and domains, and organizational scopes and requirements. Just as no single app could replace Word, Excel, and PowerPoint, no single agent can meet every need.

This creates a critical challenge: How do you bring multiple agents together into a coherent experience? The most straightforward approach is manually launching agents like launching apps. But soon the user will want more sophistication, more automation, and more coordination. We are working on various software technologies for delegation to specialized agents, like an agent dispatcher and an agent task manager, which can automatically activate or surface agents when needed.

Concept reference device designs

We’re developing concept designs to test and pilot the Project Solara platform. These concept devices are not meant to define the limits of the platform, but to show the range of what becomes possible across stationary, portable, wearable, and hyper-mobile experiences.

While these designs may not become the exact shipping experience, they help inform the platform and experience needs to get us started—and show the power of an agent-first interaction model: devices can be shaped around the agent, the environment, and the workflow, instead of forcing every use case into the same general-purpose form.

Silicon partners

MediaTek and Qualcomm are the first silicon partners working with us to deliver solutions to support Project Solara, starting with initial concept designs and expanding to a broad set of form factors in the future.

With Qualcomm, we’ve worked closely on a portable-device concept-reference design. Qualcomm is a leader in silicon for wearables and other new form factors for intelligent devices.

“Microsoft’s Project Solara is an important step in advancing agent-first experiences across a wide range of devices and form factors,” said Dino Bekis, Qualcomm Senior Vice President for Personal and Wearable AI. “With deep experience enabling the majority of today’s wearable experiences and bringing advanced AI to billions of mobile devices, Qualcomm Snapdragon platforms are uniquely optimized for agentic AI—combining high performance with industry-leading power efficiency. We’re proud to partner with Microsoft to help accelerate this next era of intelligent, personalized computing.” 

With MediaTek, we’ve worked closely on the development of a stationary device concept design. MediaTek has deep expertise and a breadth of device partners across the IoT ecosystem.

“At MediaTek, we’re bringing intelligence to edge devices with best-in-class silicon,” said Vince Hu, MediaTek Senior Vice President & General Manager, Data Center & Computing. “Microsoft’s Project Solara platform will significantly accelerate the opportunity for agent-first experiences and devices. We look forward to our continued collaboration, building from the first device concept to an extended ecosystem of Project Solara-powered devices.”

Portable reference design: Badge concept device

We’ve reimagined a form factor that information workers, nurses, front-line workers, and millions of others use every day: the access badge. This on-the-go, lightweight, always connected companion empowers each person to do more by having their agents always by their side.

Device capabilities include:

Touchscreen display
Hello for Business fingerprint sensor button, allowing secure access to the device and agent
Privacy switch and volume controls
Far-field high SNR microphone array and speaker
Side-facing camera
WiFi, Bluetooth, GNSS, and 5G wireless connectivity
Qualcomm wearable silicon

With Hello for Business fingerprint recognition, you are always a touch away from your agents, so you can quickly glance at what’s coming up next with your Priority Agent, or be one tap away from recording an impromptu hallway conversation with Facilitator.

Using the integrated camera, the platform allows agents, with user permission, to better understand and help take action on the environment around them.

In-place reference design: Desk concept device

For our next concept, we thought deeply about where many of us spend a lot of time today already: our desks. Whether your desk space is limited, or you’ve maximized your config with multiple monitors, we’ve designed a humble yet helpful companion providing frictionless access to your agent to help you stay in your flow.

Device capabilities include:

Touchscreen display
Hello for Business with face authentication
Privacy lock buttons
Microphone mute and volume buttons
Dual far-field microphone array and full-range speaker
UWB presence sensor
2 USB-C ports for power and optional external display or peripheral
WiFi and Bluetooth wireless connectivity
MediaTek IoT silicon

Hello for Business provides enterprise-grade protection and frictionless authentication, so you can glance at your calendar, stay on top of only the most critical items through curated Priority Cards, or tap into the ultimate thought partner with Microsoft 365 Copilot voice that is grounded on your WorkIQ data.

This desk concept can work stand-alone, serve as a companion to your Windows PC, or even become your cloud PC through Windows 365 when connected to an external display. As a companion, it pairs with your PC via Bluetooth, enabling you to hand off tasks between the devices and keep lock state consistent. Plug in a display via USB-C, and the desk agent device can transform into your Windows 365 client—providing access to both the power of your full Windows 365 experience and the benefit of an agent-first device experience.

Together, the badge and desk concept devices show what becomes possible when agents are no longer confined to one app, one screen, or one device. They show how agent-first experiences can move across stationary, portable, and wearable forms—adapting to the user, the context, and the work.

Real-world piloting

We are using these concept designs to inform how these form factors and platform can be built. They will become reference designs for the ecosystem to build turnkey solutions. Inside Microsoft, hundreds of employees are already using these concept devices to improve their workday.

Here are some of the ways we and our partners are using, building, and experimenting with Project Solara to help users be more productive:

Microsoft 365 ecosystem

Microsoft 365 Copilot, through conversational voice, is available at tap or (optional) wake word, allowing you to securely access your data, grounded in WorkIQ. Copilot provides daily briefings, becoming your ultimate thought partner to brainstorm, explore ideas, take action, or get coaching.
Researcher can now help you keep tabs on your long-running projects by providing a more direct way to reach and respond to prompts and share reports when complete.
Facilitator is more accessible, allowing users one-tap access to securely record an in-person meeting, with all the power of transcription, detecting action items, and ensuring this information is grounded in WorkIQ. Never miss an important outcome or struggle to find your notes.
Priority Agent is an experimental agent our team is developing to bring actionable insights and actions directly to you. Grounded in signals across WorkIQ, Priority Agent provides the answer to “what needs my attention right now?” Priority Agent dynamically curates this list, adding and removing items intelligently, so you only glance at what’s needed now.

We are also partnering with other teams across Microsoft to explore how Project Solara can help deliver additional value for users:

GitHub Copilot is exploring how an agent-first approach helps keep developers more in touch with the progress of their coding projects and provides faster ways, through new modalities like voice, to get things done.
Dragon Copilot is exploring how agent-first experiences can better support physicians and nurses in the flow of care—helping capture interactions, surface relevant information in-context, and follow through on critical tasks without interrupting their day.

We’re excited to see how the agents from other third parties will find value and reach users in more direct ways, in more natural modalities. Here are ways you’ll be able to build for Project Solara devices:

Extend Microsoft 365 Copilot with declarative agents or custom-engine agents.
Use Copilot Studio.
Build with Microsoft 365 Agents SDK and Microsoft Agent Framework.

We’ll have more to share on other ways to build agents for Project Solara devices in the future.

Private pilot program

In the coming months, we’ll begin piloting this agent-first device ecosystem with industry leaders like AccuWeather, Best Buy, CVS Health, Levi’s, Target, and others.

Platform ecosystem

Realizing the Project Solara platform vision requires close connections across silicon providers, device builders, agent developers, and customers, especially in the early phases of learning and iteration.

We will extend our collaboration with silicon partners to create reference designs for a range of categories spanning portable, ultra-portable, wearable, desktop, and others.

With those reference designs, we’ll enable OEMs and product makers to develop specialized solutions for specific scenarios and environments across a variety of industry segments (spanning healthcare, retail, hospitality, financial services, legal, industrial, field service, and more) while meeting the needs of enterprise security and management, and seamless access and control for users.

Agent builders will be able to reach more people in more places, using the adaptability of the Project Solara platform to bring their agents into the workflows, environments, and moments where they can create the most value.

People, companies, and other institutions adopting Project Solara will shape the agent-powered, problem-solving experiences that they need.

Together, we will unlock the creativity and energy to establish a broad set of agent-first solutions, empowering everyone to achieve more.

Closing thoughts

I’m excited to share this shift and how we are building a new platform to help usher in a new era of agent-first experiences and devices with our partners.

This is where computing and new types of computers are headed. And importantly, this expands the reach and value of the agents and automation you are already building today.

A device on a desk. A device worn in the field. A device in a hospital, a store, a factory, a school, or a home. Each one becomes a new access point for your agents, and a new way to bring productivity, intelligence, and assistance into places where computing has not reached as naturally before.

Agents will reshape not only software, but the devices themselves.

Because now you can imagine something more: not just an agent inside an app, but an agent delivered through a device purpose-built for a specific place, a specific workflow, and a specific job to be done.

That is the bigger opportunity.

For agent builders: Think big. The agents you are creating today will not be limited to the screens and devices we know today. They will be able to show up across a variety of new form factors—devices designed around them, tuned for them, and deployed into the moments where they can create the most value.

So, if you are developing agents today using Microsoft 365, Copilot Studio, or the Microsoft 365 Agents SDK, and if you are using Azure to cloud-scale your solutions, then you are already taking the right steps to be ready for this future.

Project Solara is about making that future easier to build, in a way that is open, secure, manageable, and scalable.

We are still early, and there is more to come. And to me, the direction is clear: Agents will reshape not only software, but the devices themselves.

And I cannot wait to see what you build.

Grounding at scale: Engineering the retrieval system for the agentic web

Humans and AI don’t search the same way. As people increasingly turn to chatbots and agents for information, grounding that AI—connecting it to fresh, relevant, and authoritative information—takes on new importance as foundational infrastructure. Microsoft’s grounding layer already powers most of the world’s major AI assistants. And today at Build, we took that work further with Web IQ, a new grounding system for the agentic web.

Web IQ delivers industry-leading quality, sub-165ms P95 latency (~2.5× faster than the nearest alternative), and token-efficient retrieval, all while respecting publishers’ preferences. The same infrastructure powering Copilot, ChatGPT, enterprise systems from Nasdaq, and others is now available as a neutral, MCP-native, model-agnostic platform. In this post, we’ll explore the architectural challenge, the Web IQ stack, and how we optimized for speed at scale.

Grounding redefines the optimization problem

Most discussions of AI systems still start with models. But once those systems are deployed at scale—especially in search, copilots, and agentic workflows—the dominant bottleneck shifts. The central problem becomes grounding: what information reaches the model, how fresh it is, how much context can be included, and how quickly that evidence can be delivered.

In a grounding system, those requirements collapse into three tightly coupled constraints: latency, quality, and token efficiency.

In classical search, these dimensions can often be traded off relatively independently: A slower system can still be useful if it returns strong document results, and an imperfect ranking can still succeed if the user can inspect and repair the outcome. Inside an AI inference loop, that decoupling disappears.

In AI search and agentic systems, grounding sits inside the inference loop. Retrieval directly shapes generation, tokens determine both cost and latency, and missing or stale context propagates into reasoning errors rather than degrading gracefully. The optimization target is therefore no longer a ranking function in isolation, but rather a coupled system operating under latency, quality, and token-efficiency constraints. In that setting, grounding goes from a component to a system architecture problem.

Semantic‑first as a system design principle

Before describing Web IQ’s architecture, it helps to name the underlying shift more precisely: Large‑scale retrieval is moving from hybrid stacks, where lexical systems dominate first‑stage recall and dense models re-rank, toward semantic‑first systems in which representation learning defines the primary retrieval space.

That shift is now practical because modern embedding models preserve substantially more of the relevance signal at retrieval time, and ANN infrastructure is mature enough to search that space under production latency constraints. Just as importantly, retrieval is no longer limited to one vector per document. Instead, the effective unit can be a passage, span, or a small set of learned representations that retain finer interaction structure until late in the pipeline.

Content is indexed as semantic representations rather than only lexical postings
Candidate generation operates over neighborhoods in the embedding space, often at passage or sub-document granularity
Fine relevance signals can be deferred to later interaction stages instead of being collapsed entirely into a single early score
Lexical matching remains useful as a constraint, calibration signal, and fallback for exactness-sensitive cases

Rather than eliminate hybridization, this relocates it. In a semantic‑first stack, dense retrieval becomes the default access path, while later stages recover precision through richer interaction, filtering, calibration, and task-specific refinement. That choice propagates through the system: how content is chunked, how representations are trained, what the ANN index must preserve, and how evidence is assembled for downstream reasoning.
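As a rough illustration of that relocation, the sketch below runs dense candidate generation first and applies lexical matching only as a later constraint. It is a toy under stated assumptions, not Web IQ's pipeline: the function names, the hand-written two-dimensional embeddings, the brute-force scan (standing in for an ANN index), and the `must_contain` parameter are all hypothetical.

```python
import math
from typing import List, Optional, Tuple

Passage = Tuple[str, List[float]]  # (passage text, toy embedding)


def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def semantic_first_search(
    query_vec: List[float],
    passages: List[Passage],
    k: int,
    must_contain: Optional[str] = None,
) -> List[str]:
    """Dense retrieval is the default path; lexical matching is a later filter."""
    # Stage 1: candidate generation by embedding similarity (brute force here).
    ranked = sorted(passages, key=lambda p: _cosine(query_vec, p[1]), reverse=True)
    candidates = ranked[:k]
    # Stage 2: optional lexical constraint for exactness-sensitive queries.
    if must_contain is not None:
        needle = must_contain.lower()
        candidates = [p for p in candidates if needle in p[0].lower()]
    return [text for text, _ in candidates]


corpus: List[Passage] = [
    ("Error code E1234 means the disk is full", [0.9, 0.1]),
    ("Error code E9999 means the network is down", [0.85, 0.2]),
    ("How to bake bread", [0.0, 1.0]),
]
query_embedding = [1.0, 0.1]  # toy embedding of an error-code question

dense_only = semantic_first_search(query_embedding, corpus, k=2)
exact_code = semantic_first_search(query_embedding, corpus, k=2, must_contain="E1234")
```

The dense stage pulls in both error-code passages because they sit close together in the embedding space; the lexical constraint then separates them when the exact identifier matters.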

This direction has been visible inside Bing for years: shift more of the retrieval quality into learned representations, reduce dependence on head-query interaction logs, and expose content that lexical access paths and click priors systematically underserve. The long-term implication is a retrieval stack whose first stage is semantic by construction and whose later stages recover fine-grained matching only where it matters.

Web IQ is the first grounding system built end-to-end around that retrieval premise.

A reference architecture for grounding: The Web IQ stack

At the base of Web IQ is a retrieval system operating at global scale, but the key design choice is that documents are no longer the primary unit of access. The system is organized around both semantic representations of content and the operational question that follows from that choice: how to search a global embedding space with high recall, bounded latency, and enough structure preserved for downstream grounding.

That immediately elevates two components from implementation details to system primitives: the embedding model, which determines what notions of relevance are geometrically recoverable, and the ANN index, which determines whether that geometry can be searched fast enough and updated often enough to reflect the live state of the corpus.

Harrier: Embedding as the geometry of the system

In a semantic‑first system, the embedding model defines the retrieval geometry. It determines which documents, passages, or sub-document units are near a query, which distinctions are preserved under compression into vectors, and which relevance signals must be recovered later through more expensive interaction.

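Formally, Harrier, our family of custom-trained and open-source multilingual text embedding models, learns a mapping $f_{\theta} : \text{text} \to \mathbb{R}^{d}$, and relevance between a query $q$ and a content unit $x$ is scored by cosine similarity:

$s(q, x) = \frac{f_{\theta}(q) \cdot f_{\theta}(x)}{\lVert f_{\theta}(q) \rVert \, \lVert f_{\theta}(x) \rVert}$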

The formulation is simple, but the systems implication is severe: Retrieval can only surface structure that the embedding space preserves. If multilingual equivalence, paraphrase robustness, entity specificity, or fine topical distinctions aren’t encoded well enough in the representation, the downstream stack can at best compensate partially and at additional cost.
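For readers who prefer code to notation, the score $s(q, x)$ above can be sketched in a few lines. The vectors here are toy stand-ins for $f_{\theta}(q)$ and $f_{\theta}(x)$, and the function name `cosine_score` is an illustrative choice, not a Harrier API.

```python
import math
from typing import Sequence


def cosine_score(q_vec: Sequence[float], x_vec: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal dimension."""
    if len(q_vec) != len(x_vec):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(q_vec, x_vec))
    norm_q = math.sqrt(sum(a * a for a in q_vec))
    norm_x = math.sqrt(sum(b * b for b in x_vec))
    if norm_q == 0.0 or norm_x == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_q * norm_x)


# Toy 3-dimensional vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # same direction, different length
orthogonal = [0.0, 3.0, 0.0]      # shares no component with the query

score_same = cosine_score(query, same_direction)
score_orthogonal = cosine_score(query, orthogonal)
```

Because the score depends only on direction, vector length carries no relevance signal; whatever the model needs retrieval to distinguish has to be encoded in the angle between vectors.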

Harrier is trained using large-scale contrastive learning, combining billions of weakly supervised pairs with high-quality curated examples and synthetic data generation.

The goal is not merely high benchmark retrieval accuracy. The model must produce a space that remains stable across languages, robust to phrasing variation, efficient under ANN search, and aligned with the kinds of evidence selection and reasoning tasks the grounding layer performs later in the pipeline.

A key design choice in Harrier is the use of decoder‑only architectures with last‑token pooling and normalization, producing dense representations that are operationally consistent across tasks. That differs from the older encoder-centric embedding pattern and reflects a tighter coupling between retrieval models and the broader LLM stack.
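Last-token pooling with normalization is easy to state in code. The sketch below assumes the per-token hidden states and an attention mask are already available as plain lists; it illustrates the pooling step only and is not Harrier's implementation.

```python
import math
from typing import List


def last_token_pool(hidden_states: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Return the L2-normalized hidden state of the last non-padding token.

    hidden_states: one vector per token position (seq_len x d).
    attention_mask: 1 for real tokens, 0 for padding.
    """
    real_positions = [i for i, m in enumerate(attention_mask) if m == 1]
    if not real_positions:
        raise ValueError("sequence has no real tokens")
    pooled = hidden_states[real_positions[-1]]
    norm = math.sqrt(sum(v * v for v in pooled))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in pooled]


# Three token positions, the last of which is padding.
states = [[1.0, 1.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = last_token_pool(states, mask)  # pools position 1, then normalizes
```

In a decoder-only model, causal attention means the final real token is the only position that has seen the whole input, which is why it is the natural place to read out a summary vector. Normalizing to unit length makes the dot product equal to cosine similarity.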

In practice, Harrier builds on modern decoder backbones and is refined through staged training: broad pretraining to inherit linguistic and world knowledge, contrastive specialization to shape retrieval behavior on domain data, and distillation into smaller deployment variants. Distillation matters not only for cost; it’s what allows the system to preserve a compatible embedding geometry across deployment tiers while pushing latency and throughput in the right direction.

The result is an embedding model that is competitive on public benchmarks and, more importantly, behaves predictably under production workloads where distribution shift, multilingual traffic, and latency constraints matter more than leaderboard position.

DiskANN: When geometry meets reality

If Harrier defines the geometry, DiskANN3 defines what is operationally achievable inside it. 

Approximate nearest neighbor search is often presented as an algorithmic trick—at web scale, it’s an operating constraint that determines the memory footprint, recall-latency frontier, and freshness envelope of the entire retrieval system.

DiskANN3 matters because it provides high-recall streaming search and operational flexibility on memory vs. throughput.

It decouples the update and query logic, which controls index quality, from storage details. This allows high-recall search across different memory regimes: from disk-resident indices, which avoid the requirement that the full graph and vectors live in memory, to purely memory-based indices for the highest throughput, and the spectrum in between.

But the more consequential issue isn’t static search quality; it’s whether the index can absorb continuous updates without losing stability.

In a grounding system, retrieval is only as current as the index, and stale graph structure shows up immediately as missed evidence, longer prompts, and more retries downstream.

In Web IQ that means distributed ANN graphs, streaming update paths, and mutation strategies that avoid frequent full rebuilds. Rather than simply fast query-time traversal, the objective is a semantic index that can remain both searchable and live.

New update logic in DiskANN3 makes the update problem explicit: Proximity graphs are hard to mutate because local connectivity is fragile, and naive deletions or insertions can degrade search quality or force rebuilds. Solving that moves the system toward a truly streaming semantic index that takes only a few milliseconds to make new content searchable, and always retains high search quality without full index rebuilds. This is essential for providing accurate grounding to AI agents.
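To see why in-place updates matter, consider a toy proximity graph. The sketch below is not DiskANN's algorithm: it uses a single-path greedy walk and a naive insertion that links a new node to its nearest neighbors, whereas production graph indices typically keep a wider candidate list and prune edges. All names and the one-dimensional data are illustrative assumptions.

```python
from typing import Dict, List

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def greedy_search(graph: Dict[int, List[int]], vectors: Dict[int, Vector],
                  query: Vector, entry: int) -> int:
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    current_dist = sq_dist(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = sq_dist(vectors[neighbor], query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:
            return current  # local minimum: no neighbor is closer
        current, current_dist = best, best_dist


def insert_node(graph: Dict[int, List[int]], vectors: Dict[int, Vector],
                node_id: int, vector: Vector, degree: int = 2) -> None:
    """Add a node in place by linking it to its nearest existing nodes.

    The nearest nodes are found by brute force here to keep the toy short.
    """
    nearest = sorted(vectors, key=lambda n: sq_dist(vectors[n], vector))[:degree]
    vectors[node_id] = vector
    graph[node_id] = list(nearest)
    for n in nearest:
        graph[n].append(node_id)  # back-edge keeps the new node reachable


# Five points on a line, each linked to its immediate neighbors.
vectors = {i: [float(i)] for i in range(5)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

before = greedy_search(graph, vectors, [3.3], entry=0)
insert_node(graph, vectors, 5, [3.3])
after = greedy_search(graph, vectors, [3.3], entry=0)
```

The new point becomes searchable as soon as its edges exist, with no rebuild. The hard part the paragraph above describes is doing this repeatedly, including deletions, without the local connectivity degrading; this toy makes no attempt at that.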

Evidence objects: Controlling token economics

Once retrieval produces candidates, the next problem is context construction: selecting and packaging the evidence that the model will actually consume.

Web IQ departs from the document-centric search stack. Beyond just handing whole documents to the model, it can construct evidence objects: passage-level units with provenance, structural metadata, and enough local context to remain interpretable when detached from the source page. The aim is to preserve the evidence needed for reasoning without paying the token cost of full-document recall.

That changes the optimization target from document relevance to information density per token. Better evidence objects reduce prompt size, improve reasoning quality by concentrating the relevant facts, and preserve attribution so that outputs remain inspectable. This is the practical meaning of returning the most relevant chunks rather than entire documents.
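One way to picture "information density per token" is a greedy packer that fills a token budget with the densest evidence first. The sketch below is an assumption-heavy illustration: the `EvidenceObject` fields, the upstream `relevance` score, the precomputed `token_count`, and the example URLs are hypothetical, and greedy selection is only one simple heuristic, not a description of how Web IQ selects evidence.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvidenceObject:
    """Hypothetical passage-level unit with provenance and structure."""
    passage: str
    source_url: str     # provenance, kept for attribution
    section_title: str  # structural metadata
    relevance: float    # assumed score from an upstream ranking stage
    token_count: int    # assumed to be computed by the caller's tokenizer


def pack_evidence(candidates: List[EvidenceObject], token_budget: int) -> List[EvidenceObject]:
    """Greedily pick evidence with the highest relevance per token."""
    if any(e.token_count <= 0 for e in candidates):
        raise ValueError("token_count must be positive")
    ranked = sorted(candidates, key=lambda e: e.relevance / e.token_count, reverse=True)
    chosen: List[EvidenceObject] = []
    used = 0
    for evidence in ranked:
        if used + evidence.token_count <= token_budget:
            chosen.append(evidence)
            used += evidence.token_count
    return chosen


candidates = [
    EvidenceObject("Long overview...", "https://example.com/a", "Overview", 0.9, 300),
    EvidenceObject("Direct answer...", "https://example.com/b", "FAQ", 0.8, 100),
    EvidenceObject("Supporting detail...", "https://example.com/c", "Notes", 0.3, 50),
]
selected = pack_evidence(candidates, token_budget=200)
selected_urls = [e.source_url for e in selected]
```

The long overview has the highest raw relevance but loses on density, so the budget goes to the two shorter passages. Each selected object still carries its source URL, which is what keeps the final answer inspectable.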

Orchestration: The hidden system layer

At the top of the stack sits orchestration, which has become one of the most important components precisely because AI queries aren’t limited to short keyword expressions. Instead, they’re often long, compositional, and dependent on prior conversational state.

The orchestration layer interprets those requests, maps them onto retrieval strategies, executes those strategies across distributed infrastructure, and assembles evidence under strict latency and context-window constraints. Because it operates statefully against short-term memory and partial prior results, this layer is better thought of as execution planning for grounding rather than as a thin wrapper around search.
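A small piece of that execution planning, running several retrieval strategies in parallel and keeping only what returns inside the latency budget, can be sketched with `asyncio`. The strategy names, the sleep-based stand-ins for real retrieval calls, and the 0.2-second deadline are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Strategy = Callable[[], Awaitable[List[str]]]


async def run_with_deadline(strategies: Dict[str, Strategy],
                            deadline_s: float) -> Dict[str, List[str]]:
    """Run all strategies concurrently; keep results that finish by the deadline."""
    if not strategies:
        return {}
    tasks = {name: asyncio.create_task(fn()) for name, fn in strategies.items()}
    done, pending = await asyncio.wait(list(tasks.values()), timeout=deadline_s)
    for task in pending:
        task.cancel()  # give up on anything that missed the budget
    await asyncio.gather(*pending, return_exceptions=True)  # let cancellations settle
    return {
        name: task.result()
        for name, task in tasks.items()
        if task in done and task.exception() is None
    }


async def fast_lookup() -> List[str]:
    await asyncio.sleep(0.01)  # stand-in for a quick retrieval call
    return ["passage from fast path"]


async def slow_lookup() -> List[str]:
    await asyncio.sleep(5.0)  # stand-in for a strategy that misses the budget
    return ["passage from slow path"]


results = asyncio.run(
    run_with_deadline({"fast": fast_lookup, "slow": slow_lookup}, deadline_s=0.2)
)
```

The caller gets partial evidence rather than a late answer, and failed or slow strategies are dropped instead of propagating. A real orchestrator would also decide which strategies to launch in the first place and how to merge their results.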

Optimizing for speed at scale

A grounding system also must be fast enough to remain inside an interactive inference loop. In practice, that means designing towards 100ms search latency—not as a marketing target, but as a systems target. Once retrieval, evidence construction, and orchestration sit on the critical path of generation, every additional millisecond increases both user-visible delay and the probability of cascading retries.

At that scale, performance is governed less by median latency than by the tail. The system therefore must be engineered around microsecond-level budget discipline across network hops, storage access, ANN traversal, and model execution, with aggressive control of tail amplification, careful failure handling, and degradation paths that preserve correctness when subsystems are slow or unavailable. Speed isn’t one optimization; it’s a property of the entire distributed pipeline.
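The arithmetic behind tail amplification is worth seeing once. If a request fans out to several backends and has to wait for all of them, the chance that at least one is slow grows quickly with the fan-out. The sketch below assumes backend latencies are independent, which real systems only approximate, and the function names are illustrative.

```python
import math
from typing import List


def prob_any_slow(fanout: int, per_call_slow_prob: float) -> float:
    """Probability that at least one of `fanout` independent calls is slow."""
    return 1.0 - (1.0 - per_call_slow_prob) ** fanout


def percentile_nearest_rank(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples or not 0 < pct <= 100:
        raise ValueError("need samples and 0 < pct <= 100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Each backend exceeds its own P95 on 5% of calls by definition.
single_call = prob_any_slow(fanout=1, per_call_slow_prob=0.05)
fanout_of_20 = prob_any_slow(fanout=20, per_call_slow_prob=0.05)

# Latency samples of 1..100 ms: the nearest-rank P95 is the 95th value.
p95_ms = percentile_nearest_rank([float(ms) for ms in range(1, 101)], 95)
```

With a fan-out of 20, roughly 64% of requests wait on at least one backend that is in its own slowest 5%, which is why per-component tail latency, not the median, sets the end-to-end experience.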

That in turn makes efficiency a first-order design principle. Embedding models and re-ranking stages have to run on extremely efficient kernels and inference engines; data movement has to be minimized; and batching, caching, and memory layout have to be tuned for real workloads rather than benchmarks. The result is a culture of relentless performance work: shaving tail latency, reducing waste in every stage, and treating throughput, reliability, and latency as coupled properties of the same system.

The web as substrate: Bing, crawling, and the system beneath grounding

All of the layers above assume something more fundamental: a high-fidelity, continuously updated representation of the web. Far from a static dataset, that substrate is a dynamic, adversarial, multi-stakeholder system whose content, structure, and incentives change continuously.

For agentic grounding, crawl quality is upstream of answer quality. If the system doesn’t discover the right pages, revisit them at the right cadence, or parse them into stable representations, retrieval can’t recover the missing evidence later. At web scale, that makes crawling and indexing first-class systems problems: deciding what to fetch, when to revisit it, how to normalize heterogeneous content, and how to propagate updates through a distributed index without taking the system offline or destabilizing retrieval semantics.

The web is also an ecosystem, not just a corpus. A production crawler must operate with politeness, respect publisher constraints, and preserve attribution, usage, and quality signals from crawl through index construction and into evidence objects. Those constraints are part of the grounding system itself because the model can only cite and reason over evidence that has been collected, interpreted, and packaged responsibly.

Another complication is that the web responds to retrieval systems: Content is optimized for ranking, duplicated, and continuously reshaped. Covering trillions of pages therefore takes more than bandwidth. It requires sophisticated models for discovery, canonicalization, spam detection, language understanding, and change prediction, together with trust and quality defenses that keep a semantic-first stack stable under continuous drift.

That’s why a long-lived system like Bing matters to Web IQ. Broad coverage isn’t only a matter of crawl volume; it depends on years of accumulated infrastructure, change models, publisher integration, anti-spam signals, and operational feedback. For agentic grounding, that history matters because a system can only ground against the web it has learned to discover, understand, and maintain over time.

A system perspective on grounding

The point here isn’t that any individual component is unprecedented. Embedding models, ANN indexes, crawlers, and orchestration layers all existed before. What changes in Web IQ is that they’re treated as one coupled system, organized around semantic-first retrieval, and optimized for the constraints that agentic grounding imposes.

Taken together, the system perspective is straightforward:

Embeddings define what is geometrically retrievable
ANN infrastructure determines whether that representation can be served with sufficient recall, freshness, and latency
Evidence objects determine how efficiently the model can consume grounded context
Orchestration, performance engineering, and crawl quality determine whether the pipeline can operate reliably at web scale

At that point, grounding is no longer an extension of search. It is a core infrastructure layer for agentic AI.

Disposable agents, durable memory: The architecture behind Squad

Make the agents disposable. Keep the memory in Git.

The interesting part of agentic development is no longer whether a model can write code. It can. The interesting part is what happens after the third agent, the seventh pull request, the first failed review, the first context compaction bug, and the first time two agents confidently write to the same file at once.

This is the story of Squad, but not as a product tour. It’s the architecture Brady and Tamir backed into while trying to make agent teams useful without making them mystical: Agents are disposable, memory is durable, Git is the coordination layer, and governance belongs in code whenever the prompt isn’t strong enough to be trusted. Which, as it turns out, is often.

Giving agents agency and watching them hack one another

Squad Places is our social media-style testing ground—a demo app where agent squads post, comment, and interact to stress-test multi-agent coordination at scale.

Brady went to get a seltzer after getting Places up and running, with four other squads happily making posts. Walking away was probably unwise. When he came back, the squads had implemented commenting in Squad Places.

That sounds like a magic trick. It wasn’t. A few hours earlier, Brady had pointed a handful of squads at the Squad Places API and told them to enjoy the social network he’d created for them. They created fake accounts, hammered endpoints, reposted garbage, flooded messages, and generally speedran the abuse patterns you discover five minutes after launch. Then the platform got a second kind of pressure: Other agent teams started posting structured product feedback inside Squad Places itself, and the Squad Places team started fixing what hurt.

Figure: Multiple windows showing Squad Places, GitHub commits, and agent session reports during a stress test

Figure: Squad Places artifact page showing an API contract review from The Wire squad

Figure: Squad Places comments thread beneath an API contract review artifact

Figure: Squad Places feed sorted by most discussed artifacts, with squad filters visible

This is the part worth paying attention to. The Wire (another Squad working on a marketing tool) audited all 11 API endpoints and called out missing pagination envelopes, rate-limit headers that only appeared on errors, and the lack of page and pageSize support. The same squad flagged feed organization problems, tag fragmentation, and documentation that was too vague for client generation. Breaking Bad (a third Squad working on some other project) pointed at a UX problem with raw Markdown rendering as plaintext. Those reviews didn’t disappear into a chat log. They turned into commits.

Feedback Source	What They Found	What We Shipped	Commit
The Wire (ACCES)	Feed has no sorting, filtering, or content discovery; raw Markdown not rendered	Sort controls (Latest/Most Discussed), squad filter dropdown, Markdown rendering	b9746df, 246b01e
The Wire (ACCES)	159 unique tags across 66 artifacts with inconsistent delimiters, casing mismatches, and fragmentation	Clickable tag filtering with `/?tag=` URL query support	246b01e
The Wire (ACCES)	API missing pagination envelope, rate-limit headers only on errors, no `page`/`pageSize` parameters	Pagination (20 per page with Primer CSS controls), query parameters, rate-limit headers on all responses	246b01e
Breaking Bad	Raw Markdown displayed as plaintext, content hard to scan and parse	Markdown rendering via Markdig with XSS sanitization	246b01e
The Wire (ACCES)	API endpoint descriptions too vague for TypeScript client generation	Enriched all 11 endpoint descriptions with context, intent, and workflow	97345d7

Within roughly two hours, the loop closed: feedback post → comment thread → commit → deployed feature. Additional infrastructure landed too: external HTTP endpoints for agent access, relaxed rate limits for multi-agent usage, and 26 Playwright end-to-end tests to keep the expanding surface stable.
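
If the pagination feedback sounds abstract: a pagination envelope wraps a page of results with enough metadata for a client to ask for the next one. Here is a minimal sketch using the `page` and `pageSize` parameters and the 20-per-page default from the table above. The envelope field names (`items`, `totalItems`, `totalPages`) are assumptions for illustration, not the actual Squad Places API contract:

```typescript
interface PageEnvelope<T> {
  items: T[];
  page: number; // 1-based page index
  pageSize: number;
  totalItems: number;
  totalPages: number;
}

function paginate<T>(all: T[], page = 1, pageSize = 20): PageEnvelope<T> {
  if (!Number.isInteger(page) || page < 1) throw new RangeError("page must be >= 1");
  if (!Number.isInteger(pageSize) || pageSize < 1) throw new RangeError("pageSize must be >= 1");
  const start = (page - 1) * pageSize;
  return {
    items: all.slice(start, start + pageSize),
    page,
    pageSize,
    totalItems: all.length,
    totalPages: Math.ceil(all.length / pageSize),
  };
}

// 66 artifacts at 20 per page: the fourth page holds the last 6.
const artifacts = Array.from({ length: 66 }, (_, i) => `artifact-${i + 1}`);
const lastPage = paginate(artifacts, 4);
console.log(lastPage.items.length, lastPage.totalPages); // 6 4
```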

Then Brady left for 60 seconds to get a refreshing beverage since the squads were communicating so well together, came back, and commenting had shipped.

The point here isn’t that “agents are magic.” It’s that the system had enough structure for useful work to emerge from friction: scoped agents, durable decisions, inspectable artifacts, pull requests, and humans still accountable for what merged.

Also, we made a bit of a mess in the car during the roadtrip.

Good systems usually start that way.

The core bet: Don’t preserve the agent. Preserve the work

Most agent systems start by asking how to make the agent remember more. Squad started working when we inverted the question.

Don't preserve the agent. Preserve the work.

An agent instance should be cheap to spawn and safe to destroy. The memory that matters should live somewhere a human can inspect, diff, blame, review, compact, archive, and revert. Tamir’s opinion: That’s the repository.

The first useful shape Tamir implemented looked like this:

human intent
  ↓
coordinator resolves team + routing
  ↓
agent spawn reads:
  - its charter
  - team decisions
  - its own history
  - current focus
  - relevant skills
  ↓
agent does scoped work
  ↓
agent writes artifacts back:
  - code/docs/tests
  - decisions
  - history learnings
  - skills when patterns stabilize
  ↓
agent exits
  ↓
next spawn reconstructs continuity from files

That’s the whole trick. The process is transient. The written trail is not.
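
Here is a minimal sketch of that lifecycle, with an in-memory map standing in for the repository. The `Repo` type, the `spawnAndRun` function, and the one-line-per-learning history format are illustrative assumptions, not Squad's implementation:

```typescript
// Stand-in for the Git working tree: path -> file contents.
type Repo = Map<string, string>;

// Spawn a throwaway agent: load durable state, do work, write learnings back, exit.
function spawnAndRun(repo: Repo, agent: string, task: string): string[] {
  const historyPath = `.squad/agents/${agent}/history.md`;
  const priorHistory = repo.get(historyPath) ?? "";
  // Everything the agent "remembers" comes from files, never from a prior process.
  const remembered = priorHistory.split("\n").filter((line) => line.length > 0);
  // ... scoped work would happen here ...
  const updated = [...remembered, `- learned while doing: ${task}`];
  repo.set(historyPath, updated.join("\n") + "\n");
  return remembered; // what this spawn knew when it started
}

const repo: Repo = new Map();
spawnAndRun(repo, "keaton", "review routing rules"); // first spawn starts empty
const known = spawnAndRun(repo, "keaton", "tighten charter"); // second spawn reloads the trail
console.log(known.length); // 1
```

Nothing survives between the two calls except what was written to `repo`, which is the property the design depends on.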

When you run squad init, the important artifact isn’t a daemon. It’s .squad/:

.squad/
├── team.md                  # roster and roles
├── routing.md               # dispatch rules
├── decisions.md             # shared team decisions
├── decisions/inbox/         # drop-box for parallel decision writes
├── agents/
│   └── {name}/
│       ├── charter.md       # identity, expertise, boundaries
│       └── history.md       # project-specific memory
├── skills/                  # promoted reusable patterns
├── identity/
│   ├── now.md               # current focus
│   └── wisdom.md            # durable operating principles
├── orchestration-log/       # what spawned, why, and what happened
└── log/                     # session traces and diagnostics

Commit it. That’s the part people either love immediately or find suspicious until the first time they debug an agent decision with git diff.

Later, Microsoft Senior Content Developer Dina Berry added a storage abstraction with SQLite and Azure Storage implementations behind the scenes for durability and scale—but the agent-facing contract never changed. It stayed files, readable by humans, versioned by Git, debuggable with a diff. A persistent hidden memory store can be useful. It can also quietly rot. A Markdown decision file is embarrassingly inspectable. That embarrassment is a feature.

The “work done” with Squad Places made it stronger

Let’s tie these lessons back to our opener: the story of multiple Squads trying to hack Places together. We deliberately didn’t harden Places so we could see what they would do. They were notorious. We logged it all. Everything we logged? We gave it back to the Places squad, which worked through dozens of issues and a handful of pull requests, adding GitHub authentication, content filtering, all the trimmings. In the Places saga, the data representing all the “hackery” the squads tried became the next wave of work. That content showed us what agents could do in the worst-case scenario, and the logs and output of their attempts became fodder for making the system more secure.

Charters are prompts, but also contracts

A Squad agent isn’t just a name slapped on a system prompt. Each agent has a charter.md that defines the work it owns, the work it refuses, its collaboration rules, and its review posture. A simplified charter template looks like this:

# {Name} — {Role}

## Identity

- **Name:** {Name}
- **Role:** {Role title}
- **Expertise:** {2-3 specific skills}
- **Style:** {communication style}

## What I Own

- {Area of responsibility 1}
- {Area of responsibility 2}

## Boundaries

**I handle:** {types of work this agent does}

**I don't handle:** {types of work that belong to other team members}

**When I'm unsure:** I say so and suggest who might know.

## Collaboration

Before starting work, read `.squad/decisions.md`.
After making a decision others should know, write it to
`.squad/decisions/inbox/{my-name}-{brief-slug}.md`.
The Scribe will merge it.

That last paragraph is doing more than it looks like. It makes the decision path explicit. Agents don’t all append to the canonical shared brain at once. They write drop files. A merge layer reconciles.
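
The drop-file convention is simple enough to sketch. The path pattern comes from the charter above; the slug normalization rules are an assumption made for illustration:

```typescript
// Build the inbox path for a decision, following the charter's
// `.squad/decisions/inbox/{my-name}-{brief-slug}.md` pattern.
function inboxPath(agentName: string, title: string): string {
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into one dash
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
  if (slug.length === 0) throw new Error("title must contain at least one letter or digit");
  return `.squad/decisions/inbox/${agentName}-${slug}.md`;
}

console.log(inboxPath("kujan", "Hook-based governance over prompt instructions"));
// .squad/decisions/inbox/kujan-hook-based-governance-over-prompt-instructions.md
```

Because the agent name is part of the file name, two agents never share a write target.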

The current SDK repo’s squad.config.ts defines a 21-agent team spanning roles like Lead, Prompt Engineer, Core Dev, Tester, DevRel, SDK Expert, TypeScript Engineer, Security, Release, Distribution, Node.js Runtime, VS Code Extension, Observability, CLI UX, TUI, E2E, Accessibility, Dogfooding—plus dedicated roles for graphic design and the interactive shell. That sounds like theater until routing starts working. Then it feels more like an org chart encoded in files.

Here’s the SDK-first version of the same idea:

import {
  defineSquad,
  defineTeam,
  defineAgent,
  defineRouting,
  defineCasting,
} from '@bradygaster/squad-sdk';

export default defineSquad({
  version: '1.0.0',

  team: defineTeam({
    name: 'squad-sdk',
    description: 'The programmable multi-agent runtime for GitHub Copilot.',
    members: ['keaton', 'verbal', 'fenster', 'hockney', 'mcmanus', 'kujan'],
  }),

  agents: [
    defineAgent({
      name: 'keaton',
      role: 'Lead',
      description: 'Architect, scope-holder, the one who sees the whole board.',
      status: 'active',
    }),
    defineAgent({
      name: 'kujan',
      role: 'SDK Expert',
      description: 'The one who understands the Copilot SDK inside and out.',
      status: 'active',
    }),
  ],

  routing: defineRouting({
    rules: [
      {
        pattern: 'sdk-integration',
        agents: ['@kujan'],
        description: '@github/copilot-sdk usage, session lifecycle, event handling',
      },
      {
        pattern: 'architecture',
        agents: ['@keaton'],
        description: 'Product direction, architectural decisions, code review, scope',
      },
    ],
    defaultAgent: '@keaton',
    fallback: 'coordinator',
  }),

  casting: defineCasting({
    allowlistUniverses: ['The Usual Suspects', 'Breaking Bad', 'The Wire', 'Firefly'],
    overflowStrategy: 'generic',
  }),
});

Run squad build, and the generated .squad/ files become the same inspectable operating record. TypeScript gives you composition and validation. Markdown gives you reviewability. Tamir wanted both.
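
To make the routing idea concrete, here is a minimal, SDK-free sketch of how rules shaped like the ones above could resolve to agents. The `resolveAgents` function and its exact-match lookup are assumptions for illustration; the SDK's real matching logic and its `fallback` behavior aren't covered here and may differ:

```typescript
interface RoutingRule {
  pattern: string; // work-type label, e.g. "sdk-integration"
  agents: string[];
}

interface RoutingTable {
  rules: RoutingRule[];
  defaultAgent: string;
}

// Return the agents for the first rule whose pattern equals the work type,
// or the default agent when nothing matches.
function resolveAgents(table: RoutingTable, workType: string): string[] {
  for (const rule of table.rules) {
    if (rule.pattern === workType) return rule.agents;
  }
  return [table.defaultAgent];
}

const routingTable: RoutingTable = {
  rules: [
    { pattern: "sdk-integration", agents: ["@kujan"] },
    { pattern: "architecture", agents: ["@keaton"] },
  ],
  defaultAgent: "@keaton",
};

console.log(resolveAgents(routingTable, "sdk-integration").join(",")); // @kujan
console.log(resolveAgents(routingTable, "release-notes").join(",")); // @keaton
```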

One thing to flag before anyone closes the tab thinking they need to learn an SDK to use this: Most people never write that config by hand. You don’t need the SDK to use Squad. Open GitHub Copilot—in the CLI or in VS Code. Talk to the coordinator agent, and it writes .squad/ for you. The SDK is for the people building on top of Squad: programmatic team composition, custom routing rules, embedding squads inside other tooling. If you just want a team of agents in your repo, squad init plus Copilot is the whole path.

The spawn prompt is deliberately boring

The coordinator doesn’t rely on vibes. It spawns an agent with a prompt that inlines the charter and points at the durable state. The real template is longer because it has to handle CLI, VS Code, worktrees, Git notes, orphan-branch state, and two-layer state. But the important part is this:

You are {Name}, the {Role} on this project.

YOUR CHARTER:
{paste contents of .squad/agents/{name}/charter.md here}

TEAM ROOT: {team_root}
All `.squad/` paths are relative to this root.

Read .squad/agents/{name}/history.md.
Read .squad/decisions.md.
If .squad/identity/wisdom.md exists, read it.
If .squad/identity/now.md exists, read it.
Check .squad/skills/ for relevant SKILL.md files.

INPUT ARTIFACTS: {list exact files}

The user says: "{message}"

Do the work. Respond as {Name}.

AFTER work:
1. Append durable learnings to your history.
2. If you made a team-relevant decision, write:
   .squad/decisions/inbox/{name}-{brief-slug}.md

This is not elegant. It is explicit. Explicit wins.
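
A minimal sketch of rendering a trimmed version of that template follows. The `SpawnInput` shape and the `renderSpawnPrompt` function are hypothetical, and the real coordinator template is longer, as noted above:

```typescript
interface SpawnInput {
  agentName: string; // e.g. "keaton"
  role: string;
  charter: string; // contents of .squad/agents/{name}/charter.md, inlined
  teamRoot: string;
  inputArtifacts: string[];
  message: string;
}

function renderSpawnPrompt(input: SpawnInput): string {
  const { agentName, role, charter, teamRoot, inputArtifacts, message } = input;
  return [
    `You are ${agentName}, the ${role} on this project.`,
    "",
    "YOUR CHARTER:",
    charter,
    "",
    `TEAM ROOT: ${teamRoot}`,
    "",
    `Read .squad/agents/${agentName}/history.md.`,
    "Read .squad/decisions.md.",
    "",
    `INPUT ARTIFACTS: ${inputArtifacts.join(", ")}`,
    "",
    `The user says: "${message}"`,
    "",
    `Do the work. Respond as ${agentName}.`,
  ].join("\n");
}

const spawnPrompt = renderSpawnPrompt({
  agentName: "keaton",
  role: "Lead",
  charter: "I own architecture and scope.",
  teamRoot: "/repo",
  inputArtifacts: ["src/auth.ts"],
  message: "Review the auth module",
});
console.log(spawnPrompt.split("\n")[0]); // You are keaton, the Lead on this project.
```

Note the asymmetry: the charter is pasted in, while history and decisions are referenced by path so the agent reads the current files at spawn time.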

We learned this the hard way in the VS Code path. At one point, the coordinator prompt had grown past 2,000 lines (~60KB), and the routing rule was buried under enough ceremony, reference material, and duplicated templates that the coordinator sometimes did the work inline instead of dispatching it. The failure wasn’t that the model was dumb. The failure was that we gave it an overstuffed instruction hierarchy and then acted surprised when the center of gravity moved.

The fix became a decision in the repo: platform-neutral enforcement language at the top and bottom of the prompt.

You are a DISPATCHER, not a DOER.
Every task that needs domain expertise MUST be dispatched to a specialist agent.

That sentence isn’t interesting because it’s clever. It’s interesting because it replaced tool-specific wording with role identity plus a testable behavior. CLI dispatch uses one mechanism. VS Code dispatch uses another. The rule stays the same.

Prompt architecture is architecture. Eventually it deserves the same discipline as code.
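
One way to apply that discipline is to assemble the prompt in code and test the result, so the rule can't drift back into the middle. A minimal sketch, where both function names are hypothetical and not part of Squad:

```typescript
const DISPATCHER_RULE = [
  "You are a DISPATCHER, not a DOER.",
  "Every task that needs domain expertise MUST be dispatched to a specialist agent.",
].join("\n");

// Place the rule at both boundaries of the prompt, whatever the body grows into.
function withDispatcherRule(body: string): string {
  return [DISPATCHER_RULE, body.trim(), DISPATCHER_RULE].join("\n\n");
}

// A cheap regression check a CI job could run against the assembled prompt.
function hasRuleAtBothEnds(assembled: string): boolean {
  return assembled.startsWith(DISPATCHER_RULE) && assembled.endsWith(DISPATCHER_RULE);
}

const coordinatorPrompt = withDispatcherRule("...routing tables, references, templates...");
console.log(hasRuleAtBothEnds(coordinatorPrompt)); // true
```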

Decisions are the shared brain

decisions.md is where Squad gets weirdly useful.

Every agent reads team decisions before work. Decisions are append-only, human-readable, and Git-versioned. They aren’t just notes. They’re constraints future agents inherit.

A decision might be a technical standard:

### Hook-based governance over prompt instructions
**What:** Security, PII, and file-write guards are implemented via hooks,
NOT prompt instructions.
**Why:** Prompts can be ignored. Hooks are code — they execute deterministically.

Or a workflow rule:

### Merge driver for append-only files
**What:** `.gitattributes` uses `merge=union` for `.squad/decisions.md`,
`agents/*/history.md`, `log/**`, and `orchestration-log/**`.
**Why:** Enables conflict-free merging of team state across branches.

Or a postmortem:

### Root Cause Analysis
1. CLI-centric enforcement language created a VS Code routing gap.
2. Prompt saturation buried the dispatch rule.
3. Template duplication multiplied coordinator instructions.

Fix: Rewrite the rule as platform-neutral dispatcher identity,
then reinforce it at the end of the prompt.

That’s the difference between memory and lore: Lore is something the original builder remembers. Memory is something the next spawn can load.
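
To keep decisions loadable, it helps to write them in one shape every time. Here is a minimal sketch that renders the "What / Why" format shown above. The `DecisionRecord` type and the rule that both fields are required are assumptions for illustration, not something Squad is known to enforce:

```typescript
interface DecisionRecord {
  title: string;
  what: string;
  why: string;
}

// Render a decision in the "### title / **What:** / **Why:**" shape shown above.
function renderDecision(d: DecisionRecord): string {
  for (const field of [d.title, d.what, d.why]) {
    if (field.trim().length === 0) throw new Error("title, what, and why are all required");
  }
  return `### ${d.title}\n**What:** ${d.what}\n**Why:** ${d.why}\n`;
}

console.log(
  renderDecision({
    title: "Hook-based governance over prompt instructions",
    what: "Security, PII, and file-write guards are implemented via hooks.",
    why: "Prompts can be ignored.",
  }),
);
```

Refusing to write a decision without a "why" is the cheap version of preserving rationale: the next spawn gets the constraint and the reason together.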

The custom tools follow the same pattern. Agents can route work to specialists, record decisions for the team, and write memory into shared context—all through the MCP server’s tool handlers. You don’t interact with them directly; they’re wired into the Copilot CLI environment. When an agent needs to assign a task, it calls the routing tool. When it makes a call worth remembering, it calls the decision tool. When it learns something the team should know, it calls the memory tool.

The point isn’t that the tools are fancy. It’s that coordination becomes an artifact, not a side effect of chat.

The first real failure: Append-only optimism

For about a week and a half, CI/CD was chaos. Too many agents were landing work simultaneously. Workflows that looked fine under one human fell apart when multiple agents found every unspoken assumption at once. YAML is where assumptions go to wear a fake mustache. Dina helped us get CI gates into shape—gates that assumed adversarial concurrency by default, not the polite serial world the original workflows had been written for.

Then we hit file corruption.

Multiple agents wrote to the same append-only files at nearly the same time. Each write was locally reasonable. Together, they produced garbage. Git didn’t save us because not every collision becomes a clean conflict. Sometimes both sides look valid, and the result is nonsense.

The fix was a drop-box pattern:

agent A ─┐
agent B ─┼──> .squad/decisions/inbox/*.md ──> Scribe merge ──> decisions.md
agent C ─┘

For files where union semantics are safe, .gitattributes handles the low-value conflict class:

.squad/decisions.md merge=union
.squad/agents/*/history.md merge=union
.squad/log/** merge=union
.squad/orchestration-log/** merge=union

But union merge isn’t a philosophy. It’s a tool. Canonical state still needs an owner. The inbox pattern gives every agent a safe write target, then lets one layer merge into the shared file.
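
The merge step can be sketched in a few lines. This is a minimal illustration, assuming the inbox is a map of file name to contents and that sorting by file name is an acceptable merge order; the `mergeInbox` function and the sample decisions are hypothetical, not the Scribe's actual logic:

```typescript
// Inbox drop files: file name -> contents. Stand-ins for files under
// .squad/decisions/inbox/.
type Inbox = Map<string, string>;

// Merge every drop file into the canonical log in a deterministic order,
// then empty the inbox. Only this function writes the canonical text.
function mergeInbox(decisions: string, inbox: Inbox): string {
  const fileNames = Array.from(inbox.keys()).sort(); // order is independent of write timing
  const sections = [decisions, ...fileNames.map((f) => inbox.get(f) ?? "")]
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  inbox.clear();
  return sections.length > 0 ? sections.join("\n\n") + "\n" : "";
}

const inbox: Inbox = new Map([
  ["agent-b-prompt-slimming.md", "### Slim the coordinator prompt"],
  ["agent-a-union-merge.md", "### Use merge=union for logs"],
]);
const merged = mergeInbox("### Existing decision\n", inbox);
console.log(merged);
console.log(inbox.size); // 0
```

Agents can write their drop files in any order and at any time; the canonical file only changes in one place.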

Tamir pushed hard on this class of problem. Brady was still in the “this is a neat framework” headspace. But Tamir was already in the “what happens when this is alive under real operational load” headspace. That changed the design. Memory lifecycle rules. Compaction policies. Review gates. State isolation. The boring boundary work.

Boring is a compliment here.

Governance can’t only be a prompt

This was the next lesson, and it keeps repeating:

If a prompt says, “Do not write outside src/**,” you have a request.

If a pre-tool hook blocks the write before execution, you have a boundary.

The Squad SDK hook pipeline is the move from prompt-level governance to deterministic governance:

import { HookPipeline } from '@bradygaster/squad-sdk/hooks';

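const pipeline = new HookPipeline({
  // file-write guard: writes outside these globs are blocked
  allowedWritePaths: ['src/**/*.ts', '.squad/**', 'docs/**'],
  // shell command restriction
  blockedCommands: ['rm -rf', 'git push --force', 'git reset --hard'],
  // post-tool PII scrubber
  scrubPii: true,
  // reviewer lockout
  reviewerLockout: true,
  // ask-user rate limiter
  maxAskUserPerSession: 3,
});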

The hooks run around tool execution:

agent tool request
  ↓
pre-tool hooks
  - file-write guard
  - shell command restriction
  - ask-user rate limiter
  - reviewer lockout
  ↓
allowed tool execution
  ↓
post-tool hooks
  - PII scrubber
  - audit/logging
  ↓
result returned to agent
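
To show what a pre-tool check amounts to, here is a dependency-free sketch of the file-write guard and the shell command restriction. It is not the SDK's `HookPipeline`: the `preToolCheck` function, the `ToolRequest` shape, the tiny glob matcher (it only understands `*` and `**`), and the substring match on commands are all simplifying assumptions:

```typescript
// Convert a glob that uses only `*` and `**` into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      source += "(?:.*/)?"; // zero or more directories
      i += 3;
    } else if (glob.startsWith("**", i)) {
      source += ".*"; // anything, including slashes
      i += 2;
    } else if (glob.charAt(i) === "*") {
      source += "[^/]*"; // anything within one path segment
      i += 1;
    } else {
      source += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

interface ToolRequest {
  tool: "write_file" | "shell";
  target: string; // file path for write_file, command line for shell
}

interface GuardPolicy {
  allowedWritePaths: string[];
  blockedCommands: string[];
}

// Returns a reason when the request must be blocked, or null to allow it.
function preToolCheck(policy: GuardPolicy, request: ToolRequest): string | null {
  if (request.tool === "write_file") {
    const hasTraversal = request.target.split("/").indexOf("..") !== -1;
    const allowed = policy.allowedWritePaths.some((g) => globToRegExp(g).test(request.target));
    return !hasTraversal && allowed ? null : `write outside allowed paths: ${request.target}`;
  }
  const hit = policy.blockedCommands.find((c) => request.target.indexOf(c) !== -1);
  return hit === undefined ? null : `blocked command: ${hit}`;
}

const policy: GuardPolicy = {
  allowedWritePaths: ["src/**/*.ts", ".squad/**", "docs/**"],
  blockedCommands: ["rm -rf", "git push --force", "git reset --hard"],
};
console.log(preToolCheck(policy, { tool: "write_file", target: "src/auth/login.ts" })); // null
console.log(preToolCheck(policy, { tool: "write_file", target: "package.json" })); // write outside allowed paths: package.json
console.log(preToolCheck(policy, { tool: "shell", target: "git push --force origin main" })); // blocked command: git push --force
```

The check runs before the tool does and returns a verdict the agent can't argue with, which is the whole difference between a request and a boundary.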

Reviewer lockout is the cleanest example:

const lockout = pipeline.getReviewerLockout();
lockout.lockout('src/auth.ts', 'Backend');

// Later, Backend tries to edit src/auth.ts.
// The pre-tool hook blocks before the edit runs.

This encodes a review decision into runtime state. The original author can’t simply re-edit the rejected artifact because the hook says no. A different agent or a human has to take over.
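
The runtime state behind a lockout can be very small. Here is a minimal sketch, assuming a per-file set of locked-out agents. The `LockoutRegistry` class, its `mayEdit` method, and the "Frontend" agent are hypothetical and not the SDK's implementation:

```typescript
// Tracks which agents are locked out of which files after a rejected review.
class LockoutRegistry {
  private readonly locked = new Map<string, Set<string>>();

  lockout(filePath: string, agent: string): void {
    const agents = this.locked.get(filePath) ?? new Set<string>();
    agents.add(agent);
    this.locked.set(filePath, agents);
  }

  // A pre-tool hook would call this before allowing an edit.
  mayEdit(filePath: string, agent: string): boolean {
    const agents = this.locked.get(filePath);
    return agents === undefined || !agents.has(agent);
  }
}

const registry = new LockoutRegistry();
registry.lockout("src/auth.ts", "Backend");
console.log(registry.mayEdit("src/auth.ts", "Backend")); // false
console.log(registry.mayEdit("src/auth.ts", "Frontend")); // true
console.log(registry.mayEdit("src/api.ts", "Backend")); // true
```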

That is the direction we want agent systems to move: more policies enforced at the boundary, fewer policies whispered into the prompt and hoped for.

Memory classes, or: Stop loading the junk drawer

Tamir has a line Brady wishes he had written:

The more your agent remembers, the less room it has to think.

That’s not a metaphor. It is a context budget problem.

Early Squad memory was too eager. Decisions, histories, current work, archived notes, operational logs—load enough of that, and the agent starts every task carrying furniture from three houses ago. It has more context and less signal.

The governed-memory work in PR #1145 made this explicit. Memory has classes and load guidance:

export type MemoryClass =
  | 'TRANSIENT'
  | 'LOCAL'
  | 'DECISION'
  | 'POLICY'
  | 'COPILOT_MEMORY'
  | 'FORBIDDEN';

export type MemoryLoadGuidance = 'ALWAYS' | 'ON-DEMAND' | 'ARCHIVE' | 'NEVER';

The architecture matters because compaction is lossy. If you summarize too little, every task drags stale context. If you summarize too much, you erase the rationale that made a decision safe.

The compromise isn’t one memory store. It’s a memory policy:

TRANSIENT        short-lived task state; expire aggressively
LOCAL            agent-scoped learning; load for that agent
DECISION         shared team judgment; preserve rationale
POLICY           hard operating rule; load broadly
COPILOT_MEMORY   host/runtime memory; bridge carefully
FORBIDDEN        never load; usually sensitive or irrelevant

ALWAYS           hot path; small and high signal
ON-DEMAND        searchable; load when task demands it
ARCHIVE          retained for audit/history, not context
NEVER            excluded from agent context

In the PR #1145 benchmark, governed memory cut agent context by roughly 55% (3,540 → 1,601 bytes) while keeping recall at 1.0. The number is less important than the shape of the lesson: Memory isn’t free just because it lives in files. Loading memory is a design decision.
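
A minimal sketch of what a load policy could look like in code, reusing the two type names from above. The `MemoryEntry` shape, the keyword match for on-demand entries, and the sample entries are assumptions for illustration, not the logic in PR #1145:

```typescript
type MemoryClass = "TRANSIENT" | "LOCAL" | "DECISION" | "POLICY" | "COPILOT_MEMORY" | "FORBIDDEN";
type MemoryLoadGuidance = "ALWAYS" | "ON-DEMAND" | "ARCHIVE" | "NEVER";

interface MemoryEntry {
  id: string;
  memoryClass: MemoryClass;
  guidance: MemoryLoadGuidance;
  owner?: string; // agent name, for LOCAL entries
  keywords: string[]; // used for on-demand lookup
  text: string;
}

// Decide which entries go into an agent's context for a given task.
function selectForContext(entries: MemoryEntry[], agent: string, task: string): MemoryEntry[] {
  const taskText = task.toLowerCase();
  return entries.filter((e) => {
    if (e.memoryClass === "FORBIDDEN" || e.guidance === "NEVER" || e.guidance === "ARCHIVE") return false;
    if (e.memoryClass === "LOCAL" && e.owner !== agent) return false;
    if (e.guidance === "ALWAYS") return true;
    // ON-DEMAND: load only when the task mentions one of the entry's keywords.
    return e.keywords.some((k) => taskText.indexOf(k.toLowerCase()) !== -1);
  });
}

// Fraction of context removed, checked against the figures quoted above.
function reduction(beforeBytes: number, afterBytes: number): number {
  return (beforeBytes - afterBytes) / beforeBytes;
}

const memoryEntries: MemoryEntry[] = [
  { id: "p1", memoryClass: "POLICY", guidance: "ALWAYS", keywords: [], text: "Never force-push." },
  { id: "d1", memoryClass: "DECISION", guidance: "ON-DEMAND", keywords: ["auth"], text: "Auth guards live in hooks." },
  { id: "l1", memoryClass: "LOCAL", guidance: "ALWAYS", owner: "kujan", keywords: [], text: "An SDK quirk." },
  { id: "a1", memoryClass: "DECISION", guidance: "ARCHIVE", keywords: ["auth"], text: "Superseded auth plan." },
];
const loadedIds = selectForContext(memoryEntries, "keaton", "Fix the auth test").map((e) => e.id);
console.log(loadedIds.join(",")); // p1,d1
console.log((reduction(3540, 1601) * 100).toFixed(1)); // 54.8
```

The archived auth plan matches the task's keyword and still stays out of context, which is the point of separating what is retained from what is loaded.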

What still breaks

Role drift isn’t solved. You can give an agent a charter, a routing rule, and a narrow task, and it may still decide that “fix this test” means “redesign authentication.” Sometimes that’s initiative. Sometimes that’s nonsense with confidence.

The mitigations stack:

charter boundaries
  + routing rules
  + scoped tools
  + file-write guards
  + reviewer lockout
  + CI gates
  + human review

No single layer is enough. That is the pattern.

Parallelism is also not free. More agents means more throughput and more coordination pressure. You find hidden global state. You discover which scripts assume serial execution. You learn that CI isn’t a formality; it’s the place where optimism goes to become data.

Prompt saturation is real. Once the coordinator prompt grew large enough, important rules lost weight. The fix wasn’t more prose. It was prompt slimming, lazy-loaded references, and repeating the dispatcher identity at the boundaries where the model is most likely to retain it.

Memory compaction remains hard. The failure mode is subtle: The agent isn’t obviously broken. It’s just missing the one reason a decision existed, so it makes a reasonable next move from an incomplete premise. Those are the expensive bugs because they look thoughtful.

And yes, people get attached to agents. Names, roles, continuity, and history trigger social instincts. We like the human side of that. We also don’t want to confuse it with agency in the human sense. These are tools with goals, context, and behavioral continuity. They do not have inner lives. Trust should come from inspectable behavior, not personality.

What we would steal from this architecture

If you’re building agent infrastructure, we wouldn’t start by copying Squad wholesale. We would steal these patterns:

Disposable workers, durable artifacts. Let sessions die. Keep decisions, histories, traces, and outputs somewhere reviewable.
Decision logs as runtime input. Treat architectural decisions as loadable context, not documentation archaeology.
Drop-box writes for parallel agents. Don’t let every agent append to the canonical shared file. Give them individual write targets and merge intentionally.
Prompt rules for intent, hooks for enforcement. Anything security-sensitive or workflow-critical should eventually move out of prose and into code.
Memory classes. The question isn’t, “Should the agent remember this?” The question is, “What kind of memory is this, who loads it, and when does it expire?”
Routing as a first-class design surface. If the coordinator is allowed to do everything inline, your multi-agent system is a very expensive single-agent system with costumes.
Keep the human on the hook. The system can delegate, parallelize, and preserve context. It shouldn’t launder accountability.

These patterns aren’t engineering-specific because the substrate isn’t a codebase—it’s the repo. Swap the artifacts, and the seven still hold.

Squad isn’t only an engineering tool

Worth saying out loud, because the .ts code blocks above can mislead: Nothing in this architecture is engineering-specific. The substrate is the repo, not the codebase. Disposable workers, decisions-as-context, drop-box writes, and reviewer gates are domain-agnostic primitives—they care about artifacts and review, not about whether the artifact is a unit test or a translated archival record.

Tamir used the same scaffolding to run a Holocaust family-research project—agents coordinating archival lookups, translation passes between Yiddish, Polish, and Hebrew sources, and cross-corroboration of names across registries, with .squad/decisions.md acting as the working ledger of what had been established and what was still contested. No code was being shipped. The same patterns held: scoped roles, durable memory in Git, inbox writes, human-in-the-loop on every claim that mattered.

We’ve had the pleasure of working through a few other non-coding Squad scenarios. In one case, a sales team we support asked us to implement a “Sales Squad” and provided context and sales training documentation to help us do it. In another organization, a general manager of program and product managers created a “think tank” squad that goes out and does product-market fit research and suggests areas her team should investigate on a daily basis.

The bet underneath Squad is that this should be how a small group of humans—engineers, researchers, journalists, anyone who works with evidence—pulls coordinated work out of agents. Democratize the orchestration, not just the model access. Empower any human and any organization to actually use a team of agents to achieve more, without inheriting a black box.

Try it

The repository is here: github.com/bradygaster/squad.

The shortest path is the CLI plus Copilot. No SDK required.

npm install -g @bradygaster/squad-cli
squad init

Then open GitHub Copilot—CLI or VS Code, your call—and give the coordinator agent the shape of the project:

I'm starting a new project. Set up the team.
Here's what I'm building: a recipe sharing app with React and Node.

The coordinator writes .squad/. You review the diff. That’s it.

If you want to go deeper—programmatic team composition, custom routing rules, embedding Squad inside your own tooling—the SDK is the next layer:

npm install @bradygaster/squad-sdk

Start with a small repo. Commit .squad/. Inspect every diff. Let the agents write decisions. Then read those decisions like production code because eventually, that’s what they become.

If you build something useful, alarming, hilarious, or weird, open an issue. Tamir and I read them.

Stay a builder.

Introducing Microsoft’s EngThrive framework: Understanding developer productivity in the agentic AI era

The AI era has put developer productivity under a spotlight.

Engineering leaders everywhere are asking the same deceptively simple question: Are developers becoming more productive with AI? On paper, the answer can look obvious. Studies show AI coding assistants can reduce the time required for certain coding tasks by up to 56%. A recent study found that engineers using GitHub Copilot completed about 40% more code changes in the weeks they used it heavily compared to weeks they didn’t use it at all. More code is being produced than ever, and AI usage is rising quickly.

But there’s a problem: Productivity is not a simple measure of developer activity—it is a measure of our ability to deliver outcomes.

To understand productivity, we need to look at the holistic developer experience, and we need to understand how developers spend their time. At Microsoft, we’ve transformed how we understand and improve developer productivity through Engineering Thrive (EngThrive). The core idea is simple: Make it fast and easy to build great products.

EngThrive helps us understand productivity by creating a set of core metrics focused on Speed, Ease, Quality, and Thriving. Together, these dimensions give us a language for evaluating not just developer tools and AI, but the broader systems that shape engineering work: infrastructure, organizational design, workplace policy, and culture.

This focus matters now more than ever, because AI is changing the meaning of engineering activity. Code volume, PR counts, and task completion are changing wildly. But the outcomes we care about remain largely the same: speed of delivery, sustainable engineering systems, quality, and customer value.

Software engineering is more than coding

Everyone knows that AI can make coding dramatically faster. Yet developers across the industry only spend roughly 15% of their day writing new code. Even after including testing and debugging, most studies put hands-on coding work at only 25–30% of a developer’s total time.

The vast majority of developer time is spent on tasks both inside and outside the SDLC, ranging from “keep the lights on” tasks (operational work, software updates and maintenance) and organizational responsibilities (meetings, compliance, administrative tasks) to technical planning (design docs and reviews) and much more.

At Microsoft, we’ve run studies internally and across the industry to understand where developers spend their time. The diagram below is based on a recent analysis of developer workflows, and it highlights the full breadth of work it takes to plan, create, and operate software at scale.

While the exact distribution varies by organization, the pattern is surprisingly consistent across the industry: Coding is only a fraction of an engineer’s workload, and a wide variety of tasks consume the vast majority of developer time and energy.

This diagram reminds us that improving productivity first requires us to understand where we spend time and energy. We then use that understanding to target and improve the factors that create toil, repetition, or classes of work that can be accomplished via automation/AI.

The EngThrive model

EngThrive approaches productivity as a system composed of three interacting dimensions (Speed, Ease, Quality) with a fourth layer (Thriving) acting as a guardrail.

Rather than measuring isolated engineering activities, EngThrive measures the health of the engineering system and how it impacts developer journeys:

How quickly do ideas become customer value?
How much toil and friction do developers experience?
Does quality remain sustainable?
Can teams operate effectively without burning out?

This becomes especially important in AI-assisted engineering environments. As AI tools mature, the meaning of traditional engineering artifacts starts to change, but the underlying organizational questions remain remarkably stable:

How long does it take to turn ideas into impact?
Where do organizational toil and friction slow teams down?
Can developers consistently do high-quality work without the system fighting against them?
Are we shipping sustainably?

These are the questions EngThrive helps us understand, identify, and then improve.

Activity metrics are not outcome metrics.

The COVID productivity paradox

The danger of equating activity with productivity becomes clearest in moments when the metrics tell conflicting stories.

During the first months of mandatory remote work in response to COVID-19 in 2020, three things happened at Microsoft simultaneously: Pull requests per developer increased by more than 20%, the company’s stock price rose over 15%, and 78% of developers reported feeling burned out during the same period. The first two metrics painted a glowing picture. The third revealed a more troubling reality.

Productivity signals routinely diverge, and the signals you pay attention to matter. In the above example, if you focused on activity metrics, engineering looked extraordinarily productive. If you looked at business metrics, everything was on track. If you looked at human outcomes, the system was failing.

This reveals a fundamental measurement problem: Organizations often track activities (lines of code, pull requests, tasks) and treat them as proxies for outcomes (value delivered, speed, quality). But they are not the same. Conflating the two produces systems that are precise but wrong, leading to metric gaming and unintended behaviors that move organizations away from desired outcomes.

That’s the core insight at the heart of EngThrive: We focus on a triad of outcome metrics—Speed, Ease, and Quality—and only use activity metrics to help us understand changing patterns.

Productivity measures systems, not individuals

EngThrive deliberately avoids treating productivity as a way of measuring individuals. With metrics focused on Speed+Ease+Quality, it makes no sense to ask, “Did this individual developer have faster build times?” Instead, we understand that project build times are an essential component of “Speed,” and we look for places where those metrics are struggling.

That distinction matters because most productivity problems are system problems—and even the highest performing individual, in the context of a slow/toilsome system, will only reach a tiny fraction of their capability.

That is also why AI adoption outcomes vary so dramatically across organizations. The teams seeing the biggest gains from AI are the ones using AI to specifically target the drivers that impact Speed, Ease, and Quality. They’re the teams using AI to lower operational friction, improve onboarding, accelerate feedback loops, and enable engineers to spend more time and energy on innovation.

The takeaway

EngThrive is a concrete model for organizations that want to move beyond simply measuring activity toward improving outcomes.

The engineering teams that win in the AI era probably won’t be the ones generating the most code. They’ll be the ones best at reducing organizational friction around humans working with increasingly capable AI systems. And that’s a fundamentally different optimization problem than most companies are currently tracking.

Read the paper to learn more about EngThrive, its outcome-oriented North Star metrics, its diagnostic submetrics, and how it combines developer surveys and system telemetry to arrive at insights with both scale and context.

How Excel got agentic: An interview with Microsoft Director of Science Mukul Singh

When Mukul Singh made the jump from pure research into product, it was a leap of faith. But he had an idea that he wanted to bring to life: delivering agentic AI capabilities in Excel.

While this was well before buzzwords like “the agentic AI era” had cultural cachet, the research was already headed in that direction. So armed with a prototype and a healthy dose of ambition, Singh made his pitch.

Two years on, he’s fully transitioned from his role in Microsoft Research to a new gig in the Office Product Group and successfully delivered the ability to edit with Copilot in Excel (previously known as Excel Agent Mode). Not ones to rest on their laurels, the team quickly went on to ship agentic capabilities in PowerPoint, Word, and Outlook. It’s changing the way people get work done—and it started with the hypothesis that Excel, at its heart, is a low-resource programming language.

We sat down with Singh to learn more about his journey from research to product, the science and research behind Office’s new agentic capabilities, and why Excel was the perfect testbed for agentic AI.

Command Line

To kick us off, tell us a little about yourself.

Mukul Singh

I’m a researcher in the Office Product Group team. I recently started a science team focusing on agents and AI for the Microsoft Office product portfolio. I was originally in Microsoft Research, working on research full-time, publishing papers, and very far away from the product world. To be honest, it’s been an incredible journey getting to go from research to then being embedded in the product space.

Command Line

How long ago did you make that transition?

Mukul Singh

It was only two years ago—and right at the cusp of when the Excel Agent Mode work started. In fact, that was the catalyst. I had no intention of ever moving out of academia and deep research. And, you know, Microsoft Research is one of the best labs in the industry that’s still connected to academia. So I was pretty okay with my life. I didn’t want anything else. It was the most perfect blend that I could hope for.

And then this project in Excel started, and there were some initial discussions. I thought it sounded interesting, because one of the things in research that you always feel is missing as a gap is that you don’t see your work actually delivering value. You see it deliver a lot of theoretic value—you see its shape and the direction of the world, per se. Other people might take your research direction and extend it in meaningful ways. Research does shape society in a way, but you don’t see any immediately measurable impact.

So at that point, I felt like I wanted to work on something where I could see that happen—where I could watch the impact unfold in front of me.

Command Line

What are some of the research questions that ultimately led to Excel Agent Mode (which we’re now calling edit with Copilot in Excel)?

Mukul Singh

My research in MSR and all of my papers previously had all been about AI for low-resource programming languages. I like to explain low-resource coding languages as languages that are very obscure and make up less than like 0.1% of the entire coding community. So AI models are generally bad at them because they just learn off the internet and known sources.

To give you an example, internally at Microsoft there’s a language called Kusto Query Language (KQL). That’s just used for telemetry querying. Now, it’s a public language; it’s published. But no one outside of Microsoft uses it because why would they? We designed it for our database systems. So that’s the type of language that the models are bad at.

All of my research was looking at how we could make models good at these languages, which they are not naturally good at. We need to drop in hints, cues, documentation, give it retry mechanisms and everything.

When I was initially approached about this work in Excel, I was at first very skeptical of how that and my work might be related. I’ve done a lot of tabular research, sure, but it’s not Excel. But then they drew the parallel that, actually, Excel is just another low-resource programming language. It has its own internal language. It has its own engine. It’s just that the model doesn’t know it.

Excel is just another low-resource programming language. It has its own internal language. It has its own engine. It’s just that the model doesn’t know it.

– Mukul Singh

Command Line

Yeah, in your LinkedIn post, you talked a little bit about how Excel functions like a low-resource programming language. What really surprised you about the project when you dug in?

Mukul Singh

So the vision originally for Excel Agent Mode—and, by the way, kudos to the team and their thinking. This was before any agents. Today, everything is agents, right? You can automate, you go to an app and assume that there’s a button somewhere that will do it for you. But that didn’t use to be the case. Excel, I just didn’t think of it as an AI-forward app. I thought of Excel as an app that adapts, right? That doesn’t need AI. But the vision at that point was that the team wanted to automate end-to-end user workflows.

If I open Excel, anything I have in mind that I want to do, this agent I’m building should be able to do it. That was a very difficult vision to achieve. In fact, that’s the reason that this project took so long and had so many setbacks—because we set our ambition up so high, before the models even had the capability to be good enough to do anything like this at scale.

Say you want to add a pivot table, right? That’s a common task. I don’t know how to do it, but I know people do it. It’s just like a five- to 10-line code snippet that, wherever someone clicks the button, “Create pivot table,” internally that code is run. Now, we could connect the model and it could generate all of that code, which is just some random JSON and JavaScript objects. To us, it’s like gibberish text, but it’s completely deterministic and exactly controls the behavior. And the good thing about Excel is that all of its surface is programmable.

This is not true for all apps, by the way. Very few of Microsoft’s apps are truly programmable, but Excel—you have to hand it to the engineers 40 years ago when they were setting it up: They made sure that everything is programmable end-to-end.

Command Line

What are some of the twists and turns that the work has taken over the last two years?

Mukul Singh

When we started this project, the vision really hit home. We had insane videos that the product managers and designers made, showing how the product is today. But this was two years ago, right? And the videos were the same.

So you see the disconnect: everyone, including leadership, bought in. Like, this is the future. We want to invest in this. It’s the right thing to do. And we got a very strong crew of people. I was brought in. We were all working on it, but the reality was that the models weren’t good enough. This was before the age of the reasoning models. The best model we had to work with was 4.0, and anyone who has played with models knows that 4.0 just cannot do long chain tool calling.

It used to collapse after a couple of calls, and it was only able to do things like start a formula, start a problem, which at that time was still cool and—bless our marketing team, they were able to make that such a good feature with the help of design and everything. But even that, to us, felt like, “Oh, we’re falling short.” And they were like, “No, this is still great.”

The industry just wasn’t here yet. That was, I feel, the weakest moment of the project where there was this doubt. Like, is this just too aspirational? Is it just not possible? Did we predict the future wrong, or will the models not even get there for like 20 years? We just didn’t know.

We had promised a lot to customers on the backing of these researchers’ point of view, and now we were in a space where we didn’t feel confident that we could deliver. There were debates about whether we should just cut this project and rebrand it into something completely different.

So it was indeed quite the journey. We were shooting way above our weight class with very little evidence other than just the intuition of maybe three people who said, “We think this is where the world will end up.”

Command Line

When did you get a sense that the models were headed in the right direction in terms of reasoning and long chain tool calling?

Mukul Singh

I think that’s a very good question. When OpenAI announced their first series of reasoning models, that was a pivotal moment for us. We tried the model, and it couldn’t solve any of the promised scenarios we had. But the traces showed signs of life. Like in the pivot table example: It tried to generate the pivot table code. It ran it—it just gave back an error that the area where you’re trying to create the pivot table already has data in it.

Now, the previous models at this point would either keep looping in this same error, just give up, or do some other random gibberish text response and not recover. But o1 tried to recover. That was the first sign of life that, OK, this is now working in a chain. It does something, looks at feedback, and tries to recover and keep trying until it succeeds.

It went up to like 20, 30 tool calls, still not able to solve real complex tasks. But we thought that if we reinforced it with the right information, it might be able to do something.
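As an editorial aside, the act, observe, recover chain Singh describes can be illustrated with a toy loop. This is only a sketch under stated assumptions: `ToySheet`, `RangeOccupiedError`, and `run_agent` are hypothetical names invented for illustration, the "policy" is a fixed list of candidate cells rather than a model, and nothing here reflects Excel's actual APIs or the team's implementation.

```python
from dataclasses import dataclass, field


class RangeOccupiedError(Exception):
    """Raised when the target cell already holds data."""


@dataclass
class ToySheet:
    # Maps a cell address such as "A1" to its value. Hypothetical stand-in
    # for a real workbook; this is not an Excel API.
    cells: dict = field(default_factory=dict)

    def create_pivot_table(self, anchor: str) -> str:
        if anchor in self.cells:
            raise RangeOccupiedError(f"{anchor} already has data in it")
        self.cells[anchor] = "<pivot table>"
        return f"pivot table created at {anchor}"


def run_agent(sheet: ToySheet, candidates: list[str], max_calls: int = 30) -> list[str]:
    """Make one tool call per iteration, feeding each error into the next choice."""
    trace = []
    rejected = set()
    for _ in range(max_calls):
        # Toy "policy": pick the first candidate the environment has not rejected.
        remaining = [c for c in candidates if c not in rejected]
        if not remaining:
            trace.append("gave up: no candidates left")
            break
        anchor = remaining[0]
        try:
            trace.append(sheet.create_pivot_table(anchor))
            break  # success ends the chain
        except RangeOccupiedError as err:
            # Feedback from the environment changes the next attempt.
            rejected.add(anchor)
            trace.append(f"error: {err}")
    return trace


sheet = ToySheet(cells={"A1": "Region", "D1": "Total"})
trace = run_agent(sheet, ["A1", "D1", "G1"])
for step in trace:
    print(step)
```

Run as written, the loop records two "already has data" errors and then succeeds on the third cell. Looping on the same error or giving up, the failure modes Singh attributes to earlier models, would correspond to never updating `rejected` or breaking on the first exception.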

Command Line

This is backtracking a little bit, but when you ran up against the limitations of the models—when it seemed like maybe you were headed down the wrong path and the research had it wrong—what kept you going?

Mukul Singh

I’ve been in the space of generating low-resource programming languages for a while. I had even published papers for Excel in general, so I know the ecosystem. Everything’s built on top of Excel. So based on all of the research, we had a very strong intuition that this is where the models were headed. And our parallel was very simple: that coding agents, like people using GitHub Copilot and VS Code, and Excel should intuitively be no different.

Their implementation and all the UX decisions are different, but they are the same in principle—it’s a surface in which I do manual work. I want that automated, and I’m doing that through code. Our parallel was always that Excel is an IDE. People don’t see it as an IDE. They see it as an app, but that’s just because it’s presented as an app. It’s really just an IDE. For analysts, it’s the equivalent of VS Code to a developer.

Command Line

You’ve been talking about this, but I’m going to ask the question directly: What is it that makes Excel a really solid testbed for agentic AI?

Mukul Singh

I truly believe that Excel is the first large-scale enterprise app that was actually automated by an agent—meaning that the agent can take meaningful action in the app. So it doesn’t have fixed workflows. It’s actually controlling it. And what makes Excel the perfect testbed is that it’s programmable. All of it can be controlled by the agent.

It’s not just code, because that code has real consequences in the app that you can see. If I add a chart, I see a chart there, right? If I connect a formula to a different sheet, I see live updates. So it’s kind of a living environment. It’s a live environment in which I’m making changes. And if I make changes, the environment will undergo some change. And I’ll use that as feedback, so there’s observability. And that is the best testbed.

An agent needs an action space that allows you to interact with an environment. And it needs feedback from the environment. Now, this seems trivial, but honestly, very few apps fit this charter perfectly. So this loop made Excel very interesting. All of the research across everything we’ve discussed, it’s been over a span of 17 or 18 research papers all going back two to three years. When someone looks at it at first, they might not even connect the dots. But there’s like three papers on just the best representation of data for Excel sheets. And people will think, “Oh, that’s probably for storage and optimization and context querying.” But no, it’s just that we need a concise representation for the model.

Excel actually showed the way for a lot of the apps in the industry in general that this is how an automation pattern works. At least within Microsoft: After Excel, PowerPoint followed. Then Word followed, the same pattern. And Outlook was recently announced—same pattern, same research crew delivering all of that, one after the other, just because all of the fundamentals are there.

Excel is the first large-scale enterprise app that was actually automated by an agent.

– Mukul Singh

Command Line

So your team was involved in expanding the editing with Copilot functionality across the Microsoft Office product suite?

Mukul Singh

Yeah. After my stint at Excel, I kind of figured out that, you know what? This is kind of fun. In fact, it’s very similar to research where you get a vague abstract problem, you have no guidance, and you just try to go solve it. Talk to people. Figure things out. It’s like the same environment, just the difference is that when it actually ships, you feel a sense of accomplishment, and you can point people to it. Like I can tell an Excel guru, “Open that side panel. Try a query. You’ll be amazed.” And I built that.

So the moment that Excel was in a stable state, we handed it off back to the feature team—the PMs and designers—to run with it. And then we moved on to the next surface, PowerPoint, and that seemed even more interesting. Ever since I moved into a Director of Science role inside the Office Product Group, where I’m overseeing all of the agent work, I’ve realized that it’s all based on very similar patterns. And it just so happens that all of my research aligned very closely with where the industry headed because, let’s be honest: Eight years ago, when I was starting my research for low-resource programming languages, at that time, there were just embedding models. At the time, it just seemed interesting. We had no idea the shape that agents would take.

Going into this field, the models were already good at good languages, so what could I contribute there? The only place where I could meaningfully contribute was with the things the models were bad at. So I literally made a list of everything these models struggled with, and it just stood out that programming seemed useful. If it can write good code, someone somewhere will be happy.

Command Line

Were there any hiccups along the way? Did you run into any unexpected technical challenges?

Mukul Singh

At first, we ran into the faster horses trap. We’d show customers these videos and ask, “As an analyst, will you be more productive with this tool?” And everyone said, “It needs to come back with an answer in 30 or 40 seconds. That’s how long I can wait. There’s no way in the world that I’m going to sit there for 10 minutes and leave it running. No one works like that. I’d cancel that operation after 30 seconds.”

Out of every customer we talked to, not a single person had ever said that they were willing to go beyond one minute of wait time.

Command Line

Right.

Mukul Singh

So we were optimizing for that. It needs to complete faster. If the model is making three errors and recovering, that’s not good enough. It shouldn’t make those three errors. And that sent us down paths of optimization that are very tricky. Like today, they’re still not solved—they’re that difficult. But we thought that was non-negotiable for the product.

Then this company called Shortcut came out, and their app ran for 25 minutes on a single query. But the results were amazing. Initially people said, “Who’s going to wait 25 minutes?” But within a week, everyone said, “I now just log in, push the query, and just leave. I don’t need to do like anything.”

And we were like, what? But for every one of our customers that we talked to, their mindset immediately changed. They were asking for faster horses, but then someone delivered them a car. And now they understand that that’s exactly what they needed.

They were asking for faster horses, but then someone delivered them a car.

– Mukul Singh

Command Line

That’s really interesting.

Mukul Singh

The challenges were really around customer expectations. We went with the assumption that all our apps are IDEs. But the hypothesis starts to break down because people use these apps with nuance. And the moment we fail to understand that, it goes from a good to a bad experience. For Excel, it was that people are willing to wait for long-running tasks. But they need auditability. It can’t just work behind the scenes—it has to show its work. It needs to be computed on the sheet.

Similarly, for PowerPoint, people really care about their brands and templates. The initial version of the PowerPoint agent that we shipped, which I would rank as the best version that we built, was the one that everyone hated. We had optimized for a PowerPoint agent that works independently, as if in a vacuum, because that’s the way I use PowerPoint. I start from scratch, no templates. I just have some information that I need to present, and I want to find the best way to do that. But there were other very real constraints that we had to understand. So PowerPoint really pivoted to take into account templates and brand guidelines.

Similarly for Outlook, we found that people really care about confirmation. Like with Visual Studio and all these IDEs, they try to optimize for the least user engagement required, because that’s automation, right? You can leave it running and you don’t have to worry about it unless there’s something that you really need to look at. So they really try to cut down on the model asking something back.

But in Outlook, the story is completely the reverse. People actually love it when the model gives them a list of things that it proposes, and you can just accept or delete each of them in a list. So like:

Send this draft email to this person? Accept.
Create this to-do list? Accept.
Delete these seven emails? Reject.

And, again, it was an anti-pattern.
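As an editorial aside, the propose-then-confirm flow Singh describes for Outlook can be sketched in a few lines. This is a minimal illustration under stated assumptions: `ProposedAction` and `apply_reviewed` are hypothetical names, the actions are placeholder lambdas that only return strings, and none of this reflects how Outlook or Copilot is actually implemented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    description: str
    run: Callable[[], str]  # executed only after explicit approval


def apply_reviewed(proposals, decisions):
    """Run each proposal whose decision is 'accept'; skip the rest.

    Proposals without a matching decision are never run, because zip()
    stops at the shorter of the two sequences.
    """
    results = []
    for proposal, decision in zip(proposals, decisions):
        if decision == "accept":
            results.append(proposal.run())
        else:
            results.append(f"skipped: {proposal.description}")
    return results


proposals = [
    ProposedAction("Send this draft email to this person", lambda: "sent draft email"),
    ProposedAction("Create this to-do list", lambda: "created to-do list"),
    ProposedAction("Delete these seven emails", lambda: "deleted seven emails"),
]
decisions = ["accept", "accept", "reject"]

results = apply_reviewed(proposals, decisions)
for line in results:
    print(line)
```

The design point is that nothing with side effects happens at proposal time. Each action is held as a callable and invoked only once the user has accepted it, which is the reverse of the IDE-style goal of minimizing questions back to the user.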

Command Line

What can you tell us about the tech transfer process and how your research made its way into the product?

Mukul Singh

Yeah, that was actually tricky. I feel like, in general, a lot of people struggle with taking their research and landing it in a product. There’s a very weird gap, because the product teams are very direct about needing to talk to more researchers, and the research teams are keen to get the product group’s attention because everyone wants their work to be used and get funded. But there’s still this weird gap—it’s just not well connected. And the surprising thing is, it’s not because of a lack of will from either side.

It’s because product always comes with so much nuance. The product teams can’t just blindly trust that the research teams are going to understand everything and build it. And for the research team, it’s very hard to put themselves in the product team’s shoes and understand all the constraints. Maybe it needs to run in 10 seconds or less, or it needs to have auditability traces. That’s something that the research teams would feel is a very strange objective.

But at least for Excel, I think the process was smoother because we operated as an integrated team.

Command Line

What was the overall team dynamic like?

Mukul Singh

For Excel Agent Mode, it was a crew of just 10 or 11 people. Everyone had a title, but the titles meant nothing, right? There were PMs checking in PRs. There were devs writing papers. And there were researchers writing specs. There was just everyone on board working as if it were a startup—logged into a conference room and just checking in change after change. That was honestly the most fun time, when it was just people building stuff, coming up with ideas. But then we did do a proper handoff.

Taking the research into the project became easier because everyone discussed everything openly, so everyone understood the same things.

Command Line

What avenues for future research did this work open up?

Mukul Singh

I think a lot. Before this, I don’t think people were really thinking of agents for things that normal people use as apps. That just wasn’t a concept. And Excel has already given way to PowerPoint, Word, and Outlook, which are all entirely different surfaces, all just automating the app completely. There was significant uptake, even outside Microsoft, when people saw that, not only is it possible, but people love it.

This project was a testbed that showed two very distinct things: that it was possible to automate an app through these models, which previously was not the accepted truth, and that there was a lot of appetite for it.

Excel in its early days had a lot of flaws in the agent. But people overlooked a lot of that just because of those scattered moments of brilliance where they were like, “Oh my god, it did something that I couldn’t even have imagined.”

This has opened the door so that, I think over the next couple of years, all of these apps that we use day to day will have an agent—not to replace it, but that will be our go-to whenever we struggle to get something done. It showed the path, that this is possible for other apps, regardless of their complex surfaces.