Signal blog

By builders, for builders.

A Microsoft publication

Composing a new platform for agent-first devices

Abstract

What changes when agents become both a new unit of programming and an emerging new unit of human-to-machine interaction? The mission of Project Solara, a new software platform coupled with tailored hardware solutions, is to pioneer agent-first experiences that are shaped around you: your agents, your tasks, your environment, under your control. So, what’s different this time from previous generations of computers? Agents and AI accelerate the creation of even more specialized computers without incurring the full cost and tradeoffs that in the past limited the creation, diversity, and specialization of those new forms. We imagine a diverse ecosystem of agent-first devices, from small to large, from fixed to hypermobile, from personal to professional. We’re starting this journey with two concepts designed for the enterprise—and we’re excited to navigate this transformation with you all. 

I manage the Applied Sciences Group, an interdisciplinary team that brings together product engineering, research, and the sciences to explore what comes next in computing. The rise of agents is changing not only how software is built, but how people interact with computers—and ultimately, what new kinds of computers may become possible. We are excited to give you an early look at where we believe computing is headed, and what the next computer may look like. 

The next computer

When we think of a computer, we tend to picture something familiar: a laptop, a phone, maybe a tablet. But computing has never really stood still. It keeps moving closer to us, closer to the work, closer to the moment where it can provide the most value.  

Mainframes did not disappear when PCs arrived. PCs did not disappear when phones arrived. Phones did not disappear when watches arrived. Each new form became more specialized, closer to you, closer to the solution you need. Each one found a new place in our lives because it was better suited to a specific context, a specific task, or a specific moment. So, what’s next?

Agents as the new interaction technology 

At Build 2023, I shared my perspective on three emerging AI application structures, shaped by how AI functions relative to your application: Is the AI beside your app, inside it, or outside it? 

In the first application structure, the AI is beside your application, like a helper. It keeps the original app architecture and is minimally disruptive to what our customers already know.

In the second application structure, the AI is inside, as part of the main scaffolding; it becomes the main input loop. Here, AI is used to redefine the application’s interaction model and even its purpose. The experience becomes less dependent on point-and-click commands and becomes more automatic. This is where we are seeing the emergence of agents (for example, Researcher and Agent Mode in Office) and AI-first applications. 

The third AI application structure is where AI moves from operating within the application frame to operating outside it, globally. Here, AI orchestrates across multiple apps and services, allowing the agent to connect, coordinate, and maintain context across entire workflows, across devices, and even across very different timescales. Current examples include the recent emergence of various claws (like OpenClaw and Lobster), coworker-like agents, and similar systems. 

And so here we are today, where agents are a new unit of programming and the new unit of human-to-machine interaction, changing the way people interact with and use their computers. And as we have seen many times in the past, new interaction technologies enable new types of computers.

New interaction technology enables new types of computers 

Every new computer form factor follows the pattern shown above. A jump in processing power, both in the cloud and at the edge, has enabled us to create hyper-complex software (AI), making agents possible. Through these agents, human language and dialog are the new interaction technology. For the first time in our history, we can program, direct, and initiate action with computers the way we talk with each other. This higher mode of interaction makes both the computer and us less dependent on the traditional ways we have interacted with computers: keyboards, screens, or even premediated apps. … And because of these trends, we are seeing a major opportunity for new types of form factors.

As AI streamlines the traditional development stack, these emerging form factors make it possible to bring agents into places, workflows, and moments that previously were difficult or cumbersome. A more specific and better tool for more specific tasks.  

That is the opportunity in front of us: agent-first devices.

Agent-first devices accelerate specialization  

Historically, specialization has been expensive. If you wanted to create a new type of computer, you had to build almost everything: hardware, software, services, developer tools, UI patterns, management systems, security models, and an ecosystem. This custom stack has been both a hurdle and a moat for new computer form factors.

Take a look at the diagram above, which illustrates the typical technology stack for a computer. Not just for laptops, but phones, watches, wearables, industrial devices, and so forth. Each layer in that stack represents a major company or even an entire industry. Bringing a new type of computer to market has historically required building out or modifying nearly every layer. This is expensive, difficult, and takes time. But what if it didn’t have to be that way?  

AI and the new agent interaction model reduce this burden. AI introduces new UI and app model flexibility into those layers. With just-in-time UI (see below), fewer apps need to be written for specific hardware implementations. With agentic coding, less effort needs to be spent refining a developer SDK for human consumption. As agent-only experiences grow to cover more of users’ needs, fewer of the traditional UX surfaces (like app frameworks or even browsers) need to be implemented for the specific hardware. The boundaries between those layers will blur and, in some cases, disappear.

Therefore, agents enable us to create new types of computers that are more specific, more contextual, and closer to where they add value, without rebuilding the entire stack every time. This is the mission of Project Solara. 

Introducing Project Solara 

To enable this new era, we are introducing a chip-to-cloud platform, codenamed Project Solara, designed from the ground up for agent-first experiences and the new device form factors they enable. Chip-to-cloud sounds funny, I know, but what it really means is that the “operating system” is liminal, transcending the device and the cloud. The system brings a lightweight window to the edge, where the agent manifests and where the state, via Azure, can encompass a constellation of specialized devices. 

This is not just about bringing intelligence to the PC, the browser, or the phone. It is about bringing intelligence into the places where people need it most: in the flow of work, in the environment, and closer to the task at hand. 

We are building this platform on a simple premise: The next platform shift is from apps to agents—from software you open to intelligence you invoke; from graphical interfaces of buttons to expressing intent through agents; and from AI operating inside your applications to agents working outside and across your apps, workflows, and devices. 

This is not just about asking an agent questions. It is about giving people a more direct way to reason over their work, context, tools, and workflows—without navigating every app, notification, or interface layer. 

And because we believe the future will not be defined by one agent, Project Solara is designed for an open, multiple-agent world. Organizations will use Microsoft agents where they add value. They will also source or build their own agents for their specific workflows and requirements. 

The platform must bring these agents together coherently, while respecting boundaries between data, domains, identities, and organizations. That is why enterprise manageability, identity, security, privacy, and user control are not afterthoughts. They are part of Project Solara’s foundation. 

We are also investing in just-in-time UI: the ability for an agent experience to adapt across devices and modalities without requiring developers to redesign everything for every new form factor. Today, that means semi-structured approaches like adaptive cards and known content types. Over time, it moves toward more dynamic and generative interfaces. This is what makes specialized form factors viable. 

We are previewing concepts that explore two very broad categories: stationary and portable. Both are multimodal: glanceable access, voice, vision, and getting to the right agent at the right moment. We are also investigating several verticals, including healthcare, retail, the financial industry, and more.

Every place where compute can add value becomes an opportunity to help users achieve more. Every workflow, every environment, every role can have a more specific tool. Not devices built around apps, but devices built around agents—that is the promise of Project Solara. It’s a new way to bring intelligence into the moments and places where people need it most. 

We are still early. I don’t want to over-promise. But I also don’t want to understate the significance of the shift. When the cost of specialization drops, innovation accelerates.

More details… 

Project Solara is specifically designed for the new era of agent-first devices. It establishes hardware and software requirements that will meet enterprise needs for manageability, security, and privacy, while ensuring critical user experiences are delivered. 

The cloud is not the only place intelligence lives. The agent sits between user intent and distributed execution. The UI becomes more like an adaptive access layer. The device becomes a window into long-running intelligence and action. A human-scale interface layer between the person and a larger intelligent environment.

Three pillars to the platform: 

  1. Enterprise-readiness, with privacy, security, control, and trust 
  2. Agent-driven interaction model with just-in-time UI 
  3. Extensibility to bring your own agents

Enterprise-readiness, with privacy, security, control, and trust 

Seamless access to your agents must be balanced with transparency and control, so enterprise customers, device users, and the people around them can understand and control how these devices are used. 

We are building the Project Solara platform to support enterprise-level hardware and software manageability, security, and privacy protections to securely access services such as WorkIQ. Project Solara includes reference designs that are easy to modify, which accelerates building and customization.

Device-side attributes of Project Solara: 

  • Microsoft Device Ecosystem Platform (MDEP) is an enterprise-grade operating system built on AOSP, designed to meet the highest standards of security, reliability, ease of deployment, and innovation—enabling device makers to build and deploy at scale.  
  • Agent Shell that can dynamically load and tailor multiple cloud-based agents.  
  • Microsoft Intune allows IT administrators to manage and secure these devices just like PC and mobile devices today. 
  • Entra ID so users can use their existing Microsoft accounts. 
  • Hello for Business with at least one biometric authentication method, like facial recognition or fingerprint, allowing seamless access to the device. 
  • Easy privacy controls like a physical mic mute button, and clear indicators when listening or recording. 
  • Approved chipsets accompanied by applicable reference designs.

These attributes represent our current thinking and will evolve as we continue to build out the platform.

Agent-driven interaction model with just-in-time UI 

These new devices are not meant to run traditional apps. They are designed for agents. That shift gives us more flexibility in the user interface, because the experience can adapt to the device, the screen size, the content, and even the mode of interaction—whether visual, voice, touch, or multimodal.

Every new device form factor has traditionally required its own application model, UI patterns, and optimization work for screen size, resolution, runtime, and input method. That is one reason new device categories are so expensive to build, and why they can struggle without a strong app ecosystem behind them.

AI changes that equation. We are already seeing models generate content, images, and layouts tailored to different contexts. If those capabilities become part of the agent loop, an agent can adapt its visual, voice, or multimodal interface to the device it is running on, without forcing developers to redesign the experience for every form factor. We call this broader capability just-in-time UI.

Just-in-time UI exists on a spectrum defined by how much structure is required to render an experience. On one end is responsive UI: highly structured interfaces that reflow predictably across screen sizes. On the other end is fully generative UI: a future state in which AI can create the interface frame by frame with minimal predefined structure. That future is not here yet, but we can already see early signs of it.  

Today, Project Solara is intentionally building for the middle of that spectrum—beyond traditional responsive design, but not dependent on unconstrained generation. That gives agents enough flexibility to adapt their presentation across very different devices while preserving consistency and usability. In practical terms, the same agent can render a custom experience on multiple screen sizes and modalities with little or no additional work from the developer. For us, that is the first proof point: a path to specialized devices without requiring developers to rebuild the experience from scratch each time.
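To make the "middle of the spectrum" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Card` and `DeviceProfile` types and the `render` function are hypothetical and are not part of Project Solara or the Adaptive Cards schema. It only shows how one semi-structured payload from an agent might be presented differently on a small screen, a larger screen, and a voice-only surface.

```python
from dataclasses import dataclass, field


@dataclass
class Card:
    """Hypothetical semi-structured payload an agent might emit."""
    title: str
    body: str
    actions: list[str] = field(default_factory=list)


@dataclass
class DeviceProfile:
    """Hypothetical description of what a device can present."""
    name: str
    has_screen: bool
    max_chars: int  # rough budget for visible text


def render(card: Card, device: DeviceProfile) -> str:
    """Adapt one card to a device; the agent author writes no per-device UI."""
    if not device.has_screen:
        # Voice-only surface: speak the title and body, then list the actions.
        spoken = f"{card.title}. {card.body}"
        if card.actions:
            spoken += " You can say: " + ", ".join(card.actions) + "."
        return spoken
    text = f"{card.title}\n{card.body}"
    if len(text) > device.max_chars:
        # Small screen: truncate the text and mark the cut.
        text = text[: device.max_chars - 3] + "..."
    buttons = " ".join(f"[{a}]" for a in card.actions)
    return f"{text}\n{buttons}" if buttons else text


card = Card("Design review at 2 PM", "Room 4. Three open action items.", ["Join", "Snooze"])
badge = DeviceProfile("badge", has_screen=True, max_chars=30)
desk = DeviceProfile("desk", has_screen=True, max_chars=200)
voice = DeviceProfile("voice", has_screen=False, max_chars=200)

for device in (badge, desk, voice):
    print(f"--- {device.name} ---")
    print(render(card, device))
```

The structure (title, body, actions) is what keeps the result predictable; a more generative approach would relax that structure in exchange for flexibility.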

Extensibility to bring your own agents 

One of the most important realities of this new era is that there will not be a single dominant agent.

Instead, we are entering a world of many specialized agents, each optimized for different skills (coding, communication, analysis, etc.), datasets and domains, and organizational scopes and requirements. Just as no single app could replace Word, Excel, and PowerPoint, no single agent can meet every need.

This creates a critical challenge: How do you bring multiple agents together into a coherent experience? The most straightforward approach is manually launching agents like launching apps. But soon the user will want more sophistication, more automation, and more coordination. We are working on various software technologies for delegation to specialized agents, like an agent dispatcher and an agent task manager, which can automatically activate or surface agents when needed.
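As a rough illustration of what "dispatching" could mean, here is a toy sketch in Python. It is not the Project Solara dispatcher: the `AgentDispatcher` class, its methods, and the keyword-overlap routing rule are all hypothetical, and a real system would more likely route with a model and richer context than with keyword matching.

```python
from typing import Callable, Optional

# Hypothetical agent signature: takes a request string, returns a response string.
Agent = Callable[[str], str]


class AgentDispatcher:
    """Toy dispatcher: routes a request to the agent with the best skill overlap."""

    def __init__(self) -> None:
        self._agents: dict[str, tuple[set[str], Agent]] = {}

    def register(self, name: str, skills: set[str], agent: Agent) -> None:
        self._agents[name] = (skills, agent)

    def route(self, request: str) -> Optional[str]:
        """Return the name of the agent whose skill keywords overlap the request most."""
        words = set(request.lower().split())
        best_name, best_overlap = None, 0
        for name, (skills, _) in self._agents.items():
            overlap = len(words & skills)
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        return best_name

    def dispatch(self, request: str) -> str:
        name = self.route(request)
        if name is None:
            # No confident match: fall back to manual selection by the user.
            return "no agent available; ask the user to choose one"
        _, agent = self._agents[name]
        return agent(request)


dispatcher = AgentDispatcher()
dispatcher.register("coding", {"bug", "build", "code"}, lambda r: f"coding agent: {r}")
dispatcher.register("meetings", {"meeting", "record", "notes"}, lambda r: f"meetings agent: {r}")

print(dispatcher.dispatch("record this meeting"))
print(dispatcher.dispatch("order lunch"))
```

The fallback branch mirrors the "manually launching agents" baseline described above: when nothing matches, the user picks.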

Concept reference device designs  

We’re developing concept designs to test and pilot the Project Solara platform. These concept devices are not meant to define the limits of the platform, but to show the range of what becomes possible across stationary, portable, wearable, and hyper-mobile experiences. 

While these designs may not become the exact shipping experience, they help inform the platform and experience needs to get us started—and show the power of an agent-first interaction model: devices can be shaped around the agent, the environment, and the workflow, instead of forcing every use case into the same general-purpose form. 

Silicon partners 

MediaTek and Qualcomm are the first silicon partners working with us to deliver solutions to support Project Solara, starting with initial concept designs and expanding to a broad set of form factors in the future. 

With Qualcomm, we’ve worked closely on a portable-device concept-reference design. Qualcomm is a leader in silicon for wearables and other new form factors for intelligent devices.

“Microsoft’s Project Solara is an important step in advancing agent-first experiences across a wide range of devices and form factors,” said Dino Bekis, Qualcomm Senior Vice President for Personal and Wearable AI. “With deep experience enabling the majority of today’s wearable experiences and bringing advanced AI to billions of mobile devices, Qualcomm Snapdragon platforms are uniquely optimized for agentic AI—combining high performance with industry-leading power efficiency. We’re proud to partner with Microsoft to help accelerate this next era of intelligent, personalized computing.”  

With MediaTek, we’ve worked closely on the development of a stationary device concept design. MediaTek has deep expertise and a breadth of device partners across the IoT ecosystem. 

“At MediaTek, we’re bringing intelligence to edge devices with best-in-class silicon,” said Vince Hu, MediaTek Senior Vice President & General Manager, Data Center & Computing. “Microsoft’s Project Solara platform will significantly accelerate the opportunity for agent-first experiences and devices. We look forward to our continued collaboration, building from the first device concept to an extended ecosystem of Project Solara-powered devices.” 

Portable reference design: Badge concept device  

We’ve reimagined a form factor that information workers, nurses, front-line workers, and millions of others use every day: the access badge. This on-the-go, lightweight, always connected companion empowers each person to do more by having their agents always by their side. 

Device capabilities include: 

  • Touchscreen display
  • Hello for Business fingerprint sensor button, allowing secure access to the device and agent
  • Privacy switch and volume controls
  • Far-field high SNR microphone array and speaker 
  • Side-facing camera
  • WiFi, Bluetooth, GNSS, and 5G wireless connectivity 
  • Qualcomm wearable silicon

With Hello for Business fingerprint recognition, you are always a touch away from your agents, so you can quickly glance at what’s coming up next with your Priority Agent, or be one tap away from recording an impromptu hallway conversation with Facilitator.

Using the integrated camera, the platform allows agents, with user permission, to better understand and help take action on the environment around them. 

In-place reference design: Desk concept device 

For our next concept, we thought deeply about where many of us spend a lot of time today already: our desks. Whether your desk space is limited, or you’ve maximized your config with multiple monitors, we’ve designed a humble yet helpful companion providing frictionless access to your agent to help you stay in your flow. 

Device capabilities include: 

  • Touchscreen display 
  • Hello for Business with face authentication
  • Privacy lock buttons 
  • Microphone mute and volume buttons
  • Dual far-field microphone array and full-range speaker
  • UWB presence sensor
  • 2 USB-C ports for power and optional external display or peripheral
  • WiFi and Bluetooth wireless connectivity
  • MediaTek IoT silicon 

Hello for Business provides enterprise-grade protection and frictionless authentication, so you can glance at your calendar, stay on top of only the most critical items through curated Priority Cards, or tap into the ultimate thought partner with Microsoft 365 Copilot voice, grounded in your WorkIQ data.

This desk concept can work stand-alone, serve as a companion to your Windows PC, or even become your cloud PC through Windows 365 when connected to an external display. As a companion, it pairs with your PC via Bluetooth, enabling you to hand off tasks between the devices and keep lock state consistent. Plug in a display via USB-C, and the desk agent device can transform into your Windows 365 client—providing access to both the power of your full Windows 365 experience and the benefit of an agent-first device experience.   

Together, the badge and desk concept devices show what becomes possible when agents are no longer confined to one app, one screen, or one device. They show how agent-first experiences can move across stationary, portable, and wearable forms—adapting to the user, the context, and the work. 

Real-world piloting 

We are using these concept designs to inform how these form factors and the platform can be built. They will become reference designs for the ecosystem to build turnkey solutions. Inside Microsoft, hundreds of employees are already using these concept devices to improve their workday.

Here are some of the ways we and our partners are using, building, and experimenting with Project Solara to help users be more productive: 

Microsoft 365 ecosystem  

  • Microsoft 365 Copilot, through conversational voice, is available at tap or (optional) wake word, allowing you to securely access your data, grounded in WorkIQ. Copilot provides daily briefings, becoming your ultimate thought partner to brainstorm, explore ideas, take action, or get coaching. 
  • Researcher can now help you keep tabs on your long-running projects by providing a more direct way to reach and respond to prompts and share reports when complete. 
  • Facilitator is more accessible, allowing users one-tap access to securely record an in-person meeting, with all the power of transcription, detecting action items, and ensuring this information is grounded in WorkIQ. Never miss an important outcome or struggle to find your notes.  
  • Priority Agent is an experimental agent our team is developing to bring actionable insights and actions directly to you. Grounded in signals across WorkIQ, Priority Agent provides the answer to “what needs my attention right now?” Priority Agent dynamically curates this list, adding and removing items intelligently, so you only glance at what’s needed now.

We are also partnering with other teams across Microsoft to explore how Project Solara can help deliver additional value for users:  

  • GitHub Copilot is exploring how an agent-first approach helps keep developers more in touch with the progress of their coding projects and provides faster ways to get things done through new modalities like voice.
  • Dragon Copilot is exploring how agent-first experiences can better support physicians and nurses in the flow of care—helping capture interactions, surface relevant information in-context, and follow through on critical tasks without interrupting their day.

We’re excited to see how the agents from other third parties will find value and reach users in more direct ways, in more natural modalities. Here are ways you’ll be able to build for Project Solara devices: 

We’ll have more to share on other ways to build agents for Project Solara devices in the future. 

Private pilot program 

In the coming months, we’ll begin piloting this agent-first device ecosystem with industry leaders like AccuWeather, Best Buy, CVS Health, Levi’s, Target, and others.  

Platform ecosystem 

Realizing the Project Solara platform vision requires close connections across silicon providers, device builders, agent developers, and customers, especially in the early phases of learning and iteration.  

We will extend our collaboration with silicon partners to create reference designs for a range of categories spanning portable, ultra-portable, wearable, desktop, and others. 

With those reference designs, we’ll enable OEMs and product makers to develop specialized solutions for specific scenarios and environments across a variety of industry segments, spanning healthcare, retail, hospitality, financial services, legal, industrial, field service, and more, while meeting the needs of enterprise security and management and providing seamless access and control for users.

Agent builders will be able to reach more people in more places, using the adaptability of the Project Solara platform to bring their agents into the workflows, environments, and moments where they can create the most value. 

People, companies, and other institutions adopting Project Solara will shape the agent-powered, problem-solving experiences that they need.  

Together, we will unlock the creativity and energy to establish a broad set of agent-first solutions, empowering everyone to achieve more. 

Closing thoughts 

I’m excited to share this shift and how we are building a new platform to help usher in a new era of agent-first experiences and devices with our partners. 

This is where computing and new types of computers are headed. And importantly, this expands the reach and value of the agents and automation you are already building today. 

A device on a desk. A device worn in the field. A device in a hospital, a store, a factory, a school, or a home. Each one becomes a new access point for your agents, and a new way to bring productivity, intelligence, and assistance into places where computing has not reached as naturally before. 

Agents will reshape not only software, but the devices themselves.

Because now you can imagine something more: not just an agent inside an app, but an agent delivered through a device purpose-built for a specific place, a specific workflow, and a specific job to be done. 

That is the bigger opportunity. 

For agent builders: Think big. The agents you are creating today will not be limited to the screens and devices we know today. They will be able to show up across a variety of new form factors—devices designed around them, tuned for them, and deployed into the moments where they can create the most value. 

So, if you are developing agents today using Microsoft 365, Copilot Studio, the Microsoft 365 Agents SDK, and if you are using Azure to cloud-scale your solutions, then you are already taking the right steps to be ready for this future. 

Project Solara is about making that future easier to build, in a way that is open, secure, manageable, and scalable. 

We are still early, and there is more to come. And to me, the direction is clear: Agents will reshape not only software, but the devices themselves. 

And I cannot wait to see what you build. 

Grounding at scale: Engineering the retrieval system for the agentic web

Humans and AI don’t search the same way. As people increasingly turn to chatbots and agents for information, grounding that AI—connecting it to fresh, relevant, and authoritative information—takes on new importance as foundational infrastructure. Microsoft’s grounding layer already powers most of the world’s major AI assistants. And today at Build, we took that work further with Web IQ, a new grounding system for the agentic web. 

Web IQ delivers industry-leading quality, sub-165ms P95 latency (~2.5× faster than the nearest alternative), and token-efficient retrieval, all while respecting publishers’ preferences. The same infrastructure powering Copilot, ChatGPT, enterprise systems from Nasdaq, and others is now available as a neutral, MCP-native, model-agnostic platform. In this post, we’ll explore the architectural challenge, the Web IQ stack, and how we optimized for speed at scale.

Grounding redefines the optimization problem 

Most discussions of AI systems still start with models. But once those systems are deployed at scale—especially in search, copilots, and agentic workflows—the dominant bottleneck shifts. The central problem becomes grounding: what information reaches the model, how fresh it is, how much context can be included, and how quickly that evidence can be delivered. 

In a grounding system, those requirements collapse into three tightly coupled constraints: latency, quality, and token efficiency. 

In classical search, these dimensions can often be traded off relatively independently: A slower system can still be useful if it returns strong document results, and an imperfect ranking can still succeed if the user can inspect and repair the outcome. Inside an AI inference loop, that decoupling disappears.

In AI search and agentic systems, grounding sits inside the inference loop. Retrieval directly shapes generation, tokens determine both cost and latency, and missing or stale context propagates into reasoning errors rather than degrading gracefully. The optimization target is therefore no longer a ranking function in isolation, but rather a coupled system operating under latency, quality, and token-efficiency constraints. In that setting, grounding goes from a component to a system architecture problem. 
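One way to see the coupling is a toy evidence selector in Python. This is a sketch under stated assumptions, not Web IQ's method: the `select_evidence` function, the greedy rule, and the (score, tokens, text) tuples are hypothetical. It shows how a token budget forces a quality tradeoff: a relevant but long passage can lose its place to a shorter one.

```python
def select_evidence(passages, token_budget):
    """Greedily keep the highest-scoring passages that still fit in the token budget.

    passages: list of (relevance_score, token_count, text) tuples (hypothetical format).
    Returns the chosen texts and the number of tokens used.
    """
    chosen, used = [], 0
    for score, tokens, text in sorted(passages, key=lambda p: p[0], reverse=True):
        if used + tokens <= token_budget:
            chosen.append(text)
            used += tokens
    return chosen, used


passages = [
    (0.9, 120, "A: short, highly relevant passage"),
    (0.8, 300, "B: relevant but long passage"),
    (0.7, 60, "C: short, moderately relevant passage"),
]

selected, used_tokens = select_evidence(passages, token_budget=200)
print(selected, used_tokens)
```

With a budget of 200 tokens, passage B is skipped even though it outranks C, because including it would exceed the budget. Every token sent to the model also adds cost and latency, which is why the three constraints cannot be tuned in isolation.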

Semantic‑first as a system design principle 

Before describing Web IQ’s architecture, it helps to name the underlying shift more precisely: Large‑scale retrieval is moving from hybrid stacks, where lexical systems dominate first‑stage recall and dense models re-rank, toward semantic‑first systems in which representation learning defines the primary retrieval space. 

That shift is now practical because modern embedding models preserve substantially more of the relevance signal at retrieval time, and ANN infrastructure is mature enough to search that space under production latency constraints. Just as importantly, retrieval is no longer limited to one vector per document. Instead, the effective unit can be a passage, span, or a small set of learned representations that retain finer interaction structure until late in the pipeline. 

  • Content is indexed as semantic representations rather than only lexical postings 
  • Candidate generation operates over neighborhoods in the embedding space, often at passage or sub-document granularity
  • Fine relevance signals can be deferred to later interaction stages instead of being collapsed entirely into a single early score
  • Lexical matching remains useful as a constraint, calibration signal, and fallback for exactness-sensitive cases

Rather than eliminate hybridization, this relocates it. In a semantic‑first stack, dense retrieval becomes the default access path, while later stages recover precision through richer interaction, filtering, calibration, and task-specific refinement. That choice propagates through the system: how content is chunked, how representations are trained, what the ANN index must preserve, and how evidence is assembled for downstream reasoning. 
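The shape of a semantic-first pipeline can be sketched in a few lines of Python. This is an illustration under assumptions, not Web IQ's implementation: the function names, the passage dictionaries, and the hand-written 2-d vectors are hypothetical, and a brute-force scan stands in for the ANN index. Dense retrieval generates candidates at passage granularity, and lexical matching is applied afterward as a constraint with a fallback.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_first_search(query_vec, passages, k=2, must_contain=None):
    """Stage 1: dense candidate generation over passage vectors.
    Stage 2: optional lexical constraint for exactness-sensitive queries."""
    # Brute-force nearest neighbours stand in for an ANN index here.
    ranked = sorted(passages, key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    candidates = ranked[:k]
    if must_contain is not None:
        exact = [p for p in candidates if must_contain.lower() in p["text"].lower()]
        # Fall back to the dense ordering if the lexical filter removes everything.
        candidates = exact or candidates
    return [p["id"] for p in candidates]


# Toy vectors written by hand for illustration; a real system would use an embedding model.
passages = [
    {"id": "doc1#p3", "text": "Error code E42 means the sensor is offline.", "vec": [0.9, 0.1]},
    {"id": "doc2#p1", "text": "Sensors can go offline during maintenance.", "vec": [0.8, 0.3]},
    {"id": "doc3#p7", "text": "Quarterly revenue grew in the retail segment.", "vec": [0.1, 0.9]},
]
query = [1.0, 0.2]

print(semantic_first_search(query, passages))
print(semantic_first_search(query, passages, must_contain="E42"))
```

Note that the lexical step never widens the candidate set; it only narrows what the dense stage already surfaced, which is the "relocated" hybridization described above.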

This direction has been visible inside Bing for years: shift more of the retrieval quality into learned representations, reduce dependence on head-query interaction logs, and expose content that lexical access paths and click priors systematically underserve. The long-term implication is a retrieval stack whose first stage is semantic by construction and whose later stages recover fine-grained matching only where it matters. 

Web IQ is the first grounding system built end-to-end around that retrieval premise. 

A reference architecture for grounding: The Web IQ stack 

At the base of Web IQ is a retrieval system operating at global scale, but the key design choice is that documents are no longer the primary unit of access. The system is organized around both semantic representations of content and the operational question that follows from that choice: how to search a global embedding space with high recall, bounded latency, and enough structure preserved for downstream grounding. 

That immediately elevates two components from implementation details to system primitives: the embedding model, which determines what notions of relevance are geometrically recoverable, and the ANN index, which determines whether that geometry can be searched fast enough and updated often enough to reflect the live state of the corpus. 

Harrier: Embedding as the geometry of the system 

In a semantic‑first system, the embedding model defines the retrieval geometry. It determines which documents, passages, or sub-document units are near a query, which distinctions are preserved under compression into vectors, and which relevance signals must be recovered later through more expensive interaction. 

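Formally, Harrier, our family of custom-trained and open-source multilingual text embedding models, learns a mapping:

$$f_\theta : \text{text} \rightarrow \mathbb{R}^d$$

with the similarity between a query $q$ and a content unit $x$ given by:

$$s(q, x) = \frac{f_\theta(q) \cdot f_\theta(x)}{\lVert f_\theta(q) \rVert \, \lVert f_\theta(x) \rVert}$$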

The formulation is simple, but the systems implication is severe: Retrieval can only surface structure that the embedding space preserves. If multilingual equivalence, paraphrase robustness, entity specificity, or fine topical distinctions aren’t encoded well enough in the representation, the downstream stack can at best compensate partially and at additional cost. 
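To make the scoring concrete, here is a small sketch that assumes the similarity is the cosine between the query embedding and the content embedding, which is the usual choice for normalized embedding models. The function name is illustrative, and the vectors stand in for the outputs of the embedding model.

```typescript
// Cosine similarity between two embeddings of equal dimension.
// Assumption: this is the scoring function; the vectors would come from the
// embedding model, which is not part of this sketch.
function cosineSimilarity(q: number[], x: number[]): number {
  if (q.length !== x.length) throw new Error("dimension mismatch");
  let dotProduct = 0;
  let normQ = 0;
  let normX = 0;
  for (let i = 0; i < q.length; i++) {
    const qi = q[i] ?? 0;
    const xi = x[i] ?? 0;
    dotProduct += qi * xi;
    normQ += qi * qi;
    normX += xi * xi;
  }
  const denominator = Math.sqrt(normQ) * Math.sqrt(normX);
  if (denominator === 0) throw new Error("cannot score a zero vector");
  return dotProduct / denominator;
}
```

Because the score depends only on direction, two pieces of content can only be separated at retrieval time if the model places them at different angles to the query; anything the vectors do not encode is invisible to this stage.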

Harrier is trained using large-scale contrastive learning, combining billions of weakly supervised pairs with high-quality curated examples and synthetic data generation.  

The goal is not merely high benchmark retrieval accuracy. The model must produce a space that remains stable across languages, robust to phrasing variation, efficient under ANN search, and aligned with the kinds of evidence selection and reasoning tasks the grounding layer performs later in the pipeline.

A key design choice in Harrier is the use of decoder‑only architectures with last‑token pooling and normalization, producing dense representations that are operationally consistent across tasks. That differs from the older encoder-centric embedding pattern and reflects a tighter coupling between retrieval models and the broader LLM stack. 
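Last-token pooling followed by normalization can be sketched in a few lines. This is a generic illustration of the technique, not Harrier's code: `hiddenStates` stands in for the final-layer outputs of a decoder (one vector per token), and `attentionMask` marks real tokens with 1 and padding with 0. Both names are assumptions.

```typescript
// Pick the hidden state of the last real (non-padding) token.
// Using lastIndexOf(1) works whether padding sits on the left or the right.
function lastTokenPool(hiddenStates: number[][], attentionMask: number[]): number[] {
  const last = attentionMask.lastIndexOf(1);
  if (last < 0) throw new Error("sequence has no real tokens");
  const state = hiddenStates[last];
  if (!state) throw new Error("mask and hidden states disagree in length");
  return state.slice();
}

// Scale a vector to unit length so that dot product equals cosine similarity.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) throw new Error("cannot normalize a zero vector");
  return v.map((x) => x / norm);
}

// Embedding = normalized hidden state of the last real token.
function embed(hiddenStates: number[][], attentionMask: number[]): number[] {
  return l2Normalize(lastTokenPool(hiddenStates, attentionMask));
}
```

In a decoder-only model the last token is the only position that has attended to the whole input, which is why it is the natural pooling point; normalizing afterward means an index can compare vectors with a plain dot product.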

In practice, Harrier builds on modern decoder backbones and is refined through staged training: broad pretraining to inherit linguistic and world knowledge, contrastive specialization to shape retrieval behavior on domain data, and distillation into smaller deployment variants. Distillation matters not only for cost; it’s what allows the system to preserve a compatible embedding geometry across deployment tiers while pushing latency and throughput in the right direction. 

The result is an embedding model that is competitive on public benchmarks and, more importantly, behaves predictably under production workloads where distribution shift, multilingual traffic, and latency constraints matter more than leaderboard position.  

DiskANN: When geometry meets reality 

If Harrier defines the geometry, DiskANN3 defines what is operationally achievable inside it.  

Approximate nearest neighbor search is often presented as an algorithmic trick—at web scale, it’s an operating constraint that determines the memory footprint, recall-latency frontier, and freshness envelope of the entire retrieval system. 

DiskANN3 matters because it provides high-recall streaming search and operational flexibility on memory vs. throughput. 

It decouples the update and query logic, which controls index quality, from storage details. That allows high-recall search across memory regimes: disk-resident indices that avoid requiring the full graph and vectors to live in memory, purely in-memory indices for the highest throughput, and the spectrum in between. 
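For readers unfamiliar with graph-based ANN, the query side is a greedy beam search over a proximity graph. The sketch below is a toy, in-memory version of that general technique with a bounded candidate list. It is not DiskANN's code; the `ProximityGraph` layout and parameter names are assumptions for illustration.

```typescript
// Toy proximity graph held fully in memory (an assumption of this sketch).
interface ProximityGraph {
  vectors: number[][];   // vectors[i] is the embedding of node i
  neighbors: number[][]; // neighbors[i] lists the out-edges of node i
}

function squaredDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = (a[i] ?? 0) - (b[i] ?? 0);
    sum += d * d;
  }
  return sum;
}

// Greedy beam search: repeatedly expand the closest unvisited candidate and
// keep only the `beamWidth` closest nodes seen so far.
function greedySearch(
  graph: ProximityGraph,
  query: number[],
  entry: number,
  beamWidth: number,
  k: number,
): number[] {
  const distanceTo = (id: number): number => {
    const v = graph.vectors[id];
    if (!v) throw new Error(`unknown node ${id}`);
    return squaredDistance(v, query);
  };
  const seen = new Set<number>([entry]);
  const visited = new Set<number>();
  let beam = [{ id: entry, dist: distanceTo(entry) }];
  for (;;) {
    const next = beam.find((c) => !visited.has(c.id));
    if (!next) break; // every candidate in the beam has been expanded
    visited.add(next.id);
    for (const n of graph.neighbors[next.id] ?? []) {
      if (seen.has(n)) continue;
      seen.add(n);
      beam.push({ id: n, dist: distanceTo(n) });
    }
    beam.sort((a, b) => a.dist - b.dist);
    beam = beam.slice(0, beamWidth);
  }
  return beam.slice(0, k).map((c) => c.id);
}
```

The search only touches the nodes it walks through, which is what makes it possible to keep most of the graph and vectors on disk. It also shows why connectivity matters: if an update removes the edges the walk depends on, the search stops early and recall drops.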

But the more consequential issue isn’t static search quality; it’s whether the index can absorb continuous updates without losing stability. 

In a grounding system, retrieval is only as current as the index, and stale graph structure shows up immediately as missed evidence, longer prompts, and more retries downstream. 

In Web IQ that means distributed ANN graphs, streaming update paths, and mutation strategies that avoid frequent full rebuilds. Rather than simply fast query-time traversal, the objective is a semantic index that can remain both searchable and live. 

The updated logic in DiskANN3 makes the update problem explicit: Proximity graphs are hard to mutate because local connectivity is fragile, and naive deletions or insertions can degrade search quality or force rebuilds. Solving that moves the system toward a truly streaming semantic index that takes only a few milliseconds to make new content searchable and retains high search quality without full index rebuilds. This is essential for providing accurate grounding to AI agents.  

Evidence objects: Controlling token economics 

Once retrieval produces candidates, the next problem is context construction: selecting and packaging the evidence that the model will actually consume. 

Web IQ departs from the document-centric search stack. Rather than just handing whole documents to the model, it can construct evidence objects: passage-level units with provenance, structural metadata, and enough local context to remain interpretable when detached from the source page. The aim is to preserve the evidence needed for reasoning without paying the token cost of full-document recall.  

That changes the optimization target from document relevance to information density per token. Better evidence objects reduce prompt size, improve reasoning quality by concentrating the relevant facts, and preserve attribution so that outputs remain inspectable. This is the practical meaning of returning the most relevant chunks rather than entire documents.  
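Optimizing for information density per token can be sketched as a budgeted selection problem. The example below is a simple greedy heuristic under stated assumptions, not the Web IQ algorithm: the `EvidenceObject` fields, the `relevance` score, and `packEvidence` are all hypothetical names chosen for illustration.

```typescript
// Hypothetical shape of an evidence object; field names are assumptions.
interface EvidenceObject {
  passage: string;
  sourceUrl: string;    // provenance, kept so outputs stay attributable
  sectionTitle: string; // structural metadata from the source page
  relevance: number;    // score from earlier stages, higher is better
  tokens: number;       // token cost of including this passage
}

// Greedy selection by relevance per token under a fixed token budget.
// This is a knapsack heuristic: fast and usually good, not provably optimal.
function packEvidence(candidates: EvidenceObject[], tokenBudget: number): EvidenceObject[] {
  const byDensity = candidates
    .filter((e) => e.tokens > 0)
    .sort((a, b) => b.relevance / b.tokens - a.relevance / a.tokens);
  const chosen: EvidenceObject[] = [];
  let used = 0;
  for (const e of byDensity) {
    if (used + e.tokens <= tokenBudget) {
      chosen.push(e);
      used += e.tokens;
    }
  }
  return chosen;
}
```

Under this objective, a long passage with the highest raw relevance can lose to two short passages that together carry more signal for the same token spend.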

Orchestration: The hidden system layer 

At the top of the stack sits orchestration, which has become one of the most important components precisely because AI queries aren’t limited to short keyword expressions. Instead, they’re often long, compositional, and dependent on prior conversational state. 

The orchestration layer interprets those requests, maps them onto retrieval strategies, executes those strategies across distributed infrastructure, and assembles evidence under strict latency and context-window constraints. Because it operates statefully against short-term memory and partial prior results, this layer is better thought of as execution planning for grounding rather than as a thin wrapper around search. 

Optimizing for speed at scale 

A grounding system also must be fast enough to remain inside an interactive inference loop. In practice, that means designing towards 100ms search latency—not as a marketing target, but as a systems target. Once retrieval, evidence construction, and orchestration sit on the critical path of generation, every additional millisecond increases both user-visible delay and the probability of cascading retries. 

At that scale, performance is governed less by median latency than by the tail. The system therefore must be engineered around microsecond-level budget discipline across network hops, storage access, ANN traversal, and model execution, with aggressive control of tail amplification, careful failure handling, and degradation paths that preserve correctness when subsystems are slow or unavailable. Speed isn’t one optimization; it’s a property of the entire distributed pipeline. 
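A short worked example shows why the tail dominates. If a request fans out to many backends and must wait for all of them, the chance that at least one is slow grows quickly with fan-out. The sketch assumes backends are independent, which real systems only approximate; the function names are illustrative.

```typescript
// Probability that at least one of `fanOut` independent backends is slower
// than its own percentile threshold (0.01 means "slower than its p99").
function tailHitProbability(slowFraction: number, fanOut: number): number {
  return 1 - Math.pow(1 - slowFraction, fanOut);
}

// Nearest-rank percentile over observed latencies in milliseconds.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] as number;
}

// With a fan-out of 100, roughly 63% of requests hit some backend's p99.
const hit = tailHitProbability(0.01, 100); // ≈ 0.634
```

Median latency barely moves in that scenario, but most user requests are now as slow as the slowest shard, which is why budgets have to be set and enforced per hop rather than on the average.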

That in turn makes efficiency a first-order design principle. Embedding models and re-ranking stages have to run on extremely efficient kernels and inference engines; data movement has to be minimized; and batching, caching, and memory layout have to be tuned for real workloads rather than benchmarks. The result is a culture of relentless performance work: shaving tail latency, reducing waste in every stage, and treating throughput, reliability, and latency as coupled properties of the same system. 

The web as substrate: Bing, crawling, and the system beneath grounding 

All of the layers above assume something more fundamental: a high-fidelity, continuously updated representation of the web. Far from a static dataset, that substrate is a dynamic, adversarial, multi-stakeholder system whose content, structure, and incentives change continuously. 

For agentic grounding, crawl quality is upstream of answer quality. If the system doesn’t discover the right pages, revisit them at the right cadence, or parse them into stable representations, retrieval can’t recover the missing evidence later. At web scale, that makes crawling and indexing first-class systems problems: deciding what to fetch, when to revisit it, how to normalize heterogeneous content, and how to propagate updates through a distributed index without taking the system offline or destabilizing retrieval semantics. 

The web is also an ecosystem, not just a corpus. A production crawler must operate with politeness, respect publisher constraints, and preserve attribution, usage and quality signals from crawl through index construction and into evidence objects. Those constraints are part of the grounding system itself because the model can only cite and reason over evidence that has been collected, interpreted, and packaged responsibly. 

Another complication is that the web responds to retrieval systems: Content is optimized for ranking, deduplicated, and continuously reshaped. Covering trillions of pages therefore takes more than bandwidth. It requires sophisticated models for discovery, canonicalization, spam detection, language understanding, and change prediction, together with trust and quality defenses that keep a semantic-first stack stable under continuous drift. 

That’s why a long-lived system like Bing matters to Web IQ. Broad coverage isn’t only a matter of crawl volume; it depends on years of accumulated infrastructure, change models, publisher integration, anti-spam signals, and operational feedback. For agentic grounding, that history matters because a system can only ground against the web it has learned to discover, understand, and maintain over time. 

A system perspective on grounding 

The point here isn’t that any individual component is unprecedented. Embedding models, ANN indexes, crawlers, and orchestration layers all existed before. What changes in Web IQ is that they’re treated as one coupled system, organized around semantic-first retrieval, and optimized for the constraints that agentic grounding imposes. 

Taken together, the system perspective is straightforward: 

  • Embeddings define what is geometrically retrievable
  • ANN infrastructure determines whether that representation can be served with sufficient recall, freshness, and latency
  • Evidence objects determine how efficiently the model can consume grounded context
  • Orchestration, performance engineering, and crawl quality determine whether the pipeline can operate reliably at web scale

At that point, grounding is no longer an extension of search. It is a core infrastructure layer for agentic AI.

Disposable agents, durable memory: The architecture behind Squad

Make the agents disposable. Keep the memory in Git.

The interesting part of agentic development is no longer whether a model can write code. It can. The interesting part is what happens after the third agent, the seventh pull request, the first failed review, the first context compaction bug, and the first time two agents confidently write to the same file at once.

This is the story of Squad, but not as a product tour. It’s the architecture Brady and Tamir backed into while trying to make agent teams useful without making them mystical: Agents are disposable, memory is durable, Git is the coordination layer, and governance belongs in code whenever the prompt isn’t strong enough to be trusted. Which, as it turns out, is often.

Giving agents agency and watching them hack one another

Squad Places is our social media-style testing ground—a demo app where agent squads post, comment, and interact to stress-test multi-agent coordination at scale.

Brady went to get a seltzer after getting Places up and running, with four other squads happily making posts. Walking away was probably unwise. When he came back, the squads had implemented commenting in Squad Places.

That sounds like a magic trick. It wasn’t. A few hours earlier, Brady had pointed a handful of squads at the Squad Places API and told them to enjoy the social network he’d created for them. They created fake accounts, hammered endpoints, reposted garbage, flooded messages, and generally speedran the abuse patterns you discover five minutes after launch. Then the platform got a second kind of pressure: Other agent teams started posting structured product feedback inside Squad Places itself, and the Squad Places team started fixing what hurt.

[Figure: Multiple windows showing Squad Places, GitHub commits, and agent session reports during a stress test]
[Figure: Squad Places artifact page showing an API contract review from The Wire squad]
[Figure: Squad Places comments thread beneath an API contract review artifact]
[Figure: Squad Places feed sorted by most discussed artifacts, with squad filters visible]

This is the part worth paying attention to. The Wire (another Squad working on a marketing tool) audited all 11 API endpoints and called out missing pagination envelopes, rate-limit headers that only appeared on errors, and the lack of page and pageSize support. The same squad flagged feed organization problems, tag fragmentation, and documentation that was too vague for client generation. Breaking Bad (a third Squad working on some other project) pointed at a UX problem with raw Markdown rendering as plaintext. Those reviews didn’t disappear into a chat log. They turned into commits.

| Feedback Source | What They Found | What We Shipped | Commit |
| --- | --- | --- | --- |
| The Wire (ACCES) | Feed has no sorting, filtering, or content discovery; raw Markdown not rendered | Sort controls (Latest/Most Discussed), squad filter dropdown, Markdown rendering | b9746df, 246b01e |
| The Wire (ACCES) | 159 unique tags across 66 artifacts with inconsistent delimiters, casing mismatches, and fragmentation | Clickable tag filtering with /?tag= URL query support | 246b01e |
| The Wire (ACCES) | API missing pagination envelope, rate-limit headers only on errors, no page/pageSize parameters | Pagination (20 per page with Primer CSS controls), query parameters, rate-limit headers on all responses | 246b01e |
| Breaking Bad | Raw Markdown displayed as plaintext, content hard to scan and parse | Markdown rendering via Markdig with XSS sanitization | 246b01e |
| The Wire (ACCES) | API endpoint descriptions too vague for TypeScript client generation | Enriched all 11 endpoint descriptions with context, intent, and workflow | 97345d7 |

Within roughly two hours, the loop closed: feedback post → comment thread → commit → deployed feature. Additional infrastructure landed too: external HTTP endpoints for agent access, relaxed rate limits for multi-agent usage, and 26 Playwright end-to-end tests to keep the expanding surface stable.

Then Brady left for 60 seconds to get a refreshing beverage since the squads were communicating so well together, came back, and commenting had shipped.

The point here isn’t that “agents are magic.” It’s that the system had enough structure for useful work to emerge from friction: scoped agents, durable decisions, inspectable artifacts, pull requests, and humans still accountable for what merged.

Also, we made a bit of a mess in the car during the roadtrip.

Good systems usually start that way.

The core bet: Don’t preserve the agent. Preserve the work

Most agent systems start by asking how to make the agent remember more. Squad started working when we inverted the question.

Don't preserve the agent. Preserve the work.

An agent instance should be cheap to spawn and safe to destroy. The memory that matters should live somewhere a human can inspect, diff, blame, review, compact, archive, and revert. Tamir’s opinion: That’s the repository.

The first useful shape Tamir implemented looked like this:

    human intent
        ↓
    coordinator resolves team + routing
        ↓
    agent spawn reads:
      - its charter
      - team decisions
      - its own history
      - current focus
      - relevant skills
        ↓
    agent does scoped work
        ↓
    agent writes artifacts back:
      - code/docs/tests
      - decisions
      - history learnings
      - skills when patterns stabilize
        ↓
    agent exits
        ↓
    next spawn reconstructs continuity from files

That’s the whole trick. The process is transient. The written trail is not.
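Here is a tiny sketch of that loop, assuming an in-memory map stands in for the repository working tree. It is not Squad's code: `RepoFiles`, `loadSpawnContext`, and `appendLearning` are hypothetical names, and treating everything except the charter as optional is our assumption.

```typescript
// In-memory stand-in for the repo working tree: path -> file contents.
type RepoFiles = Map<string, string>;

// Rebuild an agent's context purely from files. Nothing is carried over
// from a previous process; missing optional files are simply skipped.
function loadSpawnContext(files: RepoFiles, agent: string): string {
  const charterPath = `.squad/agents/${agent}/charter.md`;
  const charter = files.get(charterPath);
  if (charter === undefined) throw new Error(`missing charter: ${charterPath}`);

  const optionalPaths = [
    ".squad/decisions.md",
    `.squad/agents/${agent}/history.md`,
    ".squad/identity/now.md",
    ".squad/identity/wisdom.md",
  ];
  const sections = [charter];
  for (const path of optionalPaths) {
    const body = files.get(path);
    if (body !== undefined) sections.push(body);
  }
  return sections.join("\n\n");
}

// Write a durable learning back before the agent exits.
function appendLearning(files: RepoFiles, agent: string, learning: string): void {
  const path = `.squad/agents/${agent}/history.md`;
  const existing = files.get(path) ?? "";
  files.set(path, `${existing}- ${learning}\n`);
}
```

Two calls to `loadSpawnContext` with an `appendLearning` in between behave like two separate agent lifetimes: the second spawn knows what the first one learned only because it was written down.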

When you run squad init, the important artifact isn’t a daemon. It’s .squad/:

    .squad/
    ├── team.md                 # roster and roles
    ├── routing.md              # dispatch rules
    ├── decisions.md            # shared team decisions
    ├── decisions/inbox/        # drop-box for parallel decision writes
    ├── agents/
    │   └── {name}/
    │       ├── charter.md      # identity, expertise, boundaries
    │       └── history.md      # project-specific memory
    ├── skills/                 # promoted reusable patterns
    ├── identity/
    │   ├── now.md              # current focus
    │   └── wisdom.md           # durable operating principles
    ├── orchestration-log/      # what spawned, why, and what happened
    └── log/                    # session traces and diagnostics

Commit it. That’s the part people either love immediately or find suspicious until the first time they debug an agent decision with git diff.

Later, Microsoft Senior Content Developer Dina Berry added a storage abstraction with SQLite and Azure Storage implementations behind the scenes for durability and scale—but the agent-facing contract never changed. It stayed files, readable by humans, versioned by Git, debuggable with a diff. A persistent hidden memory store can be useful. It can also quietly rot. A Markdown decision file is embarrassingly inspectable. That embarrassment is a feature.

The “work done” with Squad Places made it stronger

Let’s tie these lessons back to our opener: the story of multiple squads trying to hack Places. We deliberately didn’t harden Places so we could see what they would do. They were notorious. We logged it all. Everything we logged? We gave it back to the Places squad, which worked through dozens of issues and a handful of pull requests, adding GitHub authentication, content filtering, all the trimmings. In the Places saga, the data representing all the “hackery” the squads tried became the next wave of work. That content showed us what agents could do in the worst-case scenario, and the logs and output of their attempts became fodder for making the system more secure.

Charters are prompts, but also contracts

A Squad agent isn’t just a name slapped on a system prompt. Each agent has a charter.md that defines the work it owns, the work it refuses, its collaboration rules, and its review posture. A simplified charter template looks like this:

    # {Name} — {Role}

    ## Identity
    - **Name:** {Name}
    - **Role:** {Role title}
    - **Expertise:** {2-3 specific skills}
    - **Style:** {communication style}

    ## What I Own
    - {Area of responsibility 1}
    - {Area of responsibility 2}

    ## Boundaries
    **I handle:** {types of work this agent does}
    **I don't handle:** {types of work that belong to other team members}
    **When I'm unsure:** I say so and suggest who might know.

    ## Collaboration
    Before starting work, read `.squad/decisions.md`.
    After making a decision others should know, write it to
    `.squad/decisions/inbox/{my-name}-{brief-slug}.md`. The Scribe will merge it.

That last paragraph is doing more than it looks like. It makes the decision path explicit. Agents don’t all append to the canonical shared brain at once. They write drop files. A merge layer reconciles.

The current SDK repo’s squad.config.ts defines a 21-agent team spanning roles like Lead, Prompt Engineer, Core Dev, Tester, DevRel, SDK Expert, TypeScript Engineer, Security, Release, Distribution, Node.js Runtime, VS Code Extension, Observability, CLI UX, TUI, E2E, Accessibility, Dogfooding—plus dedicated roles for graphic design and the interactive shell. That sounds like theater until routing starts working. Then it feels more like an org chart encoded in files.

Here’s the SDK-first version of the same idea:

    import {
      defineSquad,
      defineTeam,
      defineAgent,
      defineRouting,
      defineCasting,
    } from '@bradygaster/squad-sdk';

    export default defineSquad({
      version: '1.0.0',
      team: defineTeam({
        name: 'squad-sdk',
        description: 'The programmable multi-agent runtime for GitHub Copilot.',
        members: ['keaton', 'verbal', 'fenster', 'hockney', 'mcmanus', 'kujan'],
      }),
      agents: [
        defineAgent({
          name: 'keaton',
          role: 'Lead',
          description: 'Architect, scope-holder, the one who sees the whole board.',
          status: 'active',
        }),
        defineAgent({
          name: 'kujan',
          role: 'SDK Expert',
          description: 'The one who understands the Copilot SDK inside and out.',
          status: 'active',
        }),
      ],
      routing: defineRouting({
        rules: [
          {
            pattern: 'sdk-integration',
            agents: ['@kujan'],
            description: '@github/copilot-sdk usage, session lifecycle, event handling',
          },
          {
            pattern: 'architecture',
            agents: ['@keaton'],
            description: 'Product direction, architectural decisions, code review, scope',
          },
        ],
        defaultAgent: '@keaton',
        fallback: 'coordinator',
      }),
      casting: defineCasting({
        allowlistUniverses: ['The Usual Suspects', 'Breaking Bad', 'The Wire', 'Firefly'],
        overflowStrategy: 'generic',
      }),
    });

Run squad build, and the generated .squad/ files become the same inspectable operating record. TypeScript gives you composition and validation. Markdown gives you reviewability. Tamir wanted both.
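To show what routing rules with a default agent buy you, here is a deliberately small resolver. It is a sketch under a loud assumption: it treats `pattern` as an exact work-type label, which may not be how the SDK actually matches. The types and `resolveAgents` are hypothetical and do not come from `@bradygaster/squad-sdk`.

```typescript
// Hypothetical local types mirroring the shape of the config above.
interface RoutingRule {
  pattern: string;
  agents: string[];
  description: string;
}

interface Routing {
  rules: RoutingRule[];
  defaultAgent: string;
}

// Assumption: a rule matches when its pattern equals the work-type label.
// Anything unmatched falls through to the default agent.
function resolveAgents(routing: Routing, workType: string): string[] {
  const rule = routing.rules.find((r) => r.pattern === workType);
  return rule ? rule.agents : [routing.defaultAgent];
}

const routing: Routing = {
  rules: [
    { pattern: "sdk-integration", agents: ["@kujan"], description: "SDK usage" },
    { pattern: "architecture", agents: ["@keaton"], description: "Architecture" },
  ],
  defaultAgent: "@keaton",
};
```

The useful property is that dispatch is data. A reviewer can read the rules, and a change to who owns what shows up as a diff.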

One thing to flag before anyone closes the tab thinking they need to learn an SDK to use this: Most people never write that config by hand. You don’t need the SDK to use Squad. Open GitHub Copilot—in the CLI or in VS Code. Talk to the coordinator agent, and it writes .squad/ for you. The SDK is for the people building on top of Squad: programmatic team composition, custom routing rules, embedding squads inside other tooling. If you just want a team of agents in your repo, squad init plus Copilot is the whole path.

The spawn prompt is deliberately boring

The coordinator doesn’t rely on vibes. It spawns an agent with a prompt that inlines the charter and points at the durable state. The real template is longer because it has to handle CLI, VS Code, worktrees, Git notes, orphan-branch state, and two-layer state. But the important part is this:

    You are {Name}, the {Role} on this project.

    YOUR CHARTER:
    {paste contents of .squad/agents/{name}/charter.md here}

    TEAM ROOT: {team_root}
    All `.squad/` paths are relative to this root.

    Read .squad/agents/{name}/history.md.
    Read .squad/decisions.md.
    If .squad/identity/wisdom.md exists, read it.
    If .squad/identity/now.md exists, read it.
    Check .squad/skills/ for relevant SKILL.md files.

    INPUT ARTIFACTS:
    {list exact files}

    The user says: "{message}"

    Do the work. Respond as {Name}.

    AFTER work:
    1. Append durable learnings to your history.
    2. If you made a team-relevant decision, write:
       .squad/decisions/inbox/{name}-{brief-slug}.md

This is not elegant. It is explicit. Explicit wins.

We learned this the hard way in the VS Code path. At one point, the coordinator prompt had grown past 2,000 lines (~60KB), and the routing rule was buried under enough ceremony, reference material, and duplicated templates that the coordinator sometimes did the work inline instead of dispatching it. The failure wasn’t that the model was dumb. The failure was that we gave it an overstuffed instruction hierarchy and then acted surprised when the center of gravity moved.

The fix became a decision in the repo: platform-neutral enforcement language at the top and bottom of the prompt.

You are a DISPATCHER, not a DOER. Every task that needs domain expertise MUST be dispatched to a specialist agent.

That sentence isn’t interesting because it’s clever. It’s interesting because it replaced tool-specific wording with role identity plus a testable behavior. CLI dispatch uses one mechanism. VS Code dispatch uses another. The rule stays the same.

Prompt architecture is architecture. Eventually it deserves the same discipline as code.
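One way to apply that discipline is to assemble the prompt in code and test its shape like any other artifact. This is a sketch, not how Squad builds its coordinator prompt; `buildCoordinatorPrompt` and `hasBoundaryRule` are hypothetical helpers.

```typescript
const DISPATCHER_RULE =
  "You are a DISPATCHER, not a DOER. Every task that needs domain expertise " +
  "MUST be dispatched to a specialist agent.";

// Place the enforcement rule at both boundaries of the prompt, so that no
// amount of reference material in the middle can bury it.
function buildCoordinatorPrompt(sections: string[]): string {
  return [DISPATCHER_RULE, ...sections, DISPATCHER_RULE].join("\n\n");
}

// A check a CI test could run against any generated coordinator prompt.
function hasBoundaryRule(prompt: string): boolean {
  const trimmed = prompt.trim();
  return trimmed.startsWith(DISPATCHER_RULE) && trimmed.endsWith(DISPATCHER_RULE);
}
```

Once the rule's position is asserted in a test, someone appending another template to the end of the prompt fails a build instead of silently shifting the center of gravity.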

Decisions are the shared brain

decisions.md is where Squad gets weirdly useful.

Every agent reads team decisions before work. Decisions are append-only, human-readable, and Git-versioned. They aren’t just notes. They’re constraints future agents inherit.

A decision might be a technical standard:

    ### Hook-based governance over prompt instructions
    **What:** Security, PII, and file-write guards are implemented via hooks, NOT prompt instructions.
    **Why:** Prompts can be ignored. Hooks are code — they execute deterministically.

Or a workflow rule:

    ### Merge driver for append-only files
    **What:** `.gitattributes` uses `merge=union` for `.squad/decisions.md`, `agents/*/history.md`, `log/**`, and `orchestration-log/**`.
    **Why:** Enables conflict-free merging of team state across branches.

Or a postmortem:

    ### Root Cause Analysis
    1. CLI-centric enforcement language created a VS Code routing gap.
    2. Prompt saturation buried the dispatch rule.
    3. Template duplication multiplied coordinator instructions.

    Fix: Rewrite the rule as platform-neutral dispatcher identity, then reinforce it at the end of the prompt.

That’s the difference between memory and lore: Lore is something the original builder remembers. Memory is something the next spawn can load.

The custom tools follow the same pattern. Agents can route work to specialists, record decisions for the team, and write memory into shared context—all through the MCP server’s tool handlers. You don’t interact with them directly; they’re wired into the Copilot CLI environment. When an agent needs to assign a task, it calls the routing tool. When it makes a call worth remembering, it calls the decision tool. When it learns something the team should know, it calls the memory tool.

The point isn’t that the tools are fancy. It’s that coordination becomes an artifact, not a side effect of chat.

The first real failure: Append-only optimism

For about a week and a half, CI/CD was chaos. Too many agents were landing work simultaneously. Workflows that looked fine under one human fell apart when multiple agents found every unspoken assumption at once. YAML is where assumptions go to wear a fake mustache. Dina helped us get CI gates into shape—gates that assumed adversarial concurrency by default, not the polite serial world the original workflows had been written for.

Then we hit file corruption.

Multiple agents wrote to the same append-only files at nearly the same time. Each write was locally reasonable. Together, they produced garbage. Git didn’t save us because not every collision becomes a clean conflict. Sometimes both sides look valid, and the result is nonsense.

The fix was a drop-box pattern:

    agent A ─┐
    agent B ─┼──> .squad/decisions/inbox/*.md ──> Scribe merge ──> decisions.md
    agent C ─┘

For files where union semantics are safe, .gitattributes handles the low-value conflict class:

    .squad/decisions.md merge=union
    .squad/agents/*/history.md merge=union
    .squad/log/** merge=union
    .squad/orchestration-log/** merge=union

But union merge isn’t a philosophy. It’s a tool. Canonical state still needs an owner. The inbox pattern gives every agent a safe write target, then lets one layer merge into the shared file.
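The merge step itself can be small. The sketch below shows a single-writer merge from inbox files into the canonical decisions file, assuming file contents are passed in as strings. It is our illustration of the pattern, not the Scribe's real implementation; `InboxFile`, `MergeResult`, and `scribeMerge` are hypothetical names.

```typescript
interface InboxFile {
  path: string; // e.g. ".squad/decisions/inbox/keaton-use-pnpm.md"
  body: string;
}

interface MergeResult {
  decisions: string;     // new contents of decisions.md
  mergedPaths: string[]; // inbox files that can now be removed
}

// Single-writer merge: only this function touches the canonical file.
function scribeMerge(decisions: string, inbox: InboxFile[]): MergeResult {
  // Sort by path so the result does not depend on arrival order.
  const ordered = [...inbox].sort((a, b) =>
    a.path < b.path ? -1 : a.path > b.path ? 1 : 0,
  );
  let merged = decisions === "" || decisions.endsWith("\n") ? decisions : decisions + "\n";
  const mergedPaths: string[] = [];
  for (const file of ordered) {
    const entry = file.body.trim();
    if (entry === "") continue; // nothing to record
    // Skip entries that are already present, so reruns are idempotent.
    if (!merged.includes(entry)) merged += `\n${entry}\n`;
    mergedPaths.push(file.path);
  }
  return { decisions: merged, mergedPaths };
}
```

Agents never contend for `decisions.md` because each writes its own uniquely named file; the only component that appends to the shared file does so sequentially and can be rerun safely.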

Tamir pushed hard on this class of problem. Brady was still in the “this is a neat framework” headspace. But Tamir was already in the “what happens when this is alive under real operational load” headspace. That changed the design. Memory lifecycle rules. Compaction policies. Review gates. State isolation. The boring boundary work.

Boring is a compliment here.

Governance can’t only be a prompt

This was the next lesson, and it keeps repeating:

If a prompt says, “Do not write outside src/**,” you have a request.

If a pre-tool hook blocks the write before execution, you have a boundary.

The Squad SDK hook pipeline is the move from prompt-level governance to deterministic governance:

    import { HookPipeline } from '@bradygaster/squad-sdk/hooks';

    const pipeline = new HookPipeline({
      allowedWritePaths: ['src/**/*.ts', '.squad/**', 'docs/**'],
      blockedCommands: ['rm -rf', 'git push --force', 'git reset --hard'],
      scrubPii: true,
      reviewerLockout: true,
      maxAskUserPerSession: 3,
    });

The hooks run around tool execution:

    agent tool request
        ↓
    pre-tool hooks
      - file-write guard
      - shell command restriction
      - ask-user rate limiter
      - reviewer lockout
        ↓
    allowed tool execution
        ↓
    post-tool hooks
      - PII scrubber
      - audit/logging
        ↓
    result returned to agent

Reviewer lockout is the cleanest example:

    const lockout = pipeline.getReviewerLockout();
    lockout.lockout('src/auth.ts', 'Backend');

    // Later, Backend tries to edit src/auth.ts.
    // The pre-tool hook blocks before the edit runs.

This encodes a review decision into runtime state. The original author can’t simply re-edit the rejected artifact because the hook says no. A different agent or a human has to take over.
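To make the difference between a request and a boundary concrete, here is a self-contained sketch of a pre-tool write check that combines an allowed-path guard with a reviewer lockout. It does not use or reproduce the Squad SDK; `WriteGuard`, `globToRegExp`, and `Verdict` are hypothetical, and the glob handling is a minimal approximation.

```typescript
type Verdict = { allowed: true } | { allowed: false; reason: string };

// Minimal glob support: "**/" spans zero or more directories, "**" spans
// anything, "*" stays within one path segment. Illustrative only.
function globToRegExp(glob: string): RegExp {
  let out = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) { out += "(?:.*/)?"; i += 3; }
    else if (glob.startsWith("**", i)) { out += ".*"; i += 2; }
    else if (glob.charAt(i) === "*") { out += "[^/]*"; i += 1; }
    else { out += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&"); i += 1; }
  }
  return new RegExp("^" + out + "$");
}

class WriteGuard {
  private readonly allowed: RegExp[];
  private readonly lockouts = new Map<string, Set<string>>(); // file -> agents

  constructor(allowedWritePaths: string[]) {
    this.allowed = allowedWritePaths.map(globToRegExp);
  }

  // Record that `agent` may no longer edit `file` (for example, after a rejected review).
  lockout(file: string, agent: string): void {
    const agents = this.lockouts.get(file) ?? new Set<string>();
    agents.add(agent);
    this.lockouts.set(file, agents);
  }

  // Runs before the write tool executes; the tool never runs on a denial.
  check(agent: string, file: string): Verdict {
    if (file.split("/").includes("..")) return { allowed: false, reason: "path traversal" };
    if (!this.allowed.some((re) => re.test(file))) {
      return { allowed: false, reason: "outside allowed write paths" };
    }
    if (this.lockouts.get(file)?.has(agent)) return { allowed: false, reason: "reviewer lockout" };
    return { allowed: true };
  }
}
```

The agent's prompt never appears in this code. Whatever the model decides to attempt, the verdict depends only on the path, the agent name, and recorded state, which is what makes it a boundary.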

That is the direction we want agent systems to move: more policies enforced at the boundary, fewer policies whispered into the prompt and hoped for.

Memory classes, or: Stop loading the junk drawer

Tamir has a line Brady wishes he had written:

The more your agent remembers, the less room it has to think.

That’s not a metaphor. It is a context budget problem.

Early Squad memory was too eager. Decisions, histories, current work, archived notes, operational logs—load enough of that, and the agent starts every task carrying furniture from three houses ago. It has more context and less signal.

The governed-memory work in PR #1145 made this explicit. Memory has classes and load guidance:

    export type MemoryClass =
      | 'TRANSIENT'
      | 'LOCAL'
      | 'DECISION'
      | 'POLICY'
      | 'COPILOT_MEMORY'
      | 'FORBIDDEN';

    export type MemoryLoadGuidance = 'ALWAYS' | 'ON-DEMAND' | 'ARCHIVE' | 'NEVER';

The architecture matters because compaction is lossy. If you summarize too little, every task drags stale context. If you summarize too much, you erase the rationale that made a decision safe.

The compromise isn’t one memory store. It’s a memory policy:

    TRANSIENT       short-lived task state; expire aggressively
    LOCAL           agent-scoped learning; load for that agent
    DECISION        shared team judgment; preserve rationale
    POLICY          hard operating rule; load broadly
    COPILOT_MEMORY  host/runtime memory; bridge carefully
    FORBIDDEN       never load; usually sensitive or irrelevant

    ALWAYS          hot path; small and high signal
    ON-DEMAND       searchable; load when task demands it
    ARCHIVE         retained for audit/history, not context
    NEVER           excluded from agent context

In the PR #1145 benchmark, governed memory cut agent context by roughly 55% (3,540 → 1,601 bytes) while keeping recall at 1.0. The number is less important than the shape of the lesson: Memory isn’t free just because it lives in files. Loading memory is a design decision.
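Here is a sketch of what "loading memory is a design decision" can look like as a filter, plus the arithmetic behind the 55% figure. The selection rules, the `MemoryEntry` shape, and `selectForContext` are our assumptions for illustration; they are not taken from PR #1145.

```typescript
type MemoryClass =
  | "TRANSIENT" | "LOCAL" | "DECISION" | "POLICY" | "COPILOT_MEMORY" | "FORBIDDEN";
type MemoryLoadGuidance = "ALWAYS" | "ON-DEMAND" | "ARCHIVE" | "NEVER";

// Hypothetical entry shape; only the two union types come from the article.
interface MemoryEntry {
  id: string;
  memoryClass: MemoryClass;
  guidance: MemoryLoadGuidance;
  owner?: string; // agent that owns LOCAL memory
  tags: string[];
  bytes: number;
}

// Decide what enters an agent's context for one task.
function selectForContext(entries: MemoryEntry[], agent: string, taskTags: string[]): MemoryEntry[] {
  return entries.filter((e) => {
    if (e.memoryClass === "FORBIDDEN") return false;
    if (e.guidance === "NEVER" || e.guidance === "ARCHIVE") return false;
    if (e.memoryClass === "LOCAL" && e.owner !== agent) return false;
    if (e.guidance === "ALWAYS") return true;
    // ON-DEMAND: load only when the task touches a matching topic.
    return e.tags.some((t) => taskTags.includes(t));
  });
}

// Percentage reduction in context size.
function reductionPercent(beforeBytes: number, afterBytes: number): number {
  return ((beforeBytes - afterBytes) / beforeBytes) * 100;
}

// (3,540 - 1,601) / 3,540 ≈ 54.8%, which rounds to the reported 55%.
const reported = reductionPercent(3540, 1601);
```

The filter is short on purpose. The hard part is assigning the class and guidance to each entry, because that assignment decides which rationale survives into the next task.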

What still breaks

Role drift isn’t solved. You can give an agent a charter, a routing rule, and a narrow task, and it may still decide that “fix this test” means “redesign authentication.” Sometimes that’s initiative. Sometimes that’s nonsense with confidence.

The mitigations stack:

charter boundaries + routing rules + scoped tools + file-write guards + reviewer lockout + CI gates + human review

No single layer is enough. That is the pattern.

Parallelism is also not free. More agents means more throughput and more coordination pressure. You find hidden global state. You discover which scripts assume serial execution. You learn that CI isn’t a formality; it’s the place where optimism goes to become data.

Prompt saturation is real. Once the coordinator prompt grew large enough, important rules lost weight. The fix wasn’t more prose. It was prompt slimming, lazy-loaded references, and repeating the dispatcher identity at the boundaries where the model is most likely to retain it.

Memory compaction remains hard. The failure mode is subtle: The agent isn’t obviously broken. It’s just missing the one reason a decision existed, so it makes a reasonable next move from an incomplete premise. Those are the expensive bugs because they look thoughtful.

And yes, people get attached to agents. Names, roles, continuity, and history trigger social instincts. We like the human side of that. We also don’t want to confuse it with agency in the human sense. These are tools with goals, context, and behavioral continuity. They do not have inner lives. Trust should come from inspectable behavior, not personality.

What we would steal from this architecture

If you’re building agent infrastructure, we wouldn’t start by copying Squad wholesale. We would steal these patterns:

  1. Disposable workers, durable artifacts. Let sessions die. Keep decisions, histories, traces, and outputs somewhere reviewable.
  2. Decision logs as runtime input. Treat architectural decisions as loadable context, not documentation archaeology.
  3. Drop-box writes for parallel agents. Don’t let every agent append to the canonical shared file. Give them individual write targets and merge intentionally.
  4. Prompt rules for intent, hooks for enforcement. Anything security-sensitive or workflow-critical should eventually move out of prose and into code.
  5. Memory classes. The question isn’t, “Should the agent remember this?” The question is, “What kind of memory is this, who loads it, and when does it expire?”
  6. Routing as a first-class design surface. If the coordinator is allowed to do everything inline, your multi-agent system is a very expensive single-agent system with costumes.
  7. Keep the human on the hook. The system can delegate, parallelize, and preserve context. It shouldn’t launder accountability.

These patterns aren’t engineering-specific because the substrate isn’t a codebase—it’s the repo. Swap the artifacts, and the seven still hold.
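Pattern 3 is the easiest to sketch. Below, each agent appends only to its own inbox, and a separate merge step folds the inboxes into the canonical log in a deterministic order. The in-memory `Inboxes` record and both function names are hypothetical stand-ins for files in the repo, not Squad's API.

```typescript
// Hypothetical in-memory stand-in for per-agent inbox files.
type Inboxes = Record<string, string[]>;

// An agent writes only to its own inbox, never to the shared log.
function dropNote(inboxes: Inboxes, agent: string, text: string): void {
  if (!inboxes[agent]) inboxes[agent] = [];
  inboxes[agent].push(text);
}

// Merge intentionally: fixed agent order, duplicates skipped, inboxes drained.
// Returns a new log rather than mutating the canonical one in place.
function mergeInboxes(inboxes: Inboxes, canonical: string[]): string[] {
  const merged = canonical.slice();
  for (const agent of Object.keys(inboxes).sort()) {
    for (const text of inboxes[agent]) {
      const line = "[" + agent + "] " + text;
      if (merged.indexOf(line) === -1) merged.push(line);
    }
    inboxes[agent] = [];
  }
  return merged;
}
```

Because the merge is a single step with a fixed order, its output is the same no matter which agent finished first, and the result is one reviewable diff.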

Squad isn’t only an engineering tool

Worth saying out loud, because the .ts code blocks above can mislead: Nothing in this architecture is engineering-specific. The substrate is the repo, not the codebase. Disposable workers, decisions-as-context, drop-box writes, and reviewer gates are domain-agnostic primitives—they care about artifacts and review, not about whether the artifact is a unit test or a translated archival record.

Tamir used the same scaffolding to run a Holocaust family-research project—agents coordinating archival lookups, translation passes between Yiddish, Polish, and Hebrew sources, and cross-corroboration of names across registries, with .squad/decisions.md acting as the working ledger of what had been established and what was still contested. No code was being shipped. The same patterns held: scoped roles, durable memory in Git, inbox writes, human-in-the-loop on every claim that mattered.

We’ve had the pleasure of working through a few other non-coding Squad scenarios. In one case, a sales team we support asked us to implement a “Sales Squad” and provided context and sales training documentation to help us build it. In another organization, a general manager of program and product managers created a “think tank” squad that goes out and does product-market fit research and suggests, on a daily basis, areas her team should investigate.

The bet underneath Squad is that this should be how a small group of humans—engineers, researchers, journalists, anyone who works with evidence—pulls coordinated work out of agents. Democratize the orchestration, not just the model access. Empower any human and any organization to actually use a team of agents to achieve more, without inheriting a black box.

Try it

The repository is here: github.com/bradygaster/squad.

The shortest path is the CLI plus Copilot. No SDK required.

npm install -g @bradygaster/squad-cli
squad init

Then open GitHub Copilot—CLI or VS Code, your call—and give the coordinator agent the shape of the project:

I'm starting a new project. Set up the team. Here's what I'm building: a recipe sharing app with React and Node.

The coordinator writes .squad/. You review the diff. That’s it.

If you want to go deeper—programmatic team composition, custom routing rules, embedding Squad inside your own tooling—the SDK is the next layer:

npm install @bradygaster/squad-sdk

Start with a small repo. Commit .squad/. Inspect every diff. Let the agents write decisions. Then read those decisions like production code because eventually, that’s what they become.

If you build something useful, alarming, hilarious, or weird, open an issue. Tamir and I read them.

Stay a builder.

Introducing Microsoft’s EngThrive framework: Understanding developer productivity in the agentic AI era

The AI era has put developer productivity under a spotlight. 

Engineering leaders everywhere are asking the same deceptively simple question: Are developers becoming more productive with AI? On paper, the answer can look obvious. Studies show AI coding assistants can reduce the time required for certain coding tasks by up to 56%. A recent study found that engineers using GitHub Copilot completed about 40% more code changes in the weeks they used it heavily compared to weeks they didn’t use it at all. More code is being produced than ever, and AI usage is rising quickly.  

But there’s a problem: Productivity is not a simple measure of developer activity—it is a measure of our ability to deliver outcomes.    

To understand productivity, we need to look at the holistic developer experience, and we need to understand how developers spend their time. At Microsoft, we’ve transformed how we understand and improve developer productivity through Engineering Thrive (EngThrive). The core idea is simple: Make it fast and easy to build great products. 

EngThrive helps us understand productivity by creating a set of core metrics focused on Speed, Ease, Quality, and Thriving. Together, these dimensions give us a language for evaluating not just developer tools and AI, but the broader systems that shape engineering work: infrastructure, organizational design, workplace policy, and culture. 

This focus matters now more than ever, because AI is changing the meaning of engineering activity. Code volume, PR counts, and task completion are changing wildly. But the outcomes we care about remain largely the same: speed of delivery, sustainable engineering systems, quality, and customer value. 

Software engineering is more than coding

Everyone knows that AI can make coding dramatically faster. Yet developers across the industry only spend roughly 15% of their day writing new code. Even after including testing and debugging, most studies put hands-on coding work at only 25–30% of a developer’s total time.  

The vast majority of developer time is spent on tasks both inside and outside the SDLC, ranging from “keep the lights on” tasks (operational work, software updates and maintenance) to organizational responsibilities (meetings, compliance, administrative tasks), technical planning (design docs and reviews), and much more.

At Microsoft, we’ve run studies internally and cross-industry to understand where developers spend their time. The diagram below is based on a recent analysis of developer workflows, and it highlights the full breadth of work it takes to plan, create, and operate software at scale.

While the exact distribution varies by organization, the pattern is surprisingly consistent across the industry: Coding is only a fraction of an engineer’s workload, and a wide variety of tasks consume the vast majority of developer time and energy. 

This diagram reminds us that improving productivity first requires us to understand where we spend time and energy. We then use that understanding to target and improve the factors that create toil, repetition, or classes of work that can be accomplished via automation/AI. 

The EngThrive model

EngThrive approaches productivity as a system composed of three interacting dimensions (Speed, Ease, Quality) with a fourth layer (Thriving) acting as a guardrail.  

Rather than measuring isolated engineering activities, EngThrive measures the health of the engineering system and how it impacts developer journeys: 

  • How quickly do ideas become customer value? 
  • How much toil and friction do developers experience? 
  • Does quality remain sustainable? 
  • Can teams operate effectively without burning out? 

This becomes especially important in AI-assisted engineering environments. As AI tools mature, the meaning of traditional engineering artifacts starts to change, but the underlying organizational questions remain remarkably stable: 

  • How long does it take to turn ideas into impact? 
  • Where do organizational toil and friction slow teams down? 
  • Can developers consistently do high-quality work without the system fighting against them? 
  • Are we shipping sustainably? 

These are the questions EngThrive helps us understand, identify, and then improve. 

Activity metrics are not outcome metrics.

The COVID productivity paradox 

The danger of equating activity with productivity becomes clearest in moments when the metrics tell conflicting stories. 

During the first months of mandatory remote work in response to COVID-19 in 2020, three things happened at Microsoft simultaneously: Pull requests per developer increased by more than 20%, the company’s stock price rose over 15%, and 78% of developers reported feeling burned out during the same period. The first two metrics painted a glowing picture. The third revealed a more troubling reality. 

Productivity signals routinely diverge, and the signals you pay attention to matter. In the above example, if you focused on activity metrics, engineering looked extraordinarily productive. If you looked at business metrics, everything was on track. If you looked at human outcomes, the system was failing.  

This reveals a fundamental measurement problem: Organizations often track activities (lines of code, pull requests, tasks) and treat them as proxies for outcomes (value delivered, speed, quality). But they are not the same. Conflating the two produces systems that are precise but wrong, leading to metric gaming and unintended behaviors that move organizations away from desired outcomes. 

That’s the core insight at the heart of EngThrive: We focus on a triad of outcome metrics—Speed, Ease, and Quality—and only use activity metrics to help us understand changing patterns. 

Productivity measures systems, not individuals 

EngThrive deliberately avoids treating productivity as a way of measuring individuals. With metrics focused on Speed+Ease+Quality, it makes no sense to ask, “Did this individual developer have faster build times?” Instead, we understand that project build times are an essential component of “Speed,” and we look for places where those metrics are struggling. 

That distinction matters because most productivity problems are system problems—and even the highest performing individual, in the context of a slow/toilsome system, will only reach a tiny fraction of their capability. 

That is also why AI adoption outcomes vary so dramatically across organizations. The teams seeing the biggest gains from AI are the ones using AI to specifically target the drivers that impact Speed, Ease, and Quality. They’re the teams using AI to lower operational friction, improve onboarding, accelerate feedback loops, and enable engineers to spend more time and energy on innovation. 

The takeaway

EngThrive is a concrete model for organizations that want to move beyond simply measuring activity toward improving outcomes. 

The engineering teams that win in the AI era probably won’t be the ones generating the most code. They’ll be the ones best at reducing organizational friction around humans working with increasingly capable AI systems. And that’s a fundamentally different optimization problem than most companies are currently tracking. 

Read the paper to learn more about EngThrive, its outcome-oriented North Star metrics, its diagnostic submetrics, and how it combines developer surveys and system telemetry to arrive at insights with both scale and context. 

How Excel got agentic: An interview with Microsoft Director of Science Mukul Singh

When Mukul Singh made the jump from pure research into product, it was a leap of faith. But he had an idea that he wanted to bring to life: delivering agentic AI capabilities in Excel.

While this was well before buzzwords like “the agentic AI era” had cultural cachet, the research was already headed in that direction. So armed with a prototype and a healthy dose of ambition, Singh made his pitch. 

Two years on, he’s fully transitioned from his role in Microsoft Research to a new gig in the Office Product Group and successfully delivered the ability to edit with Copilot in Excel (previously known as Excel Agent Mode). Not ones to rest on their laurels, the team quickly went on to ship agentic capabilities in PowerPoint, Word, and Outlook. It’s changing the way people get work done, and it started with the hypothesis that Excel, at its heart, is a low-resource programming language.

We sat down with Singh to learn more about his journey from research to product, the science and research behind Office’s new agentic capabilities, and why Excel was the perfect testbed for agentic AI. 

CL Command Line

To kick us off, tell us a little about yourself.

MS Mukul Singh

I’m a researcher in the Office Product Group team. I recently started a science team focusing on agents and AI for the Microsoft Office product portfolio. I was originally in Microsoft Research, working on research full-time, publishing papers, and very far away from the product world. To be honest, it’s been an incredible journey getting to go from research to then being embedded in the product space.

CL Command Line

How long ago did you make that transition?

MS Mukul Singh

It was only two years ago—and right at the cusp of when the Excel Agent Mode work started. In fact, that was the catalyst. I had no intention of ever moving out of academia and deep research. And, you know, Microsoft Research is one of the best labs in the industry that’s still connected to academia. So I was pretty okay with my life. I didn’t want anything else. It was the most perfect blend that I could hope for.

And then this project in Excel started, and there were some initial discussions. I thought it sounded interesting, because one of the things in research that you always feel is missing as a gap is that you don’t see your work actually delivering value. You see it deliver a lot of theoretic value—you see its shape and the direction of the world, per se. Other people might take your research direction and extend it in meaningful ways. Research does shape society in a way, but you don’t see any immediately measurable impact.

So at that point, I felt like I wanted to work on something where I could see that happen—where I could watch the impact unfold in front of me.

CL Command Line

What are some of the research questions that ultimately led to Excel Agent Mode (which we’re now calling edit with Copilot in Excel)?

MS Mukul Singh

My research in MSR and all of my papers previously had all been about AI for low-resource programming languages. I like to explain low-resource coding languages as languages that are very obscure and make up less than like 0.1% of the entire coding community. So AI models are generally bad at them because they just learn off the internet and known sources.

To give you an example, internally at Microsoft there’s a language called Kusto Query Language (KQL). That’s just used for telemetry querying. Now, it’s a public language; it’s published. But no one outside of Microsoft uses it because why would they? We designed it for our database systems. So that’s the type of language that the models are bad at.

All of my research was looking at how we could make models good at these languages, which they are not naturally good at. We need to drop in hints, cues, documentation, give it retry mechanisms and everything.

When I was initially approached about this work in Excel, I was at first very skeptical of how that and my work might be related. I’ve done a lot of tabular research, sure, but it’s not Excel. But then they drew the parallel that, actually, Excel is just another low-resource programming language. It has its own internal language. It has its own engine. It’s just that the model doesn’t know it.

Excel is just another low-resource programming language. It has its own internal language. It has its own engine. It’s just that the model doesn’t know it.
– Mukul Singh

CL Command Line

Yeah, in your LinkedIn post, you talked a little bit about how Excel functions like a low-resource programming language. What really surprised you about the project when you dug in?

MS Mukul Singh

So the vision originally for Excel Agent Mode—and, by the way, kudos to the team and their thinking. This was before any agents. Today, everything is agents, right? You can automate, you go to an app and assume that there’s a button somewhere that will do it for you. But that didn’t use to be the case. Excel, I just didn’t think of it as an AI-forward app. I thought of Excel as an app that adapts, right? That doesn’t need AI. But the vision at that point was that the team wanted to automate end-to-end user workflows.

If I open Excel, anything I have in mind that I want to do, this agent I’m building should be able to do it. That was a very difficult vision to achieve. In fact, that’s the reason that this project took so long and had so many setbacks—because we set our ambition up so high, before the models even had the capability to be good enough to do anything like this at scale.

Say you want to add a pivot table, right? That’s a common task. I don’t know how to do it, but I know people do it. It’s just like a five- to 10-line code snippet that, wherever someone clicks the button, “Create pivot table,” internally that code is run. Now, we could connect the model and it could generate all of that code, which is just some random JSON and JavaScript objects. To us, it’s like gibberish text, but it’s completely deterministic and exactly controls the behavior. And the good thing about Excel is that all of its surface is programmable.

This is not true for all apps, by the way. Very few of Microsoft’s apps are truly programmable, but Excel—you have to hand it to the engineers 40 years ago when they were setting it up: They made sure that everything is programmable end-to-end.

CL Command Line

What are some of the twists and turns that the work has taken over the last two years?

MS Mukul Singh

When we started this project, the vision really hit home. We had insane videos that the product managers and designers made, showing how the product is today. But this was two years ago, right? And the videos were the same.

So you see the disconnect was that everyone, leadership bought in. Like, this is the future. We want to invest in this. It’s the right thing to do. And we got a very strong crew of people. I was brought in. We were all working on it, but the reality was that the models weren’t good enough. This was before the age of the reasoning models. The best model we had to work with was 4.0, and anyone who has played with models knows that 4.0 just cannot do long chain tool calling.

It used to collapse after a couple of calls, and it was only able to do things like start a formula, start a problem, which at that time was still cool and—bless our marketing team, they were able to make that such a good feature with the help of design and everything. But even that, to us, felt like, “Oh, we’re falling short.” And they were like, “No, this is still great.”

The industry just wasn’t here yet. That was, I feel, the weakest moment of the project where there was this doubt. Like, is this just too aspirational? Is it just not possible? Did we predict the future wrong, or will the models not even get there for like 20 years? We just didn’t know.

We had promised a lot to customers on the backing of these researchers’ point of view, and now we were in a space where we didn’t feel confident that we could deliver. There were debates about whether we should just cut this project and rebrand it into something completely different.

So it was indeed quite the journey. We were shooting way above our weight class with very little evidence other than just the intuition of maybe three people who said, “We think this is where the world will end up.”

CL Command Line

When did you get a sense that the models were headed in the right direction in terms of reasoning and long chain tool calling?

MS Mukul Singh

I think that’s a very good question. When OpenAI announced their first series of reasoning models, that was a pivotal moment for us. We tried the model, and it couldn’t solve any of the promised scenarios we had. But the traces showed signs of life. Like in the pivot table example: It tried to generate the pivot table code. It ran it—it just gave back an error that the area where you’re trying to create the pivot table already has data in it.

Now, the previous models at this point would either keep looping in this same error, just give up, or do some other random gibberish text response and not recover. But o1 tried to recover. That was the first sign of life that, OK, this is now working in a chain. It does something, looks at feedback, and tries to recover and keep trying until it succeeds.

It went up to like 20, 30 tool calls, still not able to solve real complex tasks. But we thought that if we reinforced it with the right information, it might be able to do something.
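The recovery behavior described here (act, read the error, adjust, try again within a tool-call budget) can be sketched against a toy sheet. Everything below is a hypothetical simulation: `makeSheet`, `createPivotTable`, and `runWithRecovery` are illustrative names, not Excel’s or Copilot’s API, and a fixed list of candidate destinations stands in for the model’s next decision.

```typescript
interface ToolResult {
  ok: boolean;
  error?: string;
}

interface Sheet {
  createPivotTable(destination: string): ToolResult;
}

// Toy stand-in for a spreadsheet: it only tracks which anchor cells are occupied.
function makeSheet(occupied: string[]): Sheet {
  const cells: Record<string, boolean> = {};
  for (const cell of occupied) cells[cell] = true;
  return {
    createPivotTable(destination: string): ToolResult {
      if (cells[destination]) {
        return { ok: false, error: "destination " + destination + " already has data" };
      }
      cells[destination] = true;
      return { ok: true };
    },
  };
}

interface RunResult {
  placedAt: string | null;
  calls: number;
  trace: string[];
}

// Act, read the feedback, adjust, and stop at a tool-call budget.
// Walking a fixed candidate list is a stub for the model's next decision.
function runWithRecovery(sheet: Sheet, candidates: string[], maxCalls: number): RunResult {
  const trace: string[] = [];
  let calls = 0;
  for (const destination of candidates) {
    if (calls >= maxCalls) break;
    calls += 1;
    const result = sheet.createPivotTable(destination);
    if (result.ok) {
      trace.push("placed pivot table at " + destination);
      return { placedAt: destination, calls, trace };
    }
    trace.push("error: " + result.error);
  }
  return { placedAt: null, calls, trace };
}
```

The trace is the part that matters: each failed call leaves an error the next step can read instead of repeating the same mistake.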

CL Command Line

This is backtracking a little bit, but when you ran up against the limitations of the models—when it seemed like maybe you were headed down the wrong path and the research had it wrong—what kept you going?

MS Mukul Singh

I’ve been in the space of generating low-resource programming languages for a while. I had even published papers for Excel in general, so I know the ecosystem. Everything’s built on top of Excel. So based on all of the research, we had a very strong intuition that this is where the models were headed. And our parallel was very simple: that coding agents, like people using GitHub Copilot and VS Code, and Excel should intuitively be no different.

Their implementation and all the UX decisions are different, but they are the same in principle—it’s a surface in which I do manual work. I want that automated, and I’m doing that through code. Our parallel was always that Excel is an IDE. People don’t see it as an IDE. They see it as an app, but that’s just because it’s presented as an app. It’s really just an IDE. For analysts, it’s the equivalent of VS Code to a developer.

CL Command Line

You’ve been talking about this, but I’m going to ask the question directly: What is it that makes Excel a really solid testbed for agentic AI?

MS Mukul Singh

I truly believe that Excel is the first large-scale enterprise app that was actually automated by an agent—meaning that the agent can take meaningful action in the app. So it doesn’t have fixed workflows. It’s actually controlling it. And what makes Excel the perfect testbed is that it’s programmable. All of it can be controlled by the agent.

It’s not just code, because that code has real consequences in the app that you can see. If I add a chart, I see a chart there, right? If I connect a formula to a different sheet, I see live updates. So it’s kind of a living environment. It’s a live environment in which I’m making changes. And if I make changes, the environment will undergo some change. And I’ll use that as feedback, so there’s observability. And that is the best testbed.

An agent needs an action space that allows you to interact with an environment. And it needs feedback from the environment. Now, this seems trivial, but honestly, very few apps fit this charter perfectly. So this loop made Excel very interesting. All of the research across everything we’ve discussed, it’s been over a span of 17 or 18 research papers all going back two to three years. When someone looks at it at first, they might not even connect the dots. But there’s like three papers on just the best representation of data for Excel sheets. And people will think, “Oh, that’s probably for storage and optimization and context querying.” But no, it’s just that we need a concise representation for the model.

Excel actually showed the way for a lot of the apps in the industry in general that this is how an automation pattern works. At least within Microsoft: After Excel, PowerPoint followed. Then Word followed, the same pattern. And Outlook was recently announced—same pattern, same research crew delivering all of that, one after the other, just because all of the fundamentals are there.

Excel is the first large-scale enterprise app that was actually automated by an agent.
– Mukul Singh

CL Command Line

So your team was involved in expanding the editing with Copilot functionality across the Microsoft Office product suite?

MS Mukul Singh

Yeah. After my stint at Excel, I kind of figured out that, you know what? This is kind of fun. In fact, it’s very similar to research where you get a vague abstract problem, you have no guidance, and you just try to go solve it. Talk to people. Figure things out. It’s like the same environment, just the difference is that when it actually ships, you feel a sense of accomplishment, and you can point people to it. Like I can tell an Excel guru, “Open that side panel. Try a query.” You’ll be amazed, and I built that.

So the moment that Excel was in a stable state, we handed it off back to the feature team—the PMs and designers—to run with it. And then we moved on to the next surface, PowerPoint, and that seemed even more interesting. Ever since I moved into a Director of Science role inside the Office Product Group, where I’m overseeing all of the agent work, I’ve realized that it’s all based on very similar patterns. And it just so happens that all of my research aligned very closely with where the industry headed because, let’s be honest: Eight years ago, when I was starting my research for low-resource programming languages, at that time, there were just embedding models. At the time, it just seemed interesting. We had no idea the shape that agents would take.

Going into this field, the models were already good at good languages, so what could I contribute there? The only place where I could meaningfully contribute was with the things the models were bad at. So I literally made a list of everything these models struggled with, and it just stood out that programming seemed useful. If it can write good code, someone somewhere will be happy.

CL Command Line

Were there any hiccups along the way? Did you run into any unexpected technical challenges?

MS Mukul Singh

At first, we ran into the faster horses trap. We’d show customers these videos and ask, “As an analyst, will you be more productive with this tool?” And everyone said, “It needs to come back with an answer in 30 or 40 seconds. That’s how long I can wait. There’s no way in the world that I’m going to sit there for 10 minutes and leave it running. No one works like that. I’d cancel that operation after 30 seconds.”

Out of every customer we talked to, not a single person had ever said that they were willing to go beyond one minute of wait time.

CL Command Line

Right.

MS Mukul Singh

So we were optimizing for that. It needs to complete faster. If the model is making three errors and recovering, that’s not good enough. It shouldn’t make those three errors. And that sent us down paths of optimization that are very tricky. Like today, they’re still not solved—they’re that difficult. But we thought that was non-negotiable for the product.

Then this company called Shortcut came out, and their app ran for 25 minutes on a single query. But the results were amazing. Initially people said, “Who’s going to wait 25 minutes?” But within a week, everyone said, “I now just log in, push the query, and just leave. I don’t need to do like anything.”

And we were like, what? But for every one of our customers that we talked to, their mindset immediately changed. They were asking for faster horses, but then someone delivered them a car. And now they understand that that’s exactly what they needed.

They were asking for faster horses, but then someone delivered them a car.
– Mukul Singh

CL Command Line

That’s really interesting.

MS Mukul Singh

The challenges were really around customer expectations. We went with the assumption that all our apps are IDEs. But the hypothesis starts to break down because people use these apps with nuance. And the moment we fail to understand that, it goes from a good to a bad experience. For Excel, it was that people are willing to wait for long-running tasks. But they need auditability. It can’t just work behind the scenes—it has to show its work. It needs to be computed on the sheet.

Similarly, for PowerPoint, people really care about their brands and templates. Initially the version of the PowerPoint agent that we shipped, which if you asked me to rank them, I would say it’s the best version that we built—that was the one that everyone hated. We had optimized for a PowerPoint agent that works with its own independence, like in a vacuum—because that’s the way I use PowerPoint. I start from scratch, no templates. I just have some information that I need to present, and I want to find the best way to do that. But there were other very real constraints that we had to understand. So PowerPoint really pivoted to take into account templates and brand guidelines.

Similarly for Outlook, we found that people really care about confirmation. Like with Visual Studio and all these IDEs, they try to optimize for the least user engagement required, because that’s automation, right? You can leave it running and you don’t have to worry about it unless there’s something that you really need to look at. So they really try to cut down on the model asking something back.

But in Outlook, the story is completely the reverse. People actually love it when the model gives them a list of things that it proposes, and you can just accept or delete each of them in a list. So like:

  • Send this draft email to this person? Accept.
  • Create this to-do list? Accept.
  • Delete these seven emails? Reject.

And, again, it was an anti-pattern.
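The propose-then-confirm flow in that list can be sketched as a review step that runs only what the user accepted. The `Proposal` shape and the `applyReviewed` function are hypothetical, written to illustrate the pattern, and are not taken from the Outlook implementation.

```typescript
interface Proposal {
  id: number;
  description: string;
  run(): string; // performs the action and returns a short receipt
}

type Verdict = "accept" | "reject";

// Run only the proposals the user explicitly accepted; anything without a
// verdict is treated as rejected.
function applyReviewed(proposals: Proposal[], verdicts: Record<number, Verdict>): string[] {
  const receipts: string[] = [];
  for (const p of proposals) {
    if (verdicts[p.id] === "accept") receipts.push(p.run());
  }
  return receipts;
}
```

Treating a missing verdict as a rejection keeps the default safe: nothing is sent or deleted unless the user said so.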

CL Command Line

What can you tell us about the tech transfer process and how your research made its way into the product?

MS Mukul Singh

Yeah, that was actually tricky. I feel like, in general, a lot of people struggle with taking their research and landing it in a product. There’s a very weird gap, because the product teams are very direct about needing to talk to more researchers, and the research teams are keen to get the product group’s attention because everyone wants their work to be used and get funded. But there’s still this weird gap—it’s just not well connected. And the surprising thing is, it’s not because of a lack of will from either side.

It’s because product always comes with so much nuance. The product teams can’t just blindly trust that the research teams are going to understand everything and build it. And for the research team, it’s very hard to put themselves in the product team’s shoes and understand all the constraints. Maybe it needs to run in 10 seconds or less, or it needs to have auditability traces. That’s something that the research teams would feel is a very strange objective.

But at least for Excel, I think the process was smoother because we operated as an integrated team.

CL Command Line

What was the overall team dynamic like?

MS Mukul Singh

For Excel Agent Mode, it was a crew of just 10 or 11 people. Everyone had a title, but the titles meant nothing, right? There were PMs checking in PRs. There were devs writing papers. And there were researchers writing specs. There was just everyone on board working as if it were a startup—logged into a conference room and just checking in change after change. That was honestly the most fun time, when it was just people building stuff, coming up with ideas. But then we did do a proper handoff.

Taking the research into the project became easier because everyone discussed everything openly, so everyone understood the same things.

CL Command Line

What avenues for future research did this work open up?

MS Mukul Singh

I think a lot. Before this, I don’t think people were really thinking of agents for things that normal people use as apps. That just wasn’t a concept. And Excel has already given way to PowerPoint, Word, and Outlook, which are all entirely different surfaces, all just automating the app completely. There was significant uptake, even outside Microsoft, when people saw that, not only is it possible, but people love it.

This project was a testbed that showed two very distinct things: that it was possible to automate an app through these models, which previously was not the accepted truth, and that there was a lot of appetite for it.

Excel in its early days had a lot of flaws in the agent. But people overlooked a lot of that just because of those scattered moments of brilliance where they were like, “Oh my god, it did something that I couldn’t even have imagined.”

This has opened the door so that, I think over the next couple of years, all of these apps that we use day to day will have an agent—not to replace it, but that will be our go-to whenever we struggle to get something done. It showed the path, that this is possible for other apps, regardless of their complex surfaces.