Five ways we’re confusing AI capability and AI reality—and how to bridge the gap

On the evening of June 1 at San Francisco’s Bartlett Hall, Microsoft CTO Kevin Scott spoke at a joint event with Lectures on Tap, attended by approximately 150 developers, founders, media, and tech industry leaders. His talk focused on what he described as the growing perceptual disconnect between AI capability and AI reality: the tendency to mistake rapid advances in model performance for equally rapid progress in deployment, organizational transformation, trust, and real-world value creation.

Scott argued that the AI industry is at an inflection point where technical breakthroughs are arriving faster than institutions, workflows, and human systems can absorb them. While acknowledging the extraordinary pace of progress in areas like software development and agentic systems, he emphasized that the difficult challenge ahead is operationalizing these capabilities responsibly and meaningfully at scale.

Here are his five observations of the ways in which AI reality is diverging from apparent capability.

1. Capability ≠ deployment

According to Scott, one of the biggest mistakes people are making right now is confusing technical capability with real-world deployment. The fact that a model can do something impressive doesn’t mean the surrounding systems, economics, governance, and human behaviors are ready to absorb it at scale.

“Today’s AI models are actually more capable than the things we’re using them for in the real world,” he said, addressing today’s “capability overhang,” as he has dubbed it. “We just shouldn’t have uniform faith that, as AI model capabilities improve, we’re going to get this crazy fast deployment everywhere.”

2. Closed feedback loops ≠ universal progress

Scott explained that some areas of AI (like agentic software development) are improving extraordinarily quickly because tight feedback loops allow those systems to iterate, evaluate, and refine outputs at high speed. But that dynamic doesn’t automatically extend to domains constrained by physical systems, regulation, or long experimental cycles.

“One of the things models can already do is postulate new ideas for particle physics experiments,” he said. “And the problem with particle physics experiments is that they take a lot of expert technical labor to set up and run, and they require the use of extremely expensive infrastructure. So there really isn’t a convenient way—other than publications in the scientific literature—to get the output of those experiments and feed it back into an actual model.”

3. Software velocity ≠ organizational velocity

AI is dramatically accelerating software development, but that doesn’t mean organizations can suddenly move faster. In many cases, speeding up code generation simply exposes the slower-moving bottlenecks that were already present: deployment, integration, governance, and organizational change.

“I build a lot of prototypes that are greenfield, where I have no constraints whatsoever,” noted Scott. “I just get an idea and there’s nothing stopping me from using an agentic coding system to produce a brand-new thing. But in many cases, the things we want to produce are fairly highly constrained.”

Scott pointed to last-mile problems, the need for a lot of plumbing work, and human psychology as throttling issues. He also acknowledged the forecasting problem we will inevitably face as things move exponentially faster: “A lot of the stuff I’m doing right now using agentic coding to build things wasn’t even possible in November of last year,” he said. “When things are moving this fast, it’s hard for people to notice the change and snap to.”

4. Activity ≠ value

“Just because you’re using AI to create a lot of activity doesn’t necessarily mean that the activity you’re creating is valuable,” Scott said.

The ability to generate enormous amounts of output doesn’t guarantee meaningful impact. As AI lowers the cost of creation, the defining question shifts from, “How much can we produce?” to: “What is actually worth building?”

“We can have a lot of output, we can build more complex things than we built before,” added Scott. “That doesn’t necessarily mean that the things we’re building are super valuable. When they go into a user’s hand, are they solving a real problem? As developers, we have to pay especially close attention to how we measure value and to the feedback we get on the work that we’re doing.”

5. Autonomy ≠ trust

AI systems are becoming increasingly capable of operating autonomously, but autonomy alone does not create trust. Real-world deployment still requires governance, identity, access control, transparency, and meaningful human oversight.

“You’re always going to have human oversight, so this notion of autonomy is a little bit of a pipe dream,” Scott noted. “You have to build systems doing complex things in a way where people can trust that they’re doing them correctly and in a way that’s aligned with their interests and values. And that’s a new way of thinking about software.”

Bridging the gap between capability and reality

Ultimately, said Scott, “There’s a lot of work for all of us to do over the next months and years to fully unlock the potential of this crazy tool that we’ve built collectively. These problems that I enumerated don’t go away just as a function of scaling up an AI model. There is no silver bullet. That means there’s a bunch of technical work to be done, a bunch of societal work, a bunch of organizational work, and just dealing with legacy systems and plumbing.”

AI capability gains will continue, but turning those gains into trusted systems that create meaningful, durable value is the harder and more important work. And that work starts today.

“We need to engage more intensely than ever before,” said Scott, “because we can see the promise of this technology to benefit the world if we’re able to overcome these obstacles.”

Introducing Command Line, and the new rules for builders

Welcome to Command Line, a new blog where we’ll share what Microsoft builds, how our technical teams operate, and what we learn along the way. Build 2026 felt like the right time to roll this out, as we bring together engineering leaders from around the world to dive deep on building, deploying, and operating scalable AI systems. For those on the ground in San Francisco, you’ll notice that the vibe has shifted along with the locale. And that seems only fitting since the entire SDLC has changed dramatically, too.

To kick things off officially on Command Line, I thought I’d share some of the things we’ve learned about building in the agentic AI era. Things are evolving quickly, and the old rules no longer apply. Manual review processes are breaking under a flood of PRs that our workflows weren’t designed to accommodate. Build-time iteration is being met with runtime learning loops as agents improve post-deployment. And the focus is shifting from shipping code to orchestrating systems.

This is more than a productivity boost. It’s an entirely different relationship with code, tools, and decision-making. Throw out the old playbook. We need to develop a new set of rules for builders.

Here are 10 things that feel important and true today. Time will tell how durable they prove to be. We’re in a time of seismic disruption, so we all need to constantly challenge our assumptions.

1. Build agent-first by default

As you develop your proficiency with AI tools, you’ll probably find yourself reaching for GitHub Copilot CLI or agent mode in VS Code more often than not. Many senior engineers on our own teams and across the industry no longer write code by hand, and even small tweaks are made via agents when it’s more efficient.

2. Context and skills are your most important asset

When you first start using an agent in your repo, you’ll notice long sessions with higher failure rates. Start by prompting the agent to populate the knowledge base of Markdown files about your repo, verify the output to ensure correctness, and then set up a continuous improvement agent so that knowledge base memory is updated after each agent session. Things start to compound rapidly.

If you’re doing something repeatedly with an agent, wrap it into a reusable skill and share it. Team skills compound the same way code libraries do. If the same agent failure shows up twice, promote the correction into a reusable skill, test, eval, prompt, or workflow with a clear trigger.

3. Plans are the real work

When you invest in a good plan, the agent can often one-shot the implementation. Shift your energy from typing out code to shaping clear, scoped roadmaps. Human judgment lives in the plan. Execution runs on autopilot.

4. Prototypes replace detailed PRDs

Learning by doing should become the default. Experiment with live demos and prototypes to establish ground truth and guide your decisions before you commit to building. Think demos, not memos.

5. Taste, not time, is the crucial limited resource

When building is cheap, the discipline of deciding what’s worth building becomes more critical, not less. Product judgment and prioritization are the highest-leverage skills on the team. Additionally, with near-zero cost to prototype, deciding what to build now includes seeing concrete options upfront.

6. Tackle the important but overlooked

AI-forward velocity reclaims bandwidth for critical, high-value engineering debt and operational tasks that usually get pushed aside. That includes finding and fixing high-value Sentry errors, repairing broken telemetry dashboards, running sentiment analysis on dogfooding feedback, and operationalizing more rigor in your data quality.

7. Tests are your safety net

At high velocity, good test coverage is what prevents you from shipping regressions. Invest in test quality the same way you invest in feature velocity.

8. Don’t let code review become a bottleneck

When small teams can ship hundreds of PRs every month, staying close to the codebase requires deliberate effort. Start using and trusting agentic code review tools like CCR, shifting first-pass code reviews to agents, and keeping humans in the loop for architectural oversight.

9. Everything’s changing, not just code

Build pipelines, verification, triage, planning, and team rituals all need to evolve. Longer shipping cycles gave you more time to uncover bugs. Shorter shipping cycles and a higher pace of code turnover decrease that buffer. As AI-assisted PR volume increases, you need automated, fast quality gates to keep up.

10. Code is disposable

Don’t be afraid to throw code away or rewrite it. For well-bounded features, the spec might become the durable artifact. And because code is disposable, you can be brutally honest in review and ditch what doesn’t serve the product. Embrace an egoless culture.

Bonus: Another word on taste

In the agentic AI era, taste is the ultimate differentiator and should be the sharpest tool in your arsenal. But don’t just set it and forget it. Keep coming back and feeding the flywheel. Human taste can be captured once in curated examples, then enforced continuously across every agent trajectory. The compounding improvement loop means that as you give feedback, the bar keeps rising.

That’s how you get high-quality code without humans writing every line.

Microsoft Scout: From personal project to enterprise-ready personal agent

Early this year, OpenClaw demos were seemingly everywhere, though many seemed to amount at best to a cool party trick (“Look, my agent ordered a pizza”). But it got long-time Microsoft employee Omar Shahine thinking: How useful could claws actually be?

Very, it turned out. In his spare time, Shahine created Lobster, a personal AI assistant built on OpenClaw. It has its own Apple ID and email address, so he can text with it from any device with iMessage. He initially split Lobster into a trio of agents, each with its own security profile and tool access (eight weeks in, that number had increased to nine always-on agents). Lobster handles travel logistics, proactively sends family reminders, and generally helps Shahine and his family stay organized and get things done. And after Shahine presented Lobster to Microsoft’s AI Accelerator group, it landed him a new job: bringing OpenClaw to M365 and the cloud as CVP of what was dubbed “Project Lobster.”

At the same time, Microsoft Member of Technical Staff Jakob Werner was pursuing a similar idea with a twist: a desktop app-based agent inspired by OpenClaw. The goal was to deliver a powerful enterprise-secure personal AI assistant that anyone within Microsoft could use. In just a couple of weeks, what was referred to internally as “Clawpilot” had already been downloaded by thousands of Microsoft employees, and that community continues to grow.

When Shahine started assembling a small team of enthusiastic builders—Ocean’s 11, naturally—Werner quickly joined their ranks. The two recently caught up in Redmond, Washington, to compare notes on building these always-on, autonomous agents and navigating the worlds of enterprise security, agentic memory, and more.

Embracing the spirit of open source

The Project Lobster team is representative of a new way of working within Microsoft, fueled by AI advancements. It’s a tight-knit group that prefers to collaborate asynchronously. There’s a general consensus against meetings. Everyone contributes to the codebase, including Shahine. And there’s no traditional executive assistant among their ranks: Each team member actively uses prototypes throughout the day to fully immerse themselves in the tech as they’re building it. There’s even a growing open-source community around the team that mirrors what’s found with open-source projects outside Microsoft’s walls.

“I’ve never seen a project inside the company where so many people showed up with their ideas and their code and did the work to produce a PR,” says Shahine.

In fact, internal excitement around Project Lobster has been such that the team fielded pull requests (PRs) left and right during the early building phase, which they reviewed to determine whether they met the bar to make it into the product. Even some of Shahine’s changes didn’t make the cut. The focus had to remain on the central goal of the product: creating an always-on personal agent for work, an AI helper that learns your goals, adapts to your daily work patterns, and acts with context, identifying issues before they surface, keeping projects on track, and driving outcomes without constant input. Such an agent can detect when a calendar is overbooked and propose specific changes before the week begins, or identify when a decision is stalled and draft a targeted follow-up to unblock it.

“We have to determine if a given PR changes the central idea of the product or not—and the speed of that review is human speed, not AI speed,” notes Werner. “Anyone can make a PR super quickly now. We’re trying to help the community and teach contributors how to review PRs.”

While the work began as an internal experiment, it quickly turned into a customer-focused effort that’s culminated with the introduction of Microsoft Scout—an always-on personal agent powered by OpenClaw open-source technology.

From experiment to enterprise-ready product

Microsoft Scout operates autonomously—with its own identity—acting on your behalf. It works across cloud, desktop, and web browser, so it can connect across the surfaces you use—Teams, Outlook, OneDrive, and SharePoint—and the systems where work lives, including email, calendar, and contacts.

Unlike your average claw in the wild, Microsoft Scout combines OpenClaw code with enterprise identity, governance, and security. Every package is ingested through a curated, signed Microsoft supply chain, and every tool call, model request, and network hop is mediated by a zero-trust runtime—the agent’s container is treated as untrusted, with Microsoft-controlled identity, tokens, and policy sitting outside it. With Agent 365, admins get a single control plane, and Microsoft Purview gives security teams the same compliance and DLP signal they already get from other M365 surfaces.
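To make the mediation idea concrete, here is a minimal Python sketch of a policy check that lives outside the agent. It is an illustration only, not Microsoft Scout's runtime: the names `ToolCall`, `Mediator`, and `PolicyDenied`, the allowlist format, and the two toy tools are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class ToolCall:
    """A request coming out of the untrusted agent container (hypothetical shape)."""
    agent_id: str
    tool: str
    argument: str


class PolicyDenied(Exception):
    """Raised when the policy does not permit the requested tool."""


class Mediator:
    """Sits outside the agent and holds the policy, the real tools, and the audit log."""

    def __init__(
        self,
        allowed: Dict[str, Set[str]],
        tools: Dict[str, Callable[[str], str]],
    ) -> None:
        self._allowed = allowed  # agent_id -> names of tools it may call
        self._tools = tools      # tool name -> implementation the agent never sees
        self.audit_log: List[Tuple[ToolCall, bool]] = []

    def call(self, request: ToolCall) -> str:
        permitted = request.tool in self._allowed.get(request.agent_id, set())
        # Record every attempt, including the ones that are refused.
        self.audit_log.append((request, permitted))
        if not permitted:
            raise PolicyDenied(f"{request.agent_id} may not call {request.tool}")
        return self._tools[request.tool](request.argument)


# Toy setup: this agent may read the calendar but may not send email.
mediator = Mediator(
    allowed={"agent-1": {"read_calendar"}},
    tools={
        "read_calendar": lambda day: f"2 meetings on {day}",
        "send_email": lambda to: f"sent to {to}",
    },
)
```

In this sketch the agent only ever produces `ToolCall` requests. The decision to run a tool, and the record of that decision, stay with the mediator, which is the property the zero-trust description above relies on.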

“It’s a super powerful tool,” acknowledges Werner. “And to be enterprise secure, we needed to make sure the data governance was right, that the privacy was right, and that it doesn’t cancel a meeting and send all your personal information to that email chain. If I send my agent to you, it shouldn’t tell you everything about me. These areas are possible to contain, but we also had to do it in a balanced way that doesn’t restrict the possibilities down to nothing.”

It’s a tradeoff worth making. And with Microsoft’s tried and trusted enterprise security offerings and ongoing research and innovation in the space, the team had a solid foundation from which to address the challenge.

The role of agentic memory

In order for an always-on personal AI agent to be truly useful, it needs to be proactive—and that requires context powered by Work IQ. Over time, Microsoft Scout understands the way you work, uses the same productivity tools you use, and takes things off your plate without the need for constant prompts. It learns your goals, adapts to your daily work patterns, and acts with intent. Unlike previous technological waves, this is software that’s truly personalized. That’s transformative, but it’s not without tradeoffs.

“OpenClaw, Claude Code, GitHub Copilot CLI, these are agentic coding harnesses that are basically remembering—writing things down just like people do,” Shahine notes. “They write things down like a diary. But just like it needs to remember things, it needs to forget some things, too.”

As an example, Shahine points back to the introduction of memory to ChatGPT. He spent some time telling ChatGPT that his daughter was 17 while his son was 13. But a year later, that information remained static. The system didn’t have a concept that some facts need to change over time, while other pieces of information—like your name—will stay exactly the same.

“In the design phase, I was thinking about the human and how humans memorize things,” says Werner. “I forget things that are irrelevant because I didn’t use them. So I built a system where, if I’m going to use it repeatedly, it’s going to stick. But if I’m not going to use it regularly, I want the system to forget. I don’t want to have an infinite diary of things, right? So there’s kind of layers of memory, and it kind of disappears over time if it’s not used. Meanwhile, the relevance of other pieces of memory grows as you use them more.”
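One way to picture the behavior Werner describes is a store where relevance decays with time and is reinforced by use. The Python sketch below is an assumption-laden illustration, not his implementation: the class names, the 30-day half-life, and the forgetting threshold are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MemoryItem:
    text: str
    strength: float = 1.0   # grows each time the memory is used
    last_used: float = 0.0  # time of the most recent use, in days


@dataclass
class DecayingMemory:
    half_life_days: float = 30.0  # assumed: unused relevance halves every 30 days
    forget_below: float = 0.1     # assumed: items under this relevance are dropped
    items: Dict[str, MemoryItem] = field(default_factory=dict)

    def remember(self, key: str, text: str, now: float) -> None:
        self.items[key] = MemoryItem(text=text, last_used=now)

    def relevance(self, key: str, now: float) -> float:
        item = self.items[key]
        elapsed = now - item.last_used
        return item.strength * 0.5 ** (elapsed / self.half_life_days)

    def recall(self, key: str, now: float) -> Optional[str]:
        """Using a memory reinforces it and restarts its decay clock."""
        item = self.items.get(key)
        if item is None:
            return None
        item.strength = self.relevance(key, now) + 1.0
        item.last_used = now
        return item.text

    def prune(self, now: float) -> List[str]:
        """Forget every item whose relevance has decayed below the threshold."""
        forgotten = [k for k in self.items if self.relevance(k, now) < self.forget_below]
        for key in forgotten:
            del self.items[key]
        return forgotten
```

With these numbers, an item stored on day 0 and never touched falls to a relevance of about 0.06 by day 120 and is pruned, while an item recalled on day 60 is still at about 0.31 on day 120 and survives.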

Forming a new center of gravity

When they first joined forces, Werner introduced Shahine to the concept of gravity—the framework around which he operated.

“To build a truly great product, I don’t think I can make it myself,” Werner explains. “We need to collaborate with other people. But how do we influence other people to collaborate with us? And the mindset I use and try to instill in my team is gravity. We build something and make it so big in influence—not in the number of features, but in its influence—that when exciting new ideas pop up, they want to try and join the gravity of our work rather than dissolve focus.”

“And I didn’t really know what you were talking about until my new role was announced,” admits Shahine. “But since then, I’ve received hundreds if not thousands of messages from people who want to help, people who want to learn, people who want to show me what they did, and customers who want to know ASAP when they’re going to get their hands on what we’re building. There are a lot of other words for that—user pull, signal—but your mantra of gravity really resonates with me now.”

Microsoft employees have already been using an early Microsoft Scout desktop experience. We built this to learn how always-on agents show up in real work, and we’re seeing it take on coordination, surface risks earlier, and keep work moving without constant prompting.

We’re now extending that early experience to Frontier organizations. Microsoft Scout is available as an experimental release through Frontier, giving customers a chance to explore how it can fit into their own workflows.

Access requires Frontier enrollment, Intune policy configuration, and an opt-in attestation. Users with a GitHub Copilot license can then download and install the experience.

How Excel got agentic: An interview with Microsoft Director of Science Mukul Singh

When Mukul Singh made the jump from pure research into product, it was a leap of faith. But he had an idea that he wanted to bring to life: delivering agentic AI capabilities in Excel.

While this was well before buzzwords like “the agentic AI era” had cultural cachet, the research was already headed in that direction. So armed with a prototype and a healthy dose of ambition, Singh made his pitch.

Two years on, he’s fully transitioned from his role in Microsoft Research to a new gig in the Office Product Group and successfully delivered the ability to edit with Copilot in Excel (previously known as Excel Agent Mode). Not ones to rest on their laurels, the team quickly went on to ship agentic capabilities in PowerPoint, Word, and Outlook. It’s changing the way people get work done—and it started with the hypothesis that Excel, at its heart, is a low-resource programming language.

We sat down with Singh to learn more about his journey from research to product, the science and research behind Office’s new agentic capabilities, and why Excel was the perfect testbed for agentic AI.

Command Line

To kick us off, tell us a little about yourself.

Mukul Singh

I’m a researcher in the Office Product Group team. I recently started a science team focusing on agents and AI for the Microsoft Office product portfolio. I was originally in Microsoft Research, working on research full-time, publishing papers, and very far away from the product world. To be honest, it’s been an incredible journey getting to go from research to then being embedded in the product space.

Command Line

How long ago did you make that transition?

Mukul Singh

It was only two years ago—and right at the cusp of when the Excel Agent Mode work started. In fact, that was the catalyst. I had no intention of ever moving out of academia and deep research. And, you know, Microsoft Research is one of the best labs in the industry that’s still connected to academia. So I was pretty okay with my life. I didn’t want anything else. It was the most perfect blend that I could hope for.

And then this project in Excel started, and there were some initial discussions. I thought it sounded interesting, because one of the things in research that you always feel is missing as a gap is that you don’t see your work actually delivering value. You see it deliver a lot of theoretic value—you see it shape the direction of the world, per se. Other people might take your research direction and extend it in meaningful ways. Research does shape society in a way, but you don’t see any immediately measurable impact.

So at that point, I felt like I wanted to work on something where I could see that happen—where I could watch the impact unfold in front of me.

Command Line

What are some of the research questions that ultimately led to Excel Agent Mode (which we’re now calling edit with Copilot in Excel)?

Mukul Singh

My research in MSR and all of my papers previously had all been about AI for low-resource programming languages. I like to explain low-resource coding languages as languages that are very obscure and make up less than like 0.1% of the entire coding community. So AI models are generally bad at them because they just learn off the internet and known sources.

To give you an example, internally at Microsoft there’s a language called Kusto Query Language (KQL). That’s just used for telemetric querying. Now, it’s a public language—it’s published. But no one outside of Microsoft uses it because why would they? We designed it for our database systems. So that’s the type of languages that the models are bad at.

All of my research was looking at how we could make models good at these languages, which they are not naturally good at. We need to drop in hints, cues, documentation, give it retry mechanisms and everything.

When I was initially approached about this work in Excel, I was at first very skeptical of how that and my work might be related. I’ve done a lot of tabular research, sure, but it’s not Excel. But then they drew the parallel that, actually, Excel is just another low-resource programming language. It has its own internal language. It has its own engine. It’s just that the model doesn’t know it.

Command Line

Yeah, in your LinkedIn post, you talked a little bit about how Excel functions like a low-resource programming language. What really surprised you about the project when you dug in?

Mukul Singh

So the vision originally for Excel Agent Mode—and, by the way, kudos to the team and their thinking. This was before any agents. Today, everything is agents, right? You can automate, you go to an app and assume that there’s a button somewhere that will do it for you. But that didn’t use to be the case. Excel, I just didn’t think of it as an AI-forward app. I thought of Excel as an app that adapts, right? That doesn’t need AI. But the vision at that point was that the team wanted to automate end-to-end user workflows.

If I open Excel, anything I have in mind that I want to do, this agent I’m building should be able to do it. That was a very difficult vision to achieve. In fact, that’s the reason that this project took so long and had so many setbacks—because we set our ambition up so high, before the models even had the capability to be good enough to do anything like this at scale.

Say you want to add a pivot table, right? That’s a common task. I don’t know how to do it, but I know people do it. It’s just like a five- to 10-line code snippet that, wherever someone clicks the button, “Create pivot table,” internally that code is run. Now, we could connect the model and it could generate all of that code, which is just some random JSON and JavaScript objects. To us, it’s like gibberish text, but it’s completely deterministic and exactly controls the behavior. And the good thing about Excel is that all of its surface is programmable.

This is not true for all apps, by the way. Very few of Microsoft’s apps are truly programmable, but Excel—you have to hand it to the engineers 40 years ago when they were setting it up: They made sure that everything is programmable end-to-end.

Command Line

What are some of the twists and turns that the work has taken over the last two years?

Mukul Singh

When we started this project, the vision really hit home. We had insane videos that the product managers and designers made, showing how the product is today. But this was two years ago, right? And the videos were the same.

So you see the disconnect was that everyone, leadership bought in. Like, this is the future. We want to invest in this. It’s the right thing to do. And we got a very strong crew of people. I was brought in. We were all working on it, but the reality was that the models weren’t good enough. This was before the age of the reasoning models. The best model we had to work with was 4.0, and anyone who has played with models knows that 4.0 just cannot do long chain tool calling.

It used to collapse after a couple of calls, and it was only able to do things like start a formula, start a problem, which at that time was still cool and—bless our marketing team, they were able to make that such a good feature with the help of design and everything. But even that, to us, felt like, “Oh, we’re falling short.” And they were like, “No, this is still great.”

The industry just wasn’t here yet. That was, I feel, the weakest moment of the project where there was this doubt. Like, is this just too aspirational? Is it just not possible? Did we predict the future wrong, or will the models not even get there for like 20 years? We just didn’t know.

We had promised a lot to customers on the backing of these researchers’ point of view, and now we were in a space where we didn’t feel confident that we could deliver. There were debates about whether we should just cut this project and rebrand it into something completely different.

So it was indeed quite the journey. We were shooting way above our weight class with very little evidence other than just the intuition of maybe three people who said, “We think this is where the world will end up.”

Command Line

When did you get a sense that the models were headed in the right direction in terms of reasoning and long chain tool calling?

Mukul Singh

I think that’s a very good question. When OpenAI announced their first series of reasoning models, that was a pivotal moment for us. We tried the model, and it couldn’t solve any of the promised scenarios we had. But the traces showed signs of life. Like in the pivot table example: It tried to generate the pivot table code. It ran it—it just gave back an error that the area where you’re trying to create the pivot table already has data in it.

Now, the previous models at this point would either keep looping in this same error, just give up, or do some other random gibberish text response and not recover. But o1 tried to recover. That was the first sign of life that, OK, this is now working in a chain. It does something, looks at feedback, and tries to recover and keep trying until it succeeds.

It went up to like 20, 30 tool calls, still not able to solve real complex tasks. But we thought that if we reinforced it with the right information, it might be able to do something.
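The recover-and-retry behavior Singh describes can be sketched as a small loop in Python. Everything here is a hypothetical stand-in: `run_with_recovery`, `ToolError`, the toy `create_pivot_table` tool, and the scripted "model" are invented for illustration and are not the Excel team's code or any real API.

```python
from typing import Callable, Iterable, Optional, Set, Tuple


class ToolError(Exception):
    """Error reported by the environment when a tool call fails."""


def run_with_recovery(
    propose: Callable[[Optional[str]], str],
    execute: Callable[[str], str],
    max_calls: int = 30,
) -> Tuple[str, int]:
    """Call a tool, feed any error back to the proposer, and try again."""
    feedback: Optional[str] = None
    for attempt in range(1, max_calls + 1):
        action = propose(feedback)
        try:
            return execute(action), attempt
        except ToolError as err:
            feedback = str(err)  # the next proposal gets to see what went wrong
    raise RuntimeError(f"gave up after {max_calls} tool calls: {feedback}")


# Toy environment: cells that already hold data.
occupied: Set[str] = {"A1", "B1"}


def create_pivot_table(destination: str) -> str:
    if destination in occupied:
        raise ToolError(f"destination {destination} already contains data")
    occupied.add(destination)
    return f"pivot table created at {destination}"


def make_toy_policy(candidates: Iterable[str]) -> Callable[[Optional[str]], str]:
    """Stand-in for a model: a real one would read the feedback text;
    this one simply moves on to its next scripted candidate."""
    remaining = iter(candidates)

    def propose(feedback: Optional[str]) -> str:
        return next(remaining)

    return propose
```

Calling `run_with_recovery(make_toy_policy(["A1", "B1", "D1"]), create_pivot_table)` fails twice on occupied cells and succeeds on the third call. The `max_calls` cap mirrors the point that the chain length is bounded: a proposer that keeps repeating the same failing action ends in an explicit error rather than an endless loop.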

Command Line

This is backtracking a little bit, but when you ran up against the limitations of the models—when it seemed like maybe you were headed down the wrong path and the research had it wrong—what kept you going?

Mukul Singh

I’ve been in the space of generating low-resource programming languages for a while. I had even published papers for Excel in general, so I know the ecosystem. Everything’s built on top of Excel. So based on all of the research, we had a very strong intuition that this is where the models were headed. And our parallel was very simple: that coding agents, like people using GitHub Copilot and VS Code, and Excel should intuitively be no different.

Their implementation and all the UX decisions are different, but they are the same in principle—it’s a surface in which I do manual work. I want that automated, and I’m doing that through code. Our parallel was always that Excel is an IDE. People don’t see it as an IDE. They see it as an app, but that’s just because it’s presented as an app. It’s really just an IDE. For analysts, it’s the equivalent of VS Code to a developer.

Command Line

You’ve been talking about this, but I’m going to ask the question directly: What is it that makes Excel a really solid testbed for agentic AI?

Mukul Singh

I truly believe that Excel is the first large-scale enterprise app that was actually automated by an agent—meaning that the agent can take meaningful action in the app. So it doesn’t have fixed workflows. It’s actually controlling it. And what makes Excel the perfect testbed is that it’s programmable. All of it can be controlled by the agent.

It’s not just code, because that code has real consequences in the app that you can see. If I add a chart, I see a chart there, right? If I connect a formula to a different sheet, I see live updates. So it’s kind of a living environment. It’s a live environment in which I’m making changes. And if I make changes, the environment will undergo some change. And I’ll use that as feedback, so there’s observability. And that is the best testbed.

An agent needs an action space that allows you to interact with an environment. And it needs feedback from the environment. Now, this seems trivial, but honestly, very few apps fit this charter perfectly. So this loop made Excel very interesting. All of the research across everything we’ve discussed, it’s been over a span of 17 or 18 research papers all going back two to three years. When someone looks at it at first, they might not even connect the dots. But there’s like three papers on just the best representation of data for Excel sheets. And people will think, “Oh, that’s probably for storage and optimization and context querying.” But no, it’s just that we need a concise representation for the model.
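The pairing of an action space with observable feedback can be sketched with a toy environment. This is an assumption-laden illustration, not Excel's API: `ToySheet`, `set_cell`, and `observe` are hypothetical, formulas are limited to sums of cell references such as `=A1+A2`, and circular references are not handled.

```python
from typing import Dict, Union

Cell = Union[float, str]  # a number, or a formula string such as "=A1+A2"


class ToySheet:
    """Hypothetical stand-in for a spreadsheet: the environment an agent acts on."""

    def __init__(self) -> None:
        self.cells: Dict[str, Cell] = {}

    def set_cell(self, ref: str, value: Cell) -> Dict[str, float]:
        """Action: write a value or formula, then return the resulting observation."""
        self.cells[ref] = value
        return self.observe()

    def evaluate(self, ref: str) -> float:
        """Compute one cell; empty cells count as 0, formulas are sums of references."""
        value = self.cells.get(ref, 0.0)
        if isinstance(value, str) and value.startswith("="):
            return sum(self.evaluate(term.strip()) for term in value[1:].split("+"))
        return float(value)

    def observe(self) -> Dict[str, float]:
        """Feedback: the computed value of every populated cell."""
        return {ref: self.evaluate(ref) for ref in sorted(self.cells)}


sheet = ToySheet()
sheet.set_cell("A1", 2)
sheet.set_cell("A2", 3)
first = sheet.set_cell("A3", "=A1+A2")   # A3 is computed from A1 and A2
second = sheet.set_cell("A1", 10)        # changing A1 shows up in A3 as well
```

Every action returns the full computed state, so a change to `A1` is immediately visible in the dependent cell `A3`. That is the kind of live, inspectable feedback the answer above refers to.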

Excel actually showed the way for a lot of the apps in the industry in general that this is how an automation pattern works. At least within Microsoft: After Excel, PowerPoint followed. Then Word followed, the same pattern. And Outlook was recently announced—same pattern, same research crew delivering all of that, one after the other, just because all of the fundamentals are there.

Command Line

So your team was involved in expanding the editing with Copilot functionality across the Microsoft Office product suite?

Mukul Singh

Yeah. After my stint at Excel, I kind of figured out that, you know what? This is kind of fun. In fact, it’s very similar to research where you get a vague abstract problem, you have no guidance, and you just try to go solve it. Talk to people. Figure things out. It’s like the same environment, just the difference is that when it actually ships, you feel a sense of accomplishment, and you can point people to it. Like I can tell an Excel guru, “Open that side panel. Try a query.” You’ll be amazed, and I built that.

So the moment that Excel was in a stable state, we handed it off back to the feature team—the PMs and designers—to run with it. And then we moved on to the next surface, PowerPoint, and that seemed even more interesting. Ever since I moved into a Director of Science role inside the Office Product Group, where I’m overseeing all of the agent work, I’ve realized that it’s all based on very similar patterns. And it just so happens that all of my research aligned very closely with where the industry headed because, let’s be honest: Eight years ago, when I was starting my research for low-resource programming languages, at that time, there were just embedding models. At the time, it just seemed interesting. We had no idea the shape that agents would take.

Going into this field, the models were already good at good languages, so what could I contribute there? The only place where I could meaningfully contribute was with the things the models were bad at. So I literally made a list of everything these models struggled with, and it just stood out that programming seemed useful. If it can write good code, someone somewhere will be happy.

Command Line

Were there any hiccups along the way? Did you run into any unexpected technical challenges?

Mukul Singh

At first, we ran into the faster horses trap. We’d show customers these videos and ask, “As an analyst, will you be more productive with this tool?” And everyone said, “It needs to come back with an answer in 30 or 40 seconds. That’s how long I can wait. There’s no way in the world that I’m going to sit there for 10 minutes and leave it running. No one works like that. I’d cancel that operation after 30 seconds.”

Out of every customer we talked to, not a single person had ever said that they were willing to go beyond one minute of wait time.

Command Line

Right.

Mukul Singh

So we were optimizing for that. It needs to complete faster. If the model is making three errors and recovering, that’s not good enough. It shouldn’t make those three errors. And that sent us down paths of optimization that are very tricky. Like today, they’re still not solved—they’re that difficult. But we thought that was non-negotiable for the product.

Then this company called Shortcut came out, and their app ran for 25 minutes on a single query. But the results were amazing. Initially people said, “Who’s going to wait 25 minutes?” But within a week, everyone said, “I now just log in, push the query, and just leave. I don’t need to do like anything.”

And we were like, what? But for every one of our customers that we talked to, their mindset immediately changed. They were asking for faster horses, but then someone delivered them a car. And now they understand that that’s exactly what they needed.

Command Line

That’s really interesting.

Mukul Singh

The challenges were really around customer expectations. We went with the assumption that all our apps are IDEs. But the hypothesis starts to break down because people use these apps with nuance. And the moment we fail to understand that, it goes from a good to a bad experience. For Excel, it was that people are willing to wait for long-running tasks. But they need auditability. It can’t just work behind the scenes—it has to show its work. It needs to be computed on the sheet.

Similarly, for PowerPoint, people really care about their brands and templates. Initially the version of the PowerPoint agent that we shipped, which if you asked me to rank them, I would say it’s the best version that we built—that was the one that everyone hated. We had optimized for a PowerPoint agent that works with its own independence, like in a vacuum—because that’s the way I use PowerPoint. I start from scratch, no templates. I just have some information that I need to present, and I want to find the best way to do that. But there were other very real constraints that we had to understand. So PowerPoint really pivoted to take into account templates and brand guidelines.

Similarly for Outlook, we found that people really care about confirmation. Like with Visual Studio and all these IDEs, they try to optimize for the least user engagement required, because that’s automation, right? You can leave it running and you don’t have to worry about it unless there’s something that you really need to look at. So they really try to cut down on the model asking something back.

But in Outlook, the story is completely the reverse. People actually love it when the model gives them a list of things that it proposes, and you can just accept or delete each of them in a list. So like:

Send this draft email to this person? Accept.
Create this to-do list? Accept.
Delete these seven emails? Reject.

And, again, it was an anti-pattern.
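The propose-then-confirm pattern in the Outlook example can be sketched as follows. This is a minimal illustration under stated assumptions, not the Outlook agent's implementation: `ProposedAction` and `apply_confirmed` are hypothetical names, and the actions are stubs that only return a string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProposedAction:
    description: str           # what the user sees in the list
    run: Callable[[], str]     # what happens if the user accepts


def apply_confirmed(
    proposals: List[ProposedAction], decisions: Dict[str, bool]
) -> List[str]:
    """Run only the proposals the user accepted; anything undecided is skipped."""
    results: List[str] = []
    for proposal in proposals:
        if decisions.get(proposal.description, False):
            results.append(proposal.run())
    return results


proposals = [
    ProposedAction("Send this draft email to this person?", lambda: "sent draft"),
    ProposedAction("Create this to-do list?", lambda: "created to-do list"),
    ProposedAction("Delete these seven emails?", lambda: "deleted 7 emails"),
]
decisions = {
    "Send this draft email to this person?": True,
    "Create this to-do list?": True,
    "Delete these seven emails?": False,
}
results = apply_confirmed(proposals, decisions)
```

Defaulting to "skip" for any proposal without an explicit decision keeps the destructive step (deleting mail) from running unless the user opts in.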

Command Line

What can you tell us about the tech transfer process and how your research made its way into the product?

Mukul Singh

Yeah, that was actually tricky. I feel like, in general, a lot of people struggle with taking their research and landing it in a product. There’s a very weird gap, because the product teams are very direct about needing to talk to more researchers, and the research teams are keen to get the product group’s attention because everyone wants their work to be used and get funded. But there’s still this weird gap—it’s just not well connected. And the surprising thing is, it’s not because of a lack of will from either side.

It’s because product always comes with so much nuance. The product teams can’t just blindly trust that the research teams are going to understand everything and build it. And for the research team, it’s very hard to put themselves in the product team’s shoes and understand all the constraints. Maybe it needs to run in 10 seconds or less, or it needs to have auditability traces. That’s something that the research teams would feel is a very strange objective.

But at least for Excel, I think the process was smoother because we operated as an integrated team.

Command Line

What was the overall team dynamic like?

Mukul Singh

For Excel Agent Mode, it was a crew of just 10 or 11 people. Everyone had a title, but the titles meant nothing, right? There were PMs checking in PRs. There were devs writing papers. And there were researchers writing specs. There was just everyone on board working as if it were a startup—logged into a conference room and just checking in change after change. That was honestly the most fun time, when it was just people building stuff, coming up with ideas. But then we did do a proper handoff.

Taking the research into the project became easier because everyone discussed everything openly, so everyone understood the same things.

Command Line

What avenues for future research did this work open up?

Mukul Singh

I think a lot. Before this, I don’t think people were really thinking of agents for things that normal people use as apps. That just wasn’t a concept. And Excel has already given way to PowerPoint, Word, and Outlook, which are all entirely different surfaces, all just automating the app completely. There was significant uptake, even outside Microsoft, when people saw that, not only is it possible, but people love it.

This project was a testbed that showed two very distinct things: that it was possible to automate an app through these models, which previously was not the accepted truth, and that there was a lot of appetite for it.

Excel in its early days had a lot of flaws in the agent. But people overlooked a lot of that just because of those scattered moments of brilliance where they were like, “Oh my god, it did something that I couldn’t even have imagined.”

This has opened the door so that, I think over the next couple of years, all of these apps that we use day to day will have an agent—not to replace it, but that will be our go-to whenever we struggle to get something done. It showed the path, that this is possible for other apps, regardless of their complex surfaces.