Microsoft’s Customer Zero blog series gives an insider view of how Microsoft builds and operates Microsoft using our trusted, enterprise-grade IQ platform. Learn best practices from our engineering teams: real-world lessons, architectural patterns, and pressure-tested operational strategies for building, operating, and scaling AI apps and agent fleets across the organization.
The Challenge: Operating a global-scale physical network
Azure operates one of the largest physical networks on the planet. That scale shapes more than infrastructure decisions; it reshapes how operational work itself must be organized.
Hundreds of thousands of kilometers of outside-plant fiber and more than a million optical devices connect datacenters, regions, and Microsoft’s global services. Every customer interaction with Azure ultimately depends on this physical network operating correctly, continuously, and at speed.
As the network has grown, the nature of the problem has shifted. Detection, monitoring, and traffic rerouting are now highly automated and fast. Where teams still struggle is everything that comes next: coordinating physical repairs, tracking progress across systems and vendors, validating outcomes, and keeping work moving until an issue is truly resolved.
Under pre-AI operating models, this scale of coordination demand grew faster than teams could realistically adapt. The constraint was no longer technical routing or signal processing. It was the amount of human attention required to keep distributed work aligned over hours and days.
This is where most operational effort accumulates. When incidents arise, operational overhead starts to compound as more experts need to coordinate:
- Field operations, hardware replacements, and incident remediation take significantly longer when they require multi-company, multi-region coordination.
- Engineers spend a disproportionate amount of time waiting for updates, following up, validating fixes, and translating context across systems.
Unlike previous scale challenges that could be addressed with automation as code, the “messy middle” of operations is non-deterministic by nature. It consists of judgment calls, incomplete information, and asynchronous dependencies. At Azure’s scale, coordination becomes the limiting factor.
The Solution: Treating coordination as a first-class engineering problem
Instead of adding more scripts or expanding brittle automation, we redesigned how coordination work gets done by making AI agents first-class participants in day-to-day operations. In earlier phases of this transformation, it was easy to treat agents as tools, but we have since evolved to embed them as part of the system itself.
This was not an overnight process. We had to evolve the approach over time:
- We started with conversational copilots that enabled engineers and technicians to query device state and telemetry using natural language, reducing friction in day-to-day troubleshooting.
- Eventually, we grew to deploy autonomous workflow agents that take action toward goals for specific operational processes end to end.
Autonomous workflow agents act like digital coworkers, working alongside more than 10,000 employees, including datacenter technicians, network engineers, and hardware engineers. They are assigned goals and can carry context across hours or days to drive scoped work all the way to completion, ranging from fiber remediation and Return Merchandise Authorization (RMA) to orchestrating datacenter deployment. In practice, they are powerful execution engines that minimize human cognitive load, relying on humans only for high-risk judgment calls.
Agents work alongside engineers and technicians inside operational channels like ticket queues, telemetry systems, Teams, and email. This allows them to stay grounded in the same workflows. As we iterate through tasks and feedback loops, we continually manage and govern knowledge bases that agents can rely on for day-to-day operations. We aggregate operational data, runbooks, and institutional knowledge into logical segments that agents can act on with more consistency. Alongside these foundations, Work IQ and Fabric IQ help bolster responses with connections to organizational context as work progresses.
We structure autonomous workflow agents in an agent organization and treat each one like a digital coworker: governed by an internal control plane with defined identity, roles, skills, policies, and auditability.
Roles, permissions, and policy vary by agent class and risk level rather than being applied uniformly, but agent permissions never override human accountability and control. Humans continue to define goals, policy, and success criteria as boundaries. High-risk or irreversible changes require explicit approval from a human expert, and when agents encounter ambiguity or edge cases, they escalate to a human for a decision rather than risk a guess. Agents never take human control out of the loop for actions that affect critical components; our experts still exercise judgment on policy decisions and system design, just with increased focus and decreased noise.
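The gating described above can be sketched as a simple policy check. This is a minimal illustration, not Microsoft's implementation: the risk tiers, action names, and the `request_human_approval` hook are hypothetical stand-ins for the internal control plane.

```python
from dataclasses import dataclass
from enum import Enum

class Risk(Enum):
    LOW = 1
    HIGH = 2  # irreversible or high-impact: requires explicit approval

@dataclass
class Action:
    name: str
    risk: Risk

def request_human_approval(action: Action) -> bool:
    # Placeholder: a real control plane would page an on-call expert
    # and block until they approve or reject the action.
    print(f"Escalating '{action.name}' for human approval")
    return False  # default-deny until a human decides

def execute(action: Action) -> str:
    """Agents act freely inside low-risk boundaries; anything high-risk
    escalates to a human instead of guessing."""
    if action.risk is Risk.HIGH and not request_human_approval(action):
        return "escalated"
    return "executed"
```

Under this sketch, a low-risk action like querying telemetry executes directly, while a high-risk action such as rerouting traffic stays in an "escalated" state until a human approves it.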
Beyond establishing guardrails, this structure also lets us balance agent scale against cost. Think of the agent organization as an auditable, policy-driven inventory of agents. Some agents run for longer periods to perform scheduled checks and search for anomalies, while others are spun up on demand to scale with issues as they happen. As a result, the number of active agents at any given moment tracks directly with the number of open incidents.
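The inventory model can be illustrated with a small sketch, under the assumption (ours, not the source's) of long-running watcher agents plus one workflow agent per open incident; the `AgentPool` class and its fields are hypothetical.

```python
class AgentPool:
    """Illustrative inventory: a few long-running watcher agents plus
    per-incident workflow agents spun up on demand."""

    def __init__(self, watchers: int = 2):
        self.watchers = watchers   # scheduled checks and anomaly search
        self.incident_agents = {}  # incident_id -> agent context

    def open_incident(self, incident_id: str, context: dict) -> None:
        # One workflow agent per incident, carrying full incident context.
        self.incident_agents[incident_id] = {"context": context, "state": "active"}

    def close_incident(self, incident_id: str) -> None:
        # Agents retire when the incident resolves, keeping cost
        # proportional to live work rather than peak capacity.
        self.incident_agents.pop(incident_id, None)

    @property
    def active(self) -> int:
        return self.watchers + len(self.incident_agents)
```

Opening an incident raises the active-agent count by one; closing it returns the pool to its baseline of watchers, so spend follows the incident load.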
To understand how this works in practice, consider a real example of a fiber break impacting Azure infrastructure in Southeast Asia. Once the break was identified, an autonomous agent was instantiated with full context about the incident. The agent corresponded via email and Teams with a regional fiber provider and field technicians across multiple languages and systems, requesting updates on a defined cadence and validating repair attempts against live telemetry. When the technicians’ first fix failed, the agent escalated back to them with clear feedback on why the fix was unsuccessful. The technicians re-attempted the fix based on that feedback and notified the agent, and once testing succeeded, the agent confirmed restoration to all involved parties. All of this happened within the same systems and communication channels our employees use, keeping response times fast and making the recording of updates a native part of the workflow.
This workflow involved roughly 14 interactions over about 9.5 hours, without a human engineer needing to actively manage each step. Engineers remained accountable for decisions and outcomes, but the coordination work progressed continuously without manual follow-up or handoffs from humans. This model does not replace ownership, but it does change how our workforce orchestrates and governs operations.
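The incident loop above can be sketched as pseudo-logic: request an update on a cadence, validate the claimed fix against telemetry, and feed clear failure context back to technicians. The function and its callback parameters are hypothetical abstractions, not an actual Azure API.

```python
def coordinate_repair(telemetry_ok, request_field_update, give_feedback,
                      max_attempts: int = 5) -> str:
    """Hypothetical coordination loop for a fiber-repair incident.

    telemetry_ok: () -> bool       live check that the link is healthy
    request_field_update: () -> str  ask the vendor/technicians for status
    give_feedback: (str) -> None     tell technicians a fix did not hold
    """
    for attempt in range(1, max_attempts + 1):
        status = request_field_update()       # cadence-driven follow-up
        if status == "fix reported":
            if telemetry_ok():                # validate, don't trust blindly
                return f"restored after {attempt} attempt(s)"
            give_feedback(
                f"attempt {attempt}: telemetry still shows loss; please re-check"
            )
    return "escalated to human engineer"      # cadence exhausted: hand off
```

Simulating the Southeast Asia scenario, where the first reported fix fails telemetry validation and the second succeeds, the loop sends one feedback message and then confirms restoration on the second attempt.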
The Impact: What changes when agents own coordination
Several changes follow when agents become the default coordination layer across incidents, repairs, and workflows:
- Coordination across vendors, regions, and systems becomes consistent and continuous.
- Updates are validated against live telemetry instead of being trusted blindly.
- Failed actions are detected earlier and retried until success criteria are met.
- Long-tail incidents caused by stalled handoffs are significantly reduced.
This is what leads us to transformational outcomes that allow agents to help our teams scale work to a previously unimaginable size:
- 2x faster time to mitigate on issues like fiber-repair workflows.
- Up to a 78% reduction in manual effort by offloading operational toil to agents.
Human and agent work happens in the same channels in tandem, making information parity and actionable handoffs nearly instantaneous. Engineers remain firmly in the loop, but they no longer have to micromanage each step. Instead, they guide outcomes, intervene on edge cases, and shape how systems respond over time.
But ultimately, embedding agents into everyday work is about creating systems that learn. Beyond reactive response, agents also act as a second set of eyes to surface recurring issues and weak signals that inform both day‑to‑day operations and future design in our networks and datacenters. With each iterative cycle through feedback loops, the network gets stronger while agents get smarter in fixing and preventing issues.
Key learnings and transferable practices
Our experience taught us several lessons for designing, operating, and governing global-scale systems:
Start with conversational agents, then evolve them to take action. Agents create real leverage when they hold context over time, stay with an issue, and close loops without constant prompting. That removes waiting and handoffs while keeping ownership clear.
Engineer agents to work in common, accessible channels. At enterprise scale, simplified coordination is critical to delivering timely, correct outcomes. Explicitly embedding agents in the same communication and record systems keeps action faster, smoother, and clearly recorded.
Define guardrails and agent organizational policies clearly and early. Clear approval points and role-specific permissions allow agents to act with confidence inside known boundaries. Humans remain accountable for decisions that have lasting impact.
Measure impact where operations feel it. The clearest signals are the same signals we have always strived for: faster mitigation, fewer stalled incidents, and shorter time to repair. To capture agentic impact, measure how much of the process agents complete autonomously on the way to those outcomes.
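One simple way to capture the "how much agents complete autonomously" measure is the fraction of workflow steps handled without human intervention. The function below is an illustrative sketch; the step labels and the agent/human split are assumptions, not a metric the source defines.

```python
def autonomous_share(steps) -> float:
    """Fraction of workflow steps completed without human intervention.

    `steps` is a list of (step_name, handled_by) pairs, where handled_by
    is "agent" or "human".
    """
    if not steps:
        return 0.0
    agent_steps = sum(1 for _, who in steps if who == "agent")
    return agent_steps / len(steps)
```

For example, a hypothetical workflow where agents handled three of four steps would score 0.75, and tracking that share alongside time-to-mitigate shows whether agent autonomy is actually driving the outcome metrics.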
Looking forward
We’re continually evolving this model with a focus on responsible scaling, including governance, trust, and cost discipline.
Over time, this approach is shaping systems that not only recover faster but also learn from operational signals and feed those lessons back into how the network is designed and operated. Ultimately, we’re making a system that has a new level of autonomy in managing and healing itself.
The broader takeaway is not about a specific platform or product. It is about what becomes possible when AI agents operate with humans inside real production systems. With agents as an engine of scale against operational toil, humans are able to coordinate expertise at a faster pace in broader spaces than ever before while still remaining in control of direction and outcomes.
You can learn more about agents and Azure Networking at: