Introduction
In the ever-evolving landscape of cloud computing, reliability remains paramount. As workloads scale and businesses depend on uninterrupted service, Azure continues to invest in technologies that enhance system resilience and minimize customer impact when failures occur.
Azure Compute infrastructure operates at an unmatched scale, with certain Availability Zones (AZs) hosting nearly a million Azure Virtual Machines (Azure VMs) that run customer workloads. These Azure VMs depend on a sophisticated ecosystem of physical machines, networking infrastructure, storage systems, and other essential components. When failures occur at any of these layers—whether from hardware malfunctions, kernel issues, or network disruptions—customers may experience service interruptions. To address these challenges, Azure Compute Repair Platform plays a vital role in identifying, diagnosing, and applying mitigation strategies to resolve issues as quickly as possible.
To further improve our ability to diagnose and resolve failures swiftly and accurately, we present a novel approach: real-time kernel dump analysis, which identifies the root cause of an issue and enables precise, data-driven repairs.
This capability complements the detection and mitigation strategies we already use. It is generally available in all Azure regions and benefits all customers, including our most critical ones.
Real-Time Failure Diagnosis and Repair
We have developed a novel approach to diagnosing and mitigating failures in Azure Compute infrastructure: we collect and analyze Live Kernel Dumps (LKD) in real time to understand the state of the kernel on the Azure Host Machine. This lets us pinpoint the exact kernel issue and use that insight to choose a precise repair action, rather than applying a broad set of mitigation strategies. By cutting trial-and-error repair attempts, we significantly reduce downtime and accelerate issue resolution.
Kernel dumps can help detect critical issues such as kernel panics, memory leaks, and driver failures. Kernel panics occur when the system encounters a fatal error, causing the kernel to stop functioning. Memory leaks, where memory is not properly released, can lead to system instability over time. Driver failures, often caused by faulty or incompatible drivers, can also be identified through kernel dump analysis.
Importantly, it is the Repair Platform that triggers LKD collection and then consumes the LKD analysis to make informed decisions. By incorporating live kernel dump analysis into our mitigation workflows, we enhance Azure’s ability to quickly diagnose, categorize, and resolve infrastructure issues, ultimately reducing system downtime and improving overall performance.
Architecture
The system works as follows:
- Dump Collection: When an issue is detected, the Repair Platform triggers the collection of a Live Kernel Dump (LKD) on the machine hosting the affected Azure VM.
- Dump Upload: An agent running on the machine monitors a designated storage location for newly generated dumps. When a dump is detected, the agent uploads it from the Azure Host Machine to an online Analysis Service.
- Failure Classification: The Analysis Service processes the uploaded Live Kernel Dump (LKD), diagnoses the root cause of the failure, and categorizes it accordingly—for example, identifying a networking switch in a hung state.
- Persistence: The Analysis Service generates a detailed failure message and stores it in an Azure Table for tracking and retrieval.
- Automated Repair Decisions: The Repair Platform continuously monitors the Azure Table for failure messages. Once a failure is recorded, it retrieves the data and makes an informed repair decision.
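The five steps above can be sketched as a small in-memory simulation. This is only an illustration of the control flow, not the production implementation: every class, method, and category name here is hypothetical, a plain dictionary stands in for the Azure Table, a list stands in for the monitored dump location, and the classification rule is a toy string match rather than real kernel analysis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# All names below are hypothetical stand-ins, not Azure APIs.


@dataclass
class LiveKernelDump:
    host_id: str
    contents: str  # stand-in for the binary dump data


@dataclass
class FailureMessage:
    host_id: str
    category: str
    detail: str


class AnalysisService:
    """Classifies an uploaded dump and persists a failure message."""

    def __init__(self, table: Dict[str, FailureMessage]):
        self.table = table  # stand-in for the Azure Table

    def classify(self, dump: LiveKernelDump) -> str:
        # Toy rule; real classification would inspect kernel state.
        if "vmswitch hung" in dump.contents.lower():
            return "VmSwitchHung"
        return "Unknown"

    def process(self, dump: LiveKernelDump) -> None:
        category = self.classify(dump)  # failure classification
        self.table[dump.host_id] = FailureMessage(  # persistence
            host_id=dump.host_id,
            category=category,
            detail=f"{category} detected on {dump.host_id}",
        )


@dataclass
class HostAgent:
    """Watches a dump location on the host and uploads new dumps."""

    service: AnalysisService
    dump_location: List[LiveKernelDump] = field(default_factory=list)

    def upload_new_dumps(self) -> int:
        uploaded = 0
        while self.dump_location:
            self.service.process(self.dump_location.pop(0))
            uploaded += 1
        return uploaded


class RepairPlatform:
    def __init__(self, agent: HostAgent, table: Dict[str, FailureMessage]):
        self.agent = agent
        self.table = table

    def trigger_dump(self, host_id: str, contents: str) -> None:
        # Dump collection: the dump lands in the location the agent watches.
        self.agent.dump_location.append(LiveKernelDump(host_id, contents))

    def poll(self, host_id: str) -> Optional[str]:
        # Automated repair decision based on the persisted failure message.
        message = self.table.get(host_id)
        if message is None:
            return None  # analysis not available yet
        if message.category == "Unknown":
            return "broad-mitigation"
        return f"targeted-repair:{message.category}"


if __name__ == "__main__":
    table: Dict[str, FailureMessage] = {}
    agent = HostAgent(AnalysisService(table))
    platform = RepairPlatform(agent, table)
    platform.trigger_dump("host-001", "VMSwitch hung waiting on lock")
    agent.upload_new_dumps()
    print(platform.poll("host-001"))
```

Keeping the table as the only link between the Analysis Service and the Repair Platform mirrors the decoupling described above: the platform never calls the analyzer directly, it only reads what has been persisted.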
Impact
By leveraging this approach, Azure Compute Repair Platform achieves both a better repair strategy and significant downtime savings.
(A) Better Repair Strategy
By precisely identifying failures, the Repair Platform can classify issues accurately and apply the most effective resolution method, minimizing unnecessary disruptions and enhancing long-term infrastructure stability.
For instance, without failure classification, the Repair Platform handles a VM Switch Hung issue by first attempting to mitigate the problem on the affected Azure Host Machine. If that fails, it migrates the customer’s workload to a more stable machine and initiates aggressive repairs on the faulty Azure Host Machine. While this restores service, it does not address the underlying cause, leaving the Azure Host Machine vulnerable to repeated VM Switch Hung failures.
With real-time failure classification, the Repair Platform can instead hold a subset of affected Azure Host Machines in a restricted state, preventing new Azure VMs from being assigned to them. This allows Azure’s hardware and network partners to run diagnostics, gain deeper insights into the failure, and implement targeted fixes. As a result, Azure has reduced recurring failures, minimized customer impact, and improved overall infrastructure reliability.
While the VM Switch Hung issue serves as an example, this data-driven repair strategy can be extended to various failure scenarios, enabling faster recovery, fewer disruptions, and a more resilient platform.
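As a rough illustration of such a classification-driven policy, the sketch below holds only a limited number of hosts with a diagnosable failure category and sends the rest through the usual repair path. The action names, the `DIAGNOSABLE_CATEGORIES` set, and the `max_hold` cap are assumptions made for the example; the post does not describe how the actual subset is chosen.

```python
from enum import Enum
from typing import Dict, List, Tuple


class RepairAction(Enum):
    # Host kept in a restricted state: no new Azure VMs are placed on it.
    HOLD_FOR_DIAGNOSTICS = "hold_for_diagnostics"
    # Usual path: move the workload and repair the host.
    MIGRATE_AND_REPAIR = "migrate_and_repair"


# Hypothetical: failure categories that partners want to investigate.
DIAGNOSABLE_CATEGORIES = {"VmSwitchHung"}


def choose_action(category: str, hosts_on_hold: int, max_hold: int) -> RepairAction:
    """Hold a host only if its category is diagnosable and the cap allows it."""
    if category in DIAGNOSABLE_CATEGORIES and hosts_on_hold < max_hold:
        return RepairAction.HOLD_FOR_DIAGNOSTICS
    return RepairAction.MIGRATE_AND_REPAIR


def plan_repairs(
    failures: List[Tuple[str, str]], max_hold: int = 2
) -> Dict[str, RepairAction]:
    """Assign an action to each (host_id, category) pair, in order."""
    plan: Dict[str, RepairAction] = {}
    held = 0
    for host_id, category in failures:
        action = choose_action(category, held, max_hold)
        if action is RepairAction.HOLD_FOR_DIAGNOSTICS:
            held += 1
        plan[host_id] = action
    return plan


if __name__ == "__main__":
    failures = [
        ("host-001", "VmSwitchHung"),
        ("host-002", "VmSwitchHung"),
        ("host-003", "VmSwitchHung"),
        ("host-004", "Unknown"),
    ]
    for host, action in plan_repairs(failures).items():
        print(host, action.value)
```

Capping the number of held hosts reflects the idea of holding only a subset: enough machines remain available for diagnostics without taking every affected host out of rotation.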
(B) Downtime Reduction
The longer it takes to resolve an issue, the longer a customer workload may experience interruptions. As a result, downtime reduction is one of the key metrics we prioritize.
We significantly reduce time to resolution by providing an early signal that pinpoints the exact issue. This allows the Repair Platform to perform targeted repairs rather than relying on time-consuming, broad mitigation strategies.
Sample scenario: A customer has trouble stopping or destroying an Azure VM, and the problem is severe enough that all repair attempts fail. The only option may then be to migrate the customer’s workload to a different Azure Host Machine. Today, it can take up to 26 minutes to reach the decision to move the workload. With the new approach, we aim to detect the failure and surface the issue within 3 minutes, so the decision can be made much earlier and customer downtime can drop by up to 23 minutes.
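The saving in this scenario is simple arithmetic on the two figures quoted above. The short sketch below works it out and also expresses it as a share of the current decision time; the function name is illustrative, and both inputs are the upper bounds given in the scenario rather than measured averages.

```python
from datetime import timedelta
from typing import Tuple


def downtime_saved(baseline: timedelta, with_lkd: timedelta) -> Tuple[float, float]:
    """Return (minutes saved, percent reduction) for the time to decision."""
    saved = baseline - with_lkd
    minutes = saved.total_seconds() / 60
    percent = 100 * (saved / baseline)  # timedelta / timedelta yields a float
    return minutes, percent


if __name__ == "__main__":
    # Figures from the scenario: up to 26 minutes today, within 3 minutes with LKD.
    minutes, percent = downtime_saved(timedelta(minutes=26), timedelta(minutes=3))
    print(f"{minutes:.0f} minutes saved ({percent:.0f}% shorter time to decision)")
```

Run as a script, this prints `23 minutes saved (88% shorter time to decision)`.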
Conclusion
Online kernel dump analysis for machine issue resolution marks a significant advancement in Azure’s commitment to reliability, bringing us closer to a future where failures are not just detected but proactively mitigated in real time. By enabling real-time diagnostics and automated repair strategies, this approach is redefining Compute reliability—drastically reducing mitigation times, enhancing repair accuracy, and ensuring customers experience seamless service continuity.
As we continue refining it, our focus remains on expanding its capabilities, enhancing kernel analysis, reducing analysis time, and strengthening the entire pipeline for greater efficiency and resilience. Stay tuned for further updates as we push the boundaries of intelligent cloud reliability.