Anatomy of the change
June 18, 2026
AKS incidents rarely stay within one Kubernetes object, signal, or tool. A latency spike might first appear in application telemetry, but the root cause may sit elsewhere: pod restarts, node pressure, scheduling failures, or a recent configuration change. The Azure Copilot Observability Agent in Azure Monitor helps connect these signals into an explainable investigation, so teams can move from symptoms to evidence-backed next steps.
Why AKS troubleshooting is complex
Troubleshooting Azure Kubernetes Service (AKS) is complex because failures can originate in workloads, platform components, infrastructure, or the application code running on the cluster. For example, pods stuck in Pending may indicate capacity or scheduling issues, while application latency may be caused by throttling, failed probes, pod restarts, or node pressure in the layers beneath the application.
During an incident, simply having more telemetry is not enough. Teams need a way to test likely causes, rule out unrelated signals, and keep the investigation tied to the affected workload and time window.
From signal to root cause: the investigation flow
The Observability Agent follows a consistent investigation pipeline:
- Scope the problem by identifying the most likely infrastructure resources involved, plus connected dependencies.
- Collect data across metrics, logs, traces, change history, and related signals.
- Detect anomalies using learned baselines (for metrics) and log analysis.
- Correlate across resources spanning infrastructure and application layers.
- Run deep diagnostics by invoking resource-specific tools when needed to pinpoint root cause.
- Summarize findings in a structured format: what happened, why it happened, and what to do next.
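To make the anomaly detection step concrete, here is a minimal sketch of flagging metric points that deviate from a baseline. It is an illustration only: the source does not describe how the Observability Agent learns its baselines, so the z-score approach, the `find_anomalies` helper, the threshold, and the sample latency values are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(baseline, recent, threshold=3.0):
    """Return (index, value, z_score) for recent points far from the baseline.

    Hypothetical helper: a plain z-score test, not the agent's actual method.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Flat baseline: treat any deviation from it as anomalous.
        return [(i, v, float("inf")) for i, v in enumerate(recent) if v != mu]
    anomalies = []
    for i, value in enumerate(recent):
        z = (value - mu) / sigma
        if abs(z) >= threshold:
            anomalies.append((i, value, z))
    return anomalies


# Illustrative latency samples in milliseconds (made-up values).
baseline_latency_ms = [120, 118, 125, 122, 119, 121, 123, 120]
recent_latency_ms = [121, 124, 310, 295]

flagged = find_anomalies(baseline_latency_ms, recent_latency_ms)
for index, value, z in flagged:
    print(f"point {index}: {value} ms (z = {z:.1f})")
```

With these sample values, only the last two points are flagged. A production baseline would also need to account for seasonality and trend, which this sketch ignores.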
AKS investigation data sources
The agent works with telemetry already available in your Azure Monitor environment. Investigation depth improves as more relevant signals are enabled, including Container insights logs, Kubernetes events and state, Azure Monitor managed service for Prometheus, container and pod logs, Application Insights telemetry for AKS-hosted workloads, Azure Activity Log changes, control plane logs routed through diagnostic settings, and resource metadata for the cluster, node pools, workloads, and related Azure resources.
Figure 1. AKS investigation data sources
You don’t need to enable every telemetry source to get started. The Observability Agent uses the data already available in Azure Monitor, and its findings become more complete as more AKS and application signals are collected.
Example 1: AKS infrastructure — explaining why new pods never start
Consider a workload rollout on AKS where replacement pods remain stuck in Pending state. What looks like a failed release may stem from the workload definition, cluster state, or underlying infrastructure.
Investigation walkthrough
- Symptom: rollout is blocked
Replacement pods remain in Pending during rollout, and Kubernetes events show repeated scheduling failures. This indicates that the rollout is blocked before new pods can start.
- Workload evidence: scheduling, not startup
Pod state identifies the affected workload, while Kubernetes events show repeated placement failures. The issue is therefore tied to scheduling rather than application startup or container crash behavior.
- Cluster evidence: capacity pressure
When enabled, Prometheus node metrics show CPU and memory utilization near capacity. Cluster-level trends show resource pressure increasing at the same time as pending pods and scheduling failures.
- Likely cause: insufficient schedulable capacity
The scheduler cannot place new pods because the relevant node pool does not have enough available capacity. The failed rollout is best explained by capacity pressure in the target node pool rather than an application crash or image startup failure.
- Recommended action
Scale out the affected node pool or adjust workload resource requests, then retry the rollout once schedulable capacity is restored.
Figure 2. AKS investigation flow
The Observability Agent connects pod state, scheduling events, and node pressure to explain why the rollout is blocked and which capacity action to consider next.
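The capacity reasoning in this example can be sketched as a simple fit check. The sketch assumes the usual Kubernetes behavior in which the scheduler compares a pod's resource requests against each node's allocatable capacity minus what is already requested. The `Node` class, the `nodes_that_fit` helper, and all node and pod figures are hypothetical, and the check ignores taints, affinity rules, and other scheduling constraints.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical snapshot of one node's schedulable capacity."""
    name: str
    allocatable_cpu_m: int   # allocatable CPU in millicores
    allocatable_mem_mi: int  # allocatable memory in MiB
    requested_cpu_m: int     # CPU already requested by scheduled pods
    requested_mem_mi: int    # memory already requested by scheduled pods


def nodes_that_fit(nodes, pod_cpu_m, pod_mem_mi):
    """Return names of nodes with enough unreserved CPU and memory for the pod."""
    fits = []
    for node in nodes:
        free_cpu = node.allocatable_cpu_m - node.requested_cpu_m
        free_mem = node.allocatable_mem_mi - node.requested_mem_mi
        if free_cpu >= pod_cpu_m and free_mem >= pod_mem_mi:
            fits.append(node.name)
    return fits


# Made-up node pool: node-a is short on CPU, node-b is short on memory.
pool = [
    Node("node-a", 3860, 12000, 3700, 9000),
    Node("node-b", 3860, 12000, 3500, 11800),
]

# A pod requesting 500m CPU and 512 MiB fits nowhere, so it stays Pending.
print(nodes_that_fit(pool, pod_cpu_m=500, pod_mem_mi=512))  # []

# Lowering the CPU request to 100m lets it fit on node-a.
print(nodes_that_fit(pool, pod_cpu_m=100, pod_mem_mi=512))  # ['node-a']
```

The two calls mirror the two recommended actions: an empty result is the case for scaling out the node pool, and the second call shows how a smaller resource request can restore schedulability.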
Example 2: Joint app-AKS investigation — tracing application latency to pod restarts
Now consider a customer-facing application where users see increased latency and intermittent HTTP 5xx errors after deployment. The first symptom appears in application telemetry, but the unhealthy requests are served by pods that are repeatedly restarting in AKS.
Investigation walkthrough
- Symptom: customer-facing service degradation
After deployment, application telemetry shows increased latency and HTTP 5xx errors. The first visible impact appears at the application layer.
- AKS evidence: unstable pods
Affected pods enter CrashLoopBackOff, restart counts increase, and Kubernetes events show back-off restarts, probe failures, or image or command errors. Container logs point to startup exceptions, missing configuration, or crash details.
- Resource evidence: workload-specific pressure
Container memory usage approaches configured limits before restarts, while node metrics show no broad node pressure. This suggests the issue is workload-specific rather than cluster-wide capacity related.
- Change evidence: deployment correlation
Deployment history shows a new image or configuration change shortly before restarts began, with no matching platform health event. The timing points to the latest deployment or configuration change.
- Recommended action
Review the latest image or configuration change, inspect container logs, adjust memory limits, or roll back if needed. Focus remediation on the workload change rather than node pool scaling.
This pattern shows how an application symptom can map back to AKS workload behavior. Application telemetry establishes the user impact, while Kubernetes events, container logs, and resource metrics help explain why the affected pods keep failing.
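Two of the checks in this walkthrough, resource evidence and change evidence, can be sketched in a few lines. This is an illustration under stated assumptions, not the agent's implementation: the `pressure_scope` and `changes_before` helpers, the 90% threshold, the 30-minute lookback window, and every timestamp and change description are hypothetical.

```python
from datetime import datetime, timedelta


def pressure_scope(container_mem_ratio, node_mem_ratios, limit=0.9):
    """Classify memory pressure as workload-specific or cluster-wide.

    Ratios are usage divided by limit (container) or by capacity (node).
    """
    hot_nodes = sum(ratio >= limit for ratio in node_mem_ratios)
    if hot_nodes > len(node_mem_ratios) / 2:
        return "cluster-wide"
    if container_mem_ratio >= limit:
        return "workload-specific"
    return "none"


def changes_before(onset, changes, lookback=timedelta(minutes=30)):
    """Return changes made within `lookback` before `onset`, newest first."""
    window_start = onset - lookback
    hits = [c for c in changes if window_start <= c[0] <= onset]
    return sorted(hits, key=lambda c: c[0], reverse=True)


# Made-up evidence: the container is near its memory limit, the nodes are not.
scope = pressure_scope(0.97, [0.55, 0.61, 0.48])

# Made-up change history around the time restarts began.
restart_onset = datetime(2026, 6, 18, 14, 12)
change_log = [
    (datetime(2026, 6, 18, 9, 0), "node pool upgrade"),
    (datetime(2026, 6, 18, 14, 5), "deployment: new image"),
    (datetime(2026, 6, 18, 14, 30), "configmap update"),
]
suspects = changes_before(restart_onset, change_log)

print(scope)
for when, description in suspects:
    print(f"{when:%H:%M} {description}")
```

Here the pressure is classified as workload-specific, and the only change inside the window is the new image deployed seven minutes before the restarts, which matches the walkthrough's conclusion. Timing alone shows correlation, so the container logs still need to confirm the cause.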
Operational impact
For site reliability engineers, platform teams, and IT professionals, the Observability Agent reduces the time spent moving between application and AKS telemetry. It brings relevant signals into one investigation, surfaces supporting evidence, and applies Azure Monitor and AKS context so your team can review the findings, validate the recommended path, and decide which production changes to make.
Figure 3. AKS investigation results
Using the Observability Agent
You can start using the Observability Agent from the Azure portal in two common AKS troubleshooting flows:
- Investigation mode: Start an investigation from an Azure Monitor alert on an AKS resource or from an Application Insights alert for an AKS-hosted workload. The agent uses the alert context to scope the incident, correlate application and cluster telemetry, and summarize the likely cause with recommended next steps.
- Chat-based exploration: Open the Monitor experience in AKS and select the Observability Agent button to chat with your telemetry. Use natural language to ask follow-up questions, explore logs and metrics, detect and inspect anomalies, and narrow down likely causes.
Figure 4. Starting Observability Agent from AKS Monitor experience
Next steps
Stay connected
- Follow this blog for ongoing deep dives, updates on current capabilities, and a preview of what’s coming next.
- Live webinar — A walkthrough of real Observability Agent scenarios, best practices, and what’s available today, along with a look at what’s coming next and live Q&A with the product team. Register for the Observability Agent webinar.
We’d love your feedback
The Observability Agent continues to evolve based on real-world usage and operator feedback. Share your thoughts directly through the Give Feedback option in the experience, or reach us at: azureobsagent@microsoft.com