Accelerating AKS troubleshooting with the Azure Copilot Observability Agent

Published by azurefeeds on June 18, 2026

Why AKS troubleshooting is complex

Troubleshooting Azure Kubernetes Service (AKS) is complex because failures can originate in workloads, platform components, infrastructure, or the application code running on the cluster. For example, pods stuck in Pending may indicate capacity or scheduling issues, while application latency may be caused by throttling, failed probes, pod restarts, or node pressure in the layers beneath the application.

During an incident, simply having more telemetry is not enough. Teams need a way to test likely causes, rule out unrelated signals, and keep the investigation tied to the affected workload and time window.

From signal to root cause: the investigation flow

The Observability Agent follows a consistent investigation pipeline:

Scope the problem by identifying the most likely infrastructure resources involved, plus connected dependencies.

Collect data across metrics, logs, traces, change history, and related signals.

Detect anomalies using learned baselines (for metrics) and log analysis.

Correlate across resources spanning infrastructure and application layers.

Run deep diagnostics by invoking resource-specific tools when needed to pinpoint root cause.

Summarize findings in a structured format: what happened, why it happened, and what to do next.
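To make the anomaly detection step more concrete, here is a minimal sketch of flagging metric samples that deviate from a baseline. It is only an illustration: the post does not describe how the Observability Agent learns or applies its baselines, so the z-score approach, the `find_anomalies` helper, the threshold of 3.0, and the sample values are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(baseline, recent, threshold=3.0):
    """Return (index, value, z_score) for recent samples far from the baseline.

    Hypothetical helper: a plain z-score check standing in for whatever
    learned baseline the agent actually uses.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Flat baseline: report any deviation and avoid dividing by zero.
        return [(i, v, float("inf")) for i, v in enumerate(recent) if v != mu]
    anomalies = []
    for i, value in enumerate(recent):
        z = (value - mu) / sigma
        if abs(z) >= threshold:
            anomalies.append((i, value, round(z, 2)))
    return anomalies


# Hypothetical node CPU utilization (percent), sampled once per minute.
baseline_cpu = [41, 43, 40, 42, 44, 39, 41, 43, 42, 40]
recent_cpu = [42, 45, 88, 93]

for index, value, z in find_anomalies(baseline_cpu, recent_cpu):
    print(f"sample {index}: {value}% CPU (z={z})")
```

A fixed baseline like this ignores daily or weekly seasonality, which is one reason a production system would likely use something more sophisticated.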

AKS investigation data sources

The agent works with telemetry already available in your Azure Monitor environment. Investigation depth improves as more relevant signals are enabled, including Container insights logs, Kubernetes events and state, Azure Monitor managed service for Prometheus, container and pod logs, Application Insights telemetry for AKS-hosted workloads, Azure Activity Log changes, control plane logs routed through diagnostic settings, and resource metadata for the cluster, node pools, workloads, and related Azure resources.

Figure 1. AKS investigation data sources

You don’t need to enable every telemetry source to get started. The Observability Agent uses the data already available in Azure Monitor, and its findings become more complete as more AKS and application signals are collected.

Example 1: AKS infrastructure — explaining why new pods never start

Consider a workload rollout on AKS where replacement pods remain stuck in Pending state. What looks like a failed release may stem from the workload definition, cluster state, or underlying infrastructure.

Investigation walkthrough

Symptom: rollout is blocked
Replacement pods remain in Pending during rollout, and Kubernetes events show repeated scheduling failures. This indicates that the rollout is blocked before new pods can start.

Workload evidence: scheduling, not startup
Pod state identifies the affected workload, while Kubernetes events show repeated placement failures. The issue is therefore tied to scheduling rather than application startup or container crash behavior.

Cluster evidence: capacity pressure
When enabled, Prometheus node metrics show CPU and memory utilization near capacity. Cluster-level trends show resource pressure increasing at the same time as pending pods and scheduling failures.

Likely cause: insufficient schedulable capacity
The scheduler cannot place new pods because the relevant node pool does not have enough available capacity. The failed rollout is best explained by capacity pressure in the target node pool rather than an application crash or image startup failure.

Recommended action
Scale out the affected node pool or adjust workload resource requests, then retry the rollout once schedulable capacity is restored.
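The reasoning in this walkthrough can be approximated by hand. The sketch below, written under stated assumptions, tallies the "Insufficient <resource>" reasons from `FailedScheduling` events and checks whether a pod's requests fit on any node. The event dictionaries follow the shape of Kubernetes Event objects (`reason`, `message`, `involvedObject`), and the messages are modeled on typical scheduler wording. The pod name, the node dictionaries, and field names such as `allocatable_cpu_m` are hypothetical, and nothing here represents the agent's own implementation.

```python
import re

# Matches scheduler text such as "3 Insufficient cpu" inside an event message.
INSUFFICIENT = re.compile(r"(\d+) Insufficient ([\w\-/.]*\w)")


def scheduling_blockers(events, pod_name):
    """Tally 'Insufficient <resource>' reasons from FailedScheduling events for one pod."""
    blockers = {}
    for event in events:
        if event.get("reason") != "FailedScheduling":
            continue
        if event.get("involvedObject", {}).get("name") != pod_name:
            continue
        for count, resource in INSUFFICIENT.findall(event.get("message", "")):
            # Keep the largest node count reported for each resource.
            blockers[resource] = max(blockers.get(resource, 0), int(count))
    return blockers


def fits_any_node(request, nodes):
    """True if at least one node has enough free CPU (millicores) and memory (MiB)."""
    return any(
        node["allocatable_cpu_m"] - node["requested_cpu_m"] >= request["cpu_m"]
        and node["allocatable_mem_mi"] - node["requested_mem_mi"] >= request["mem_mi"]
        for node in nodes
    )


# Hypothetical sample data for a pod stuck in Pending.
events = [
    {"reason": "FailedScheduling",
     "involvedObject": {"kind": "Pod", "name": "checkout-7d9f-abcde"},
     "message": "0/3 nodes are available: 3 Insufficient cpu."},
    {"reason": "FailedScheduling",
     "involvedObject": {"kind": "Pod", "name": "checkout-7d9f-abcde"},
     "message": "0/3 nodes are available: 1 Insufficient memory, 3 Insufficient cpu."},
    {"reason": "Pulled",
     "involvedObject": {"kind": "Pod", "name": "cart-5c4b-fghij"},
     "message": "Container image already present on machine"},
]
nodes = [
    {"allocatable_cpu_m": 4000, "requested_cpu_m": 3800,
     "allocatable_mem_mi": 16000, "requested_mem_mi": 8000},
    {"allocatable_cpu_m": 4000, "requested_cpu_m": 3900,
     "allocatable_mem_mi": 16000, "requested_mem_mi": 8000},
    {"allocatable_cpu_m": 4000, "requested_cpu_m": 3700,
     "allocatable_mem_mi": 16000, "requested_mem_mi": 8000},
]

print(scheduling_blockers(events, "checkout-7d9f-abcde"))
print(fits_any_node({"cpu_m": 500, "mem_mi": 512}, nodes))
```

The second check also hints at the two remediation paths: adding nodes increases allocatable capacity, while lowering the workload's requests shrinks what each pod needs to fit.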

Figure 2. AKS investigation flow

The Observability Agent connects pod state, scheduling events, and node pressure to explain why the rollout is blocked and which capacity action to consider next.

Example 2: Joint app-AKS investigation — tracing application latency to pod restarts

Now consider a customer-facing application where users see increased latency and intermittent HTTP 5xx errors after a deployment. The first symptom appears in application telemetry, but the unhealthy requests are served by pods that are repeatedly restarting in AKS.

Investigation walkthrough

Symptom: customer-facing service degradation
After a deployment, application telemetry shows increased latency and HTTP 5xx errors. The first visible impact appears at the application layer.

AKS evidence: unstable pods
Affected pods enter CrashLoopBackOff, restart counts increase, and Kubernetes events show back-off restarts, probe failures, or image or command errors. Container logs point to startup exceptions, missing configuration, or crash details.

Resource evidence: workload-specific pressure
Container memory usage approaches configured limits before restarts, while node metrics show no broad node pressure. This suggests the issue is specific to the workload rather than a cluster-wide capacity problem.

Change evidence: deployment correlation
Deployment history shows a new image or configuration change shortly before restarts began, with no matching platform health event. The timing points to the latest deployment or configuration change.

Recommended action
Review the latest image or configuration change, inspect container logs, adjust memory limits, or roll back if needed. Focus remediation on the workload change rather than node pool scaling.
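As a rough illustration of how the AKS and resource evidence in this walkthrough combine, the sketch below inspects one container status and decides whether the problem looks workload-specific or cluster-wide. The `state.waiting.reason`, `lastState.terminated.reason`, and `restartCount` fields follow the Kubernetes pod status schema. The `diagnose_restarts` helper, the 90% memory threshold, and the sample values are assumptions made for this example, not the agent's logic.

```python
def diagnose_restarts(container_status, mem_usage_mi, mem_limit_mi, nodes_under_pressure):
    """Summarize restart evidence for one container.

    Hypothetical helper. Returns (scope, findings), where scope is
    'workload-specific' when no nodes report pressure, else 'cluster-wide'.
    """
    findings = []
    restarts = container_status.get("restartCount", 0)

    waiting = container_status.get("state", {}).get("waiting", {})
    if waiting.get("reason") == "CrashLoopBackOff":
        findings.append(f"CrashLoopBackOff after {restarts} restarts")

    terminated = container_status.get("lastState", {}).get("terminated", {})
    if terminated.get("reason") == "OOMKilled":
        findings.append("last termination was OOMKilled")

    # Assumed threshold: treat usage at or above 90% of the limit as "near limit".
    if mem_limit_mi and mem_usage_mi / mem_limit_mi >= 0.9:
        percent = round(100 * mem_usage_mi / mem_limit_mi)
        findings.append(f"memory at {percent}% of limit before restart")

    scope = "cluster-wide" if nodes_under_pressure else "workload-specific"
    return scope, findings


# Hypothetical status for a container that keeps restarting.
status = {
    "name": "api",
    "restartCount": 7,
    "state": {"waiting": {"reason": "CrashLoopBackOff"}},
    "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
}

scope, findings = diagnose_restarts(
    status, mem_usage_mi=490, mem_limit_mi=512, nodes_under_pressure=[]
)
print(scope)
for finding in findings:
    print("-", finding)
```

When the scope comes back as workload-specific, scaling the node pool is unlikely to help, which matches the recommendation to focus on the workload change.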

This pattern shows how an application symptom can map back to AKS workload behavior. Application telemetry establishes the user impact, while Kubernetes events, container logs, and resource metrics help explain why the affected pods keep failing.
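The change evidence step comes down to asking which changes landed shortly before the first failure. A minimal sketch of that time-window correlation follows. The `changes_before` helper, the 30-minute lookback, and the change records are all hypothetical; in practice the records would come from deployment history or the Azure Activity Log.

```python
from datetime import datetime, timedelta


def changes_before(incident_start, changes, lookback=timedelta(minutes=30)):
    """Return changes inside the lookback window before the incident, newest first."""
    window_start = incident_start - lookback
    candidates = [c for c in changes if window_start <= c["time"] <= incident_start]
    return sorted(candidates, key=lambda c: c["time"], reverse=True)


# Hypothetical timeline: restarts begin at 10:12.
first_restart = datetime(2026, 6, 18, 10, 12)
changes = [
    {"name": "node pool upgrade", "time": datetime(2026, 6, 18, 7, 30)},
    {"name": "configmap update", "time": datetime(2026, 6, 18, 9, 50)},
    {"name": "deployment image update", "time": datetime(2026, 6, 18, 10, 5)},
]

for change in changes_before(first_restart, changes):
    minutes = int((first_restart - change["time"]).total_seconds() // 60)
    print(f"{change['name']}: {minutes} min before first restart")
```

Timing alone shows correlation rather than causation, so the candidates still need to be checked against container logs and events before anyone rolls back.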

Operational impact

For site reliability engineers, platform teams, and IT professionals, the Observability Agent reduces the time spent moving between application and AKS telemetry. It brings relevant signals into one investigation, surfaces supporting evidence, and applies Azure Monitor and AKS context so your team can review the findings, validate the recommended path, and decide which production changes to make.

Figure 3. AKS investigation results

Using the Observability Agent

You can start using the Observability Agent from the Azure portal in two common AKS troubleshooting flows:

Investigation mode: Start an investigation from an Azure Monitor alert on an AKS resource or from an Application Insights alert for an AKS-hosted workload. The agent uses the alert context to scope the incident, correlate application and cluster telemetry, and summarize the likely cause with recommended next steps.

Chat-based exploration: Open the Monitor experience in AKS and select the Observability Agent button to chat with your telemetry. Use natural language to ask follow-up questions, explore logs and metrics, detect and inspect anomalies, and narrow down likely causes.

Figure 4. Starting Observability Agent from AKS Monitor experience

Next steps

Azure Copilot Observability Agent overview

Monitor Azure Kubernetes Service with Azure Monitor

Stay connected

Follow this blog for ongoing deep dives, updates on current capabilities, and a preview of what’s coming next.

Live webinar — A walkthrough of real Observability Agent scenarios, best practices, and what’s available today, along with a look at what’s coming next and live Q&A with the product team. Register for the Observability Agent webinar.

We’d love your feedback

The Observability Agent continues to evolve based on real-world usage and operator feedback. Share your thoughts directly through the Give Feedback option in the experience, or reach us at: azureobsagent@microsoft.com
