How to enable alerts in Batch especially when a node is encountering high disk usage

Published by azurefeeds on August 6, 2025

Best practices to follow to avoid issues with high disk usage in Azure Batch:

When a node is experiencing high disk usage, as an initial step you can connect to the node over RDP and check where most of the space is being consumed. Identify which applications and files are using the most disk space and check whether they can be deleted.
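Once connected to the node, a small script can make this inspection quicker. The following is a minimal sketch, not an official Batch tool: it uses only the Python standard library, and the helper names `disk_usage_percent` and `largest_entries` are made up for illustration. It assumes Python is available on the node.

```python
import os
import shutil


def disk_usage_percent(path):
    """Return the percentage of the volume containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_entries(root, top=10):
    """Return up to `top` (size_in_bytes, path) pairs for the direct children of `root`,
    largest first. Directory sizes are the sum of the files beneath them."""
    sizes = []
    for entry in os.scandir(root):
        total = 0
        if entry.is_file(follow_symlinks=False):
            total = entry.stat(follow_symlinks=False).st_size
        elif entry.is_dir(follow_symlinks=False):
            # Unreadable directories are skipped instead of aborting the scan.
            for dirpath, _dirnames, filenames in os.walk(entry.path, onerror=lambda err: None):
                for name in filenames:
                    try:
                        total += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        pass  # file vanished or is not readable; skip it
        sizes.append((total, entry.path))
    return sorted(sizes, reverse=True)[:top]
```

For example, calling `largest_entries(r"D:\batch\tasks")` would list the biggest items under that folder; the path here is only a placeholder, so substitute the drive or directory you want to inspect on your node.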

A node can experience high disk usage on either the OS disk or the ephemeral disk. The ephemeral disk holds all files related to the task working directory, such as task output files and resource files, whereas the OS disk is separate. The default operating system (OS) disk in Azure is usually only 127 GiB, and this default cannot be changed. In Batch pools that use a custom image, users might need to expand the OS disk when OS disk usage on the node is high.

Expand virtual hard disks attached to a Windows VM in an Azure – Azure Virtual Machines | Microsoft Learn

After you have allocated the extra disk space on the custom image VM, you can create a new pool from the updated image.

If you want to clear files on the node manually, refer to Azure Batch node gets stuck in the Unusable state because of configuration issues – Azure | Microsoft Learn

Switch to a higher VM SKU

In some cases, simply creating a new pool with a larger VM SKU than the existing one is enough to avoid disk issues on the nodes.

Save task data

A task should move its output off the node it’s running on and into a durable store before it completes. Similarly, if a task fails, it should move the logs required to diagnose the failure to a durable store.

It is the user's responsibility to ensure that output data is moved to a durable store before the node or job is deleted.

Persist output data to Azure Storage with Batch service API – Azure Batch | Microsoft Learn
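As an illustration of the approach described in the linked document, the sketch below builds the `outputFiles` section of a task definition as a plain Python dictionary in the shape used by the Batch REST API. The function name, the file patterns, the blob paths, and the container URL are all assumptions for the example; check the property names against the current API reference before relying on them.

```python
def build_output_files(container_sas_url, task_id):
    """Build an `outputFiles` list for a Batch task definition (REST API JSON shape).

    `container_sas_url` is assumed to be a blob container URL with a SAS token
    that grants write access. The file patterns are hypothetical examples.
    """
    def destination(subfolder):
        return {
            "container": {
                "containerUrl": container_sas_url,
                "path": f"{task_id}/{subfolder}",
            }
        }

    return [
        {
            # Upload result files only when the task succeeds.
            "filePattern": "results/*.csv",
            "destination": destination("results"),
            "uploadOptions": {"uploadCondition": "tasksuccess"},
        },
        {
            # Upload stdout/stderr (one level above the working directory)
            # only when the task fails, so the failure can be diagnosed later.
            "filePattern": "../std*.txt",
            "destination": destination("logs"),
            "uploadOptions": {"uploadCondition": "taskfailure"},
        },
    ]
```

Splitting the upload conditions this way keeps storage usage low for successful runs while still preserving logs for failed ones.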

Clear files

If a retentionTime is set, Batch automatically cleans up the disk space used by the task when the retentionTime expires. By default, the task directory is retained for 7 days unless the compute node is removed or the job is deleted. This cleanup helps ensure that your nodes don’t fill up with task data and run out of disk space. Users can set retentionTime to a low value so that task data is deleted from the node soon after the task completes.
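In the Batch REST API, retentionTime is part of the task constraints and is expressed as an ISO 8601 duration. The sketch below shows one way to produce that value from a Python `timedelta`; the helper names `to_iso8601_duration` and `task_constraints` are made up for this example, and the resulting dictionary is only a fragment of a task definition.

```python
from datetime import timedelta


def to_iso8601_duration(delta):
    """Convert a non-negative timedelta to an ISO 8601 duration string (whole seconds)."""
    total = int(delta.total_seconds())
    if total < 0:
        raise ValueError("retention time must not be negative")
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    date_part = f"{days}D" if days else ""
    time_part = "".join(
        f"{value}{unit}"
        for value, unit in ((hours, "H"), (minutes, "M"), (seconds, "S"))
        if value
    )
    if not date_part and not time_part:
        return "PT0S"
    return "P" + date_part + ("T" + time_part if time_part else "")


def task_constraints(retention):
    """Return the `constraints` fragment of a task definition with the given retention."""
    return {"constraints": {"retentionTime": to_iso8601_duration(retention)}}
```

For example, `to_iso8601_duration(timedelta(days=7))` gives `"P7D"`, and `task_constraints(timedelta(hours=2))` yields a retention time of `"PT2H"`.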

In some scenarios, the task is triggered from an ADF pipeline that is integrated with Batch. In that case, the custom activity has its own retention time for the files it submits, and the default value is 30 days. Users can change this retention time in the custom activity settings of the ADF pipeline.
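As a rough sketch of where this setting lives, the snippet below builds a custom activity definition as a Python dictionary and prints it as JSON. It assumes the setting is exposed as `retentionTimeInDays` under the activity's `typeProperties`; the activity name, linked service name, and command are placeholders, so verify the exact property name against your own pipeline JSON.

```python
import json


def custom_activity(name, linked_service, command, retention_days):
    """Return a hypothetical ADF custom activity definition with a retention setting."""
    return {
        "name": name,
        "type": "Custom",
        "linkedServiceName": {
            "referenceName": linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "command": command,
            # Assumed property name; the source states the default is 30 days.
            "retentionTimeInDays": retention_days,
        },
    }


# Placeholder names: replace with the values from your own data factory.
activity = custom_activity("RunBatchJob", "AzureBatchLinkedService", "python main.py", 3)
print(json.dumps(activity, indent=2))
```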

Now let’s see how to get notified when a Batch node experiences high disk usage.

Step 1: First, follow the document below to integrate Azure Monitor with your Batch nodes. The Azure Monitor service collects and aggregates metrics and logs from every component of the node.

Integrating Azure Monitor in Azure Batch to monitor Batch Pool nodes performance | Microsoft Community Hub

Step 2: Once the Azure Monitor Agent (AMA) is configured, navigate in the portal to the VMSS for which you enabled metrics. Go to the Metrics section and select Virtual Machine Guest as the Metric Namespace.

Step 3: From the metrics dropdown, select the performance counter you want to view.

Step 4: Navigate to the Alerts section from the menu and create an alert rule.

Step 5: Select the performance counter for which you want to receive alerts. For example, you can create a signal based on the percentage of free space available on the VMSS.

Step 6: Once you select the signal, you are asked to provide the remaining details for the alert logic. In this example, the alert is triggered when the average percentage of free space available on the VMSS instances is less than or equal to 20%. The rule is evaluated every hour and checks the average over the preceding hour.
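To make the alert logic concrete, the sketch below reproduces the same condition in plain Python: average the free-space samples from the last hour and compare the result with the threshold. This is only an illustration of the rule described in Step 6, not how Azure Monitor is implemented; the function name and the sample format are assumptions.

```python
from datetime import datetime, timedelta
from statistics import mean


def should_alert(samples, now, window=timedelta(hours=1), threshold=20.0):
    """Return True when the average free-space percentage within `window`
    (ending at `now`) is less than or equal to `threshold`.

    `samples` is an iterable of (timestamp, percent_free) pairs.
    """
    recent = [value for stamp, value in samples if now - window < stamp <= now]
    if not recent:
        return False  # no data in the window, so nothing to evaluate
    return mean(recent) <= threshold


# Hypothetical samples: free space drops during the last hour.
now = datetime(2025, 8, 6, 12, 0)
samples = [
    (now - timedelta(minutes=90), 60.0),  # outside the one-hour window, ignored
    (now - timedelta(minutes=45), 25.0),
    (now - timedelta(minutes=30), 20.0),
    (now - timedelta(minutes=15), 12.0),
]
print(should_alert(samples, now))
```

Because the rule uses the average rather than the latest value, a short spike in disk usage does not fire the alert by itself, while a sustained drop in free space does.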

Step 7: Proceed through the remaining steps and configure your email address and the alert rule description to receive notifications.

Refer to the document below for more information on alerts.

Create Azure Monitor metric alert rules – Azure Monitor | Microsoft Learn

In this way, users can enable alerts and receive email notifications based on metrics for their Batch nodes.
