
August 6, 2025
Batch users often encounter issues such as nodes suddenly entering an unusable state because of high CPU or disk usage. Alerts allow you to identify and address issues in your system. This blog focuses on how users can enable alerts when a node is consuming a high amount of disk space by configuring a threshold. With this in place, users can be notified before the node becomes unusable and take pre-emptive measures to avoid service disruptions.
The task output data is written to the file system of the Batch node. When this data uses more than 90 percent of the disk capacity of the node SKU, the Batch service marks the node as unusable and blocks it from running any other tasks until the Batch service performs a cleanup. The Batch node agent reserves 10 percent of the disk space for its own functionality. Depending on the capacity of the Batch node, it's essential to keep enough free space on the disk before any tasks are scheduled to run.
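To make the 90 percent limit concrete, here is a minimal sketch that computes how much room is left on a disk before usage crosses that threshold. The function names `disk_headroom` and `check_path` are hypothetical helpers for illustration, not part of any Batch SDK, and the 90 percent figure is simply taken from the paragraph above.

```python
import shutil

# From the text: the node is marked unusable above 90 percent disk usage.
UNUSABLE_THRESHOLD_PERCENT = 90


def disk_headroom(total_bytes: int, used_bytes: int,
                  threshold_percent: int = UNUSABLE_THRESHOLD_PERCENT) -> int:
    """Return how many bytes can still be written before usage exceeds the threshold.

    A negative result means the disk is already past the threshold.
    """
    limit_bytes = total_bytes * threshold_percent // 100
    return limit_bytes - used_bytes


def check_path(path: str = ".") -> int:
    """Compute the headroom for the disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return disk_headroom(usage.total, usage.used)
```

For example, on a disk with 1,000 bytes total and 850 bytes used, `disk_headroom(1000, 850)` returns 50, meaning only 50 more bytes fit before the threshold is crossed.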
Best practices to follow to avoid issues with high disk usage in Azure Batch:
When a node is experiencing high disk usage, as an initial step you can RDP to the node and check where most of the space is being consumed. Identify which apps and files are taking up the most disk space and check whether they can be deleted.
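Once you are connected to the node, a short script can speed up the search for large files. The sketch below uses only the Python standard library; `largest_files` is a hypothetical helper name, and it assumes Python is available on the node.

```python
import heapq
import os


def largest_files(root: str, top_n: int = 10) -> list[tuple[int, str]]:
    """Return the top_n largest files under root as (size_bytes, path), biggest first."""
    sizes = []
    # onerror swallows directories that cannot be listed instead of raising.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Skip symlinks so a link is not counted as the file it points to.
                if os.path.islink(path):
                    continue
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # Files can vanish or be unreadable while tasks are running.
                continue
    return heapq.nlargest(top_n, sizes)
```

On a Batch node you could pass the directory named by the `AZ_BATCH_NODE_ROOT_DIR` environment variable as `root`, then print each `(size, path)` pair to see which task directories are worth cleaning up.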
A node can experience high disk usage on either the OS disk or the ephemeral disk. The ephemeral disk contains all files related to the task working directory, such as task output files and resource files, whereas the OS disk holds the operating system. The default operating system (OS) disk in Azure is usually only 127 GiB, and this default cannot be changed. In Batch pools that use a custom image, users might need to expand the OS disk when the node runs into high OS disk usage.
After you have allocated extra disk space on the custom image VM, you can create a new pool with the latest image.
If you want to clear files on the node manually, please refer to Azure Batch node gets stuck in the Unusable state because of configuration issues – Azure | Microsoft Learn
Switch to a higher VM SKU
In some cases, simply creating a new pool with a higher VM SKU than the existing one is enough to avoid disk issues on the node.
Save Task data
A task should move its output off the node it’s running on, and to a durable store before it completes. Similarly, if a task fails, it should move logs required to diagnose the failure to a durable store.
It is the user's responsibility to ensure that the output data is moved to a durable store before the node or job is deleted.
Persist output data to Azure Storage with Batch service API – Azure Batch | Microsoft Learn
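As a rough illustration of what persisting output through the Batch service API can look like, the sketch below builds one entry for a task's `outputFiles` list as a plain Python dictionary. The field names follow the Batch REST task schema as best understood here, so please verify them against the linked document; `build_output_file` is a hypothetical helper, and the container URL in the usage note is a placeholder.

```python
import json

# Upload conditions as listed in the Batch REST reference (assumed spelling).
ALLOWED_CONDITIONS = {"tasksuccess", "taskfailure", "taskcompletion"}


def build_output_file(file_pattern: str, container_url: str, blob_path: str,
                      upload_condition: str = "taskcompletion") -> dict:
    """Build one outputFiles entry that uploads matching files to a blob container."""
    if upload_condition not in ALLOWED_CONDITIONS:
        raise ValueError(f"unknown upload condition: {upload_condition!r}")
    return {
        # Which files in the task working directory to upload.
        "filePattern": file_pattern,
        # Where to put them: a container URL (typically with a SAS) and a blob prefix.
        "destination": {
            "container": {"containerUrl": container_url, "path": blob_path}
        },
        # When to upload: on success, on failure, or in either case.
        "uploadOptions": {"uploadCondition": upload_condition},
    }


def to_json(entries: list[dict]) -> str:
    """Serialize the entries as the outputFiles fragment of a task body."""
    return json.dumps({"outputFiles": entries}, indent=2)
```

A call such as `build_output_file("../std*.txt", "<container-url-with-SAS>", "task1/logs", "taskfailure")` would describe uploading the task's stdout and stderr logs only when the task fails, which matches the advice above about keeping logs needed to diagnose a failure.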
Clear files
If a retentionTime is set, Batch automatically cleans up the disk space used by the task when the retentionTime expires. By default, the task directory is retained for 7 days unless the compute node is removed or the job is deleted. This action helps ensure that your nodes don't fill up with task data and run out of disk space. Users can set retentionTime to a low value so that output data is deleted soon after the task completes.
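In the Batch REST API, retentionTime is part of the task constraints and is written as an ISO 8601 duration. The sketch below converts a Python `timedelta` into that format and wraps it in a constraints fragment. Both helper names are hypothetical, and the conversion only covers days, hours, minutes, and whole seconds.

```python
from datetime import timedelta


def to_iso8601_duration(delta: timedelta) -> str:
    """Format a non-negative timedelta as an ISO 8601 duration such as 'PT1H30M'."""
    total = int(delta.total_seconds())
    if total < 0:
        raise ValueError("retention time cannot be negative")
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    date_part = f"{days}D" if days else ""
    time_part = "".join(
        f"{value}{unit}"
        for value, unit in ((hours, "H"), (minutes, "M"), (seconds, "S"))
        if value
    )
    if not date_part and not time_part:
        return "PT0S"  # zero duration
    return "P" + date_part + ("T" + time_part if time_part else "")


def task_constraints(retention: timedelta) -> dict:
    """Build the constraints fragment of a task body with the given retention time."""
    return {"constraints": {"retentionTime": to_iso8601_duration(retention)}}
```

With this, `task_constraints(timedelta(hours=2))` produces `{"constraints": {"retentionTime": "PT2H"}}`, while the 7-day default corresponds to `P7D`.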
In some scenarios, the task is triggered from an ADF pipeline that is integrated with Batch. In this case, a separate retention time applies to the files submitted for the custom activity; its default value is 30 days. Users can set this retention time in the custom activity settings of the ADF pipeline.
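For reference, here is a hedged sketch of where that setting sits in a custom activity definition, built as a Python dictionary and printed as JSON. It assumes the property is named `retentionTimeInDays` under `typeProperties`, which is how it is understood to appear in the ADF custom activity schema; the activity name, command, and linked service name are placeholders, and `custom_activity` is a hypothetical helper.

```python
import json


def custom_activity(name: str, command: str, batch_linked_service: str,
                    retention_days: int = 30) -> dict:
    """Build a minimal ADF custom activity definition with an explicit retention time."""
    if retention_days < 0:
        raise ValueError("retention_days must not be negative")
    return {
        "name": name,
        "type": "Custom",
        # Reference to the Azure Batch linked service (placeholder name).
        "linkedServiceName": {
            "referenceName": batch_linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "command": command,
            # Assumed property name; 30 days is the default stated in the text.
            "retentionTimeInDays": retention_days,
        },
    }


if __name__ == "__main__":
    activity = custom_activity("MyCustomActivity", "python main.py",
                               "AzureBatchLinkedService", retention_days=1)
    print(json.dumps(activity, indent=2))
```

Lowering the value, as in the example with `retention_days=1`, keeps files submitted by the pipeline from accumulating on the node for a full month.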
Now let’s see how to get notified when a Batch node experiences high disk usage.
Step 1: First, follow the Azure Monitor documentation to integrate Azure Monitor with your Batch nodes. The Azure Monitor service collects and aggregates metrics and logs from every component of the node.
Step 2: Once the Azure Monitor Agent (AMA) is configured, navigate in the portal to the VMSS for which you enabled metrics. Go to the Metrics section and select Virtual Machine Guest from the Metric Namespace dropdown.
Step 3: From the Metric dropdown, select the performance counter you want to inspect.
Step 4: Now navigate to the Alerts section from the menu and create an alert rule.
Step 5: Here you can select any performance counter for which you want to receive alerts. The example below creates a signal based on the percentage of free space available on the VMSS.
Step 6: Once you select the signal, you are asked to provide further details for the alert logic. The snapshot below shows an alert that triggers when the average percentage of free space available on the VMSS instances is less than or equal to 20%. The alert is evaluated every hour and checks the average over the past hour.
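The alert logic in Step 6 can be pictured with a small simulation. The sketch below is not how Azure Monitor is implemented; it only mirrors the rule described above (average free space over the evaluation window, compared against a 20% threshold). `should_fire` is a hypothetical name, and treating an empty window as "no alert" is an assumption made for this example.

```python
from statistics import mean


def should_fire(free_space_percent_samples: list[float],
                threshold_percent: float = 20.0) -> bool:
    """Return True when the window's average free space is at or below the threshold.

    The samples represent the free-space percentages collected during one
    evaluation window (one hour in the rule described above).
    """
    if not free_space_percent_samples:
        # Assumption for this sketch: no data in the window means no alert.
        return False
    return mean(free_space_percent_samples) <= threshold_percent
```

Because the rule uses the average, a single low reading does not fire the alert on its own: `should_fire([60, 55, 15])` is False, whereas a window of consistently low readings such as `[22, 19, 18]` is True. If you need to react to short dips, a shorter window or a minimum aggregation may suit you better.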
Step 7: Proceed through the remaining steps and configure your email address and an alert rule description to receive notifications.
For more information on alerts, refer to the document below.
Create Azure Monitor metric alert rules – Azure Monitor | Microsoft Learn
In this way, users can enable alerts to receive notifications based on metrics for their Batch nodes. Below is a sample email alert notification.