Introduction
Ensuring reliable, high-performance serverless applications is central to our work on Azure Functions. As new plans and features (like the Flex Consumption plan) continue to expand Functions’ capabilities, we need robust throughput testing. We built PerfBench (Performance Benchmarker) to measure, monitor, and maintain our performance metrics—catching regressions before they impact customers.
An example of what we monitor and consume:
Motivation: Why We Built PerfBench
The Need for Scale
Azure Functions supports a range of triggers, from HTTP requests to event-driven flows like Service Bus or Storage Queue messages. With an ever-growing set of runtimes (e.g., .NET, Node.js, Python, Java, PowerShell) and versions (like Python 3.11 or .NET 8.0), plus multiple SKUs and regions, the possible test combinations explode quickly. Manual testing or single-scenario benchmarks no longer cut it. The table below shows the current scope of test coverage.
| Plan | Pricing Tier | Distinct Tests |
| --- | --- | --- |
| FlexConsumption | FLEX2048 | 110 |
| FlexConsumption | FLEX512 | 20 |
| Consumption | CNS | 36 |
| App Service Plan | P1V3 | 32 |
| Functions Premium | EP1 | 46 |
Table 1: Different test combinations per plan based on Stack, Pricing Tier, Scenario, etc. This doesn’t include the ServiceBus tests.
The Flex Consumption Plan
There have been many iterations of this infrastructure within the team, and we have been continuously monitoring Functions performance for more than four years now (with more than a million runs to date). But with the introduction of the Flex Consumption plan (in preview at the time PerfBench was built), we had to redesign the testing from the ground up: Flex Consumption unlocks new scaling behaviors and needed thorough testing, at the scale of millions of messages or tens of thousands of requests per second, to give us confidence in our performance goals and to prevent regressions.
PerfBench: High-Level Architecture
Overview
PerfBench is composed of several key pieces:
- Resource Creator – Uses meta files and Bicep templates to deploy receiver function apps (test targets) at scale.
- Test Infra Generator – Deploys and configures the system that actually does the load generation (e.g., SBLoadGen function app, Scheduler function app, ALT webhook function).
- Test Infra – The “brain” of testing, including the Scheduler, Azure Load Testing integration, and SBLoadGen.
- Receiver Function Apps – Deployed once per combination of runtime, version, region, OS, SKU, and scenario.
- Data Aggregation & Dashboards – Gathers test metrics from Azure Load Testing (ALT) or SBLoadGen, stores them in Azure Data Explorer (ADX), and displays trends in ADX dashboards.
Below is a simplified architecture diagram illustrating these components:
Components
Resource Creator
The resource creator uses meta files and Jinja templates to generate Bicep templates for creating resources.
- Meta Files: We define test scenarios in simple text-based files (e.g., os.txt, runtime_version.txt, sku.txt, scenario.txt). Each file lists possible values (like python|3.11 or dotnet|8.0) and short codes for resource naming.
- Template Generation: A script reads these meta files and uses them to produce Bicep templates—one template per valid combination—deploying receiver function apps into dedicated resource groups.
- Filters: Regex-like patterns in a filter.txt file exclude unwanted combos, keeping the matrix manageable.
- CI/CD Flow: Whenever we add a new runtime or region, a pull request updates the relevant meta file. Once merged, our pipeline regenerates Bicep and redeploys resources (these are idempotent updates).
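To make the generation flow concrete, here is a minimal sketch of the meta-file-to-Bicep step, assuming a Jinja2 template named receiver_app.bicep.j2 and a simple one-value-per-line meta-file layout; the real script, naming scheme, and file formats may differ.

```python
import itertools
import re
from pathlib import Path

from jinja2 import Environment, FileSystemLoader


def read_meta(path: str) -> list[str]:
    """Each non-empty line is one allowed value, e.g. 'python|3.11' or 'dotnet|8.0'."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]


def main() -> None:
    env = Environment(loader=FileSystemLoader("templates"))
    template = env.get_template("receiver_app.bicep.j2")        # hypothetical template name
    filters = [re.compile(p) for p in read_meta("filter.txt")]  # regex-like exclusion patterns

    out_dir = Path("generated")
    out_dir.mkdir(exist_ok=True)

    for os_, runtime, sku, scenario in itertools.product(
        read_meta("os.txt"),
        read_meta("runtime_version.txt"),
        read_meta("sku.txt"),
        read_meta("scenario.txt"),
    ):
        # The real naming scheme uses the short codes from the meta files;
        # here we just sanitize the raw values for brevity.
        name = "perf-" + "-".join(re.sub(r"\W+", "", v).lower() for v in (os_, runtime, sku, scenario))
        if any(f.search(name) for f in filters):
            continue  # excluded by filter.txt, keeps the matrix manageable
        (out_dir / f"{name}.bicep").write_text(
            template.render(os=os_, runtime=runtime, sku=sku, scenario=scenario, app_name=name)
        )


if __name__ == "__main__":
    main()
```

Because the output depends only on the meta files and the template, the pipeline can rerun this script on every merge and redeploy the generated Bicep idempotently.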
Test Infra Generator
- Deploys and configures the Scheduler Function App, SBLoadGen Durable Functions app, and the ALT webhook function.
- Similar CI/CD approach—merging changes triggers the creation (or update) of these infrastructure components.
Test Infra: Load Generation, Scheduling, and Reporting
Scheduler
The conductor of the whole operation. It runs every 5 minutes and loads test configurations (test_configs.json) from Blob Storage.
- The configuration includes which tests to run, at what time (e.g., “run at 13:45 daily”), and whether each test is dispatched to ALT (HTTP scenarios) or SBLoadGen (non-HTTP scenarios).
- Some tests run multiple times daily, others once a day; a scheduled downtime is built in for maintenance.
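A simplified sketch of the scheduler loop is shown below, using the Azure Functions Python v2 programming model. The config field names (run_at, type, webhook URLs) and the blob path are assumptions for illustration, not the real schema.

```python
import datetime
import json

import azure.functions as func
import requests

app = func.FunctionApp()


@app.timer_trigger(schedule="0 */5 * * * *", arg_name="timer")      # runs every 5 minutes
@app.blob_input(arg_name="cfg", path="config/test_configs.json",    # assumed container/blob path
                connection="AzureWebJobsStorage")
def scheduler(timer: func.TimerRequest, cfg: func.InputStream) -> None:
    now = datetime.datetime.utcnow().strftime("%H:%M")
    for test in json.loads(cfg.read()):
        if test["run_at"] != now:            # e.g. "13:45"; only fire tests due in this window
            continue
        if test["type"] == "http":
            # HTTP scenarios go through the ALT webhook function.
            requests.post(test["alt_webhook_url"], json={"test_id": test["id"]})
        else:
            # Non-HTTP scenarios go through the SBLoadGen webhook.
            requests.post(test["sbloadgen_webhook_url"], json={"test_id": test["id"]})
```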
HTTP Load Generator – Azure Load Testing (ALT)
We use Azure Functions to trigger Azure Load Testing (ALT) runs for HTTP-based scenarios. ALT is a production-grade load generation service that provides an easy-to-configure way to send load to different server endpoints using JMeter and Locust. We worked closely with the ALT team to optimize the JMeter scripts for different scenarios, and the service recently completed its second year.
We created an abstraction on top of ALT to provide a webhook-style way to start tests and to be notified when they finish. It is implemented as a custom function app that does the following:
- Initiate a test run using a predefined JMX file.
- Continuously poll until the test execution is complete.
- Retrieve the test results and transform them into the required format.
- Transmit the formatted results to the data aggregation system.
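The sketch below mirrors those four steps. The helper functions are hypothetical placeholders for the Azure Load Testing data-plane and Event Hubs calls the real app makes; only the control flow is meant to be representative.

```python
import time
from typing import Any


def start_test_run(test_id: str, jmx_name: str) -> str:
    """Placeholder: start an ALT test run from a predefined JMX file and return its run id."""
    raise NotImplementedError


def get_run_status(run_id: str) -> str:
    """Placeholder: return the ALT run status (e.g. 'EXECUTING', 'DONE', 'FAILED')."""
    raise NotImplementedError


def fetch_run_results(run_id: str) -> dict[str, Any]:
    """Placeholder: download the run's client-side statistics."""
    raise NotImplementedError


def publish_results(record: dict[str, Any]) -> None:
    """Placeholder: forward the normalized record to Event Hubs for ADX ingestion."""
    raise NotImplementedError


def run_http_test(test_id: str, jmx_name: str) -> None:
    run_id = start_test_run(test_id, jmx_name)                 # 1. initiate the test run

    while get_run_status(run_id) not in ("DONE", "FAILED"):    # 2. poll until complete
        time.sleep(30)

    raw = fetch_run_results(run_id)                            # 3. retrieve the results
    publish_results({                                          # 4. transform and transmit
        "testId": test_id,
        "rps": raw.get("throughput"),
        "errorRate": raw.get("errorPercentage"),
        "p95LatencyMs": raw.get("p95"),
    })
```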
Sample ALT Test Run:
A few more details on our ALT setup:
- 25 Runtime Controllers manage the test logic and concurrency.
- 40 Engines handle actual load execution, distributing test plans.
- 1,000 Clients total for 5-minute runs to measure throughput, error rates, and latency.
- Test Types:
- HelloWorld (GET request, to understand baseline of the system).
- HtmlParser (POST request sending HTML for parsing to simulate moderate CPU usage).
Service Bus Load Generator – SBLoadGen (Durable Functions)
For event-driven scenarios (e.g., Service Bus–based triggers), we built SBLoadGen. It’s a Durable Function that uses the fan-out pattern to distribute work across multiple workers—each responsible for sending a portion of the total load. In a typical run, we aim to generate around one million messages in under a minute to stress-test the system. We intentionally avoid a fan-in step—once messages are in-flight, the system defers to the receiver function apps to process and emit relevant telemetry.
Highlights:
- Generates ~1 million messages in under a minute.
- Durable Function apps are deployed regionally and are triggered via webhook.
- Implemented as a Python Function App using Model V2.
Note: We plan to open source this in the coming days.
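To make the fan-out pattern concrete, here is a minimal Durable Functions sketch in the Python v2 model. The queue name, connection setting, and worker/message counts are illustrative assumptions, not the production values.

```python
import os

import azure.durable_functions as df
import azure.functions as func
from azure.servicebus import ServiceBusClient, ServiceBusMessage

app = df.DFApp(http_auth_level=func.AuthLevel.FUNCTION)


@app.route(route="sbloadgen/start")
@app.durable_client_input(client_name="client")
async def start_load(req: func.HttpRequest, client) -> func.HttpResponse:
    # Webhook entry point: kicks off one load-generation run.
    instance_id = await client.start_new("orchestrate_load")
    return client.create_check_status_response(req, instance_id)


@app.orchestration_trigger(context_name="context")
def orchestrate_load(context: df.DurableOrchestrationContext):
    workers, per_worker = 50, 20_000          # 50 x 20k = ~1M messages (illustrative split)
    tasks = [context.call_activity("send_batch", per_worker) for _ in range(workers)]
    # Fan-out only: wait for the senders to finish, but do no aggregation (no fan-in step).
    yield context.task_all(tasks)


@app.activity_trigger(input_name="count")
def send_batch(count: int) -> int:
    sb = ServiceBusClient.from_connection_string(os.environ["SB_CONNECTION"])  # assumed app setting
    sent = 0
    chunk = 500                                # messages per Service Bus batch, kept small for the sketch
    with sb, sb.get_queue_sender("perf-queue") as sender:                      # assumed queue name
        while sent < count:
            n = min(chunk, count - sent)
            batch = sender.create_message_batch()
            for i in range(n):
                batch.add_message(ServiceBusMessage(f"msg-{sent + i}"))
            sender.send_messages(batch)
            sent += n
    return sent
```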
Receiver Function Apps (Test apps)
These are the apps that receive the generated load. They are deployed in different combinations and updated rarely. Each valid combination (region + OS + runtime + SKU + scenario) gets its own function app, receiving load from ALT or SBLoadGen; a condensed sketch of such an app follows the metrics list below.
- HTTP Scenarios:
- HelloWorld: No-op test to measure overhead of the system and baseline.
- HTML Parser: POST with an HTML document for parsing (simulates a small CPU load).
- Non-HTTP (Service Bus) Scenario:
- CSV-to-JSON plus blob storage operations, blending compute and I/O overhead.
- Collected Metrics:
- HTTP workloads: requests per second (RPS), success/error rates, and latency distributions.
- Non-HTTP (e.g., Service Bus) workloads: messages processed per second (MPPS) and success/error rates.
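For illustration, here is a condensed receiver app covering the three scenarios, written against the Python v2 programming model. The route names, queue name, connection settings, and blob path are assumptions; the real apps exist once per stack/SKU/region combination.

```python
import csv
import io
import json
from html.parser import HTMLParser

import azure.functions as func

app = func.FunctionApp()


@app.route(route="helloworld", methods=["GET"])
def hello_world(req: func.HttpRequest) -> func.HttpResponse:
    # Baseline no-op: measures platform overhead rather than app work.
    return func.HttpResponse("Hello, World!")


@app.route(route="htmlparser", methods=["POST"])
def html_parser(req: func.HttpRequest) -> func.HttpResponse:
    # Moderate CPU: parse the posted HTML document and count its tags.
    class TagCounter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.count = 0

        def handle_starttag(self, tag, attrs):
            self.count += 1

    parser = TagCounter()
    parser.feed(req.get_body().decode("utf-8", errors="ignore"))
    return func.HttpResponse(json.dumps({"tags": parser.count}), mimetype="application/json")


@app.service_bus_queue_trigger(arg_name="msg", queue_name="perf-queue",   # assumed queue name
                               connection="SB_CONNECTION")
@app.blob_output(arg_name="out", path="results/{rand-guid}.json",         # assumed output path
                 connection="AzureWebJobsStorage")
def csv_to_json(msg: func.ServiceBusMessage, out: func.Out[str]) -> None:
    # Mixed compute + I/O: convert a CSV payload to JSON and write it to Blob Storage.
    rows = list(csv.DictReader(io.StringIO(msg.get_body().decode("utf-8"))))
    out.set(json.dumps(rows))
```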
Data Aggregation & Dashboards
Capturing results at scale is just as important as generating load. PerfBench uses a modular data pipeline to reliably ingest and visualize metrics from both HTTP and Service Bus–based tests.
All test results flow through Event Hubs, which act as an intermediary between the test infrastructure and our analytics platform. The webhook function (used with ALT) and the SBLoadGen app both emit structured logs that are routed through Event Hub streams and ingested into dedicated Azure Data Explorer (ADX) tables.
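A minimal sketch of how a result record could be pushed to Event Hubs is shown below; the event hub name and record fields are assumptions rather than the real schema.

```python
import json
import os

from azure.eventhub import EventData, EventHubProducerClient


def publish_result(record: dict) -> None:
    # Emit one structured result record to the Event Hub that feeds ADX ingestion.
    producer = EventHubProducerClient.from_connection_string(
        os.environ["EVENTHUB_CONNECTION"], eventhub_name="perf-results"  # assumed names
    )
    with producer:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(record)))
        producer.send_batch(batch)


# Example record destined for the HTTPTestResults table.
publish_result({"testId": "helloworld-python311-flex2048", "rps": 1234.5,
                "errorRate": 0.0, "p95LatencyMs": 42})
```

On the ADX side, an Event Hub data connection with a JSON ingestion mapping routes these records into the corresponding table.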
We use three main tables in ADX:
- HTTPTestResults for test runs executed via Azure Load Testing.
- SBLoadGenRuns for recording message counts and timing data from Service Bus scenarios.
- SchedulerRuns to log when and how each test was initiated.
On top of this telemetry, we’ve built custom ADX dashboards that allow us to monitor trends in latency, throughput, and error rates over time. These dashboards provide clear, actionable views into system behavior across dozens of runtimes, regions, and SKUs.
Because our focus is on long-term trend analysis, rather than real-time anomaly detection, this batch-oriented approach works well and reduces operational complexity.
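The dashboards are backed by queries of roughly the shape below, shown here via the Python Kusto client; the cluster URI, database, and column names are assumptions for illustration.

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Authenticate with whatever the local tooling provides (Azure CLI here).
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication("https://<cluster>.kusto.windows.net")
client = KustoClient(kcsb)

# Daily throughput and tail-latency trend per runtime and SKU over the last 30 days.
query = """
HTTPTestResults
| where Timestamp > ago(30d)
| summarize avg(RPS), percentile(LatencyMs, 95) by bin(Timestamp, 1d), Runtime, Sku
| order by Timestamp asc
"""

response = client.execute("perfbench", query)   # database name is an assumption
for row in response.primary_results[0]:
    print(row)
```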
CI/CD Pipeline Integration
Continuous Updates:
- Once a new language version or scenario is added to runtime_version.txt or scenario.txt meta files, the pipeline regenerates Bicep and deploys new receiver apps.
- The Test Infra Generator also updates or redeploys the needed function apps (Scheduler, SBLoadGen, or ALT webhook) whenever logic changes.
Release Confidence:
- We run throughput tests on these new apps early and often, catching any performance regressions before shipping to customers.
Challenges & Lessons Learned
Designing and running this infrastructure hasn’t been easy, and we’ve learned a lot of valuable lessons along the way. Here are a few:
- Exploding Matrix – Handling every runtime, OS, SKU, region, scenario can lead to thousands of permutations. Meta files and a robust filter system help keep this under control, but it remains an ongoing effort.
- Cloud Transience – With ephemeral infrastructure, sometimes tests fail due to network hiccups or short-lived capacity constraints. We built in retries and redundancy to mitigate transient failures.
- Early Adoption – PerfBench was among the first heavy “customers” of the new Flex Consumption plan. At times, we had to wait for Bicep features or platform fixes—but it gave us great insight into the plan’s real-world performance.
- Maintenance & Cleanup – When certain stacks or SKUs near end-of-life, we have to decommission their resources—this also means regular grooming of meta files and filter rules.
Success Stories
- Proactive Regression Detection: PerfBench surfaced critical performance regressions early—often before they could impact customers. These insights enabled timely fixes and gave us confidence to move forward with the General Availability of Flex Consumption.
- Production-Level Confidence: By continuously running tests across live production regions, PerfBench provided a realistic view of system behavior under load. This allowed the team to fine-tune performance, eliminate bottlenecks, and achieve improvements measured in single-digit milliseconds.
- Influencing Product Evolution: As one of the first large-scale internal adopters of the Flex Consumption plan, PerfBench served as a rigorous validation tool. The feedback it generated played a direct role in shaping feature priorities and improving platform reliability—well before broader customer adoption.
Future Directions
- Open sourcing: We are in the process of open sourcing the relevant parts of PerfBench, such as SBLoadGen and the Bicep template generator.
- Production Synthetic Validation and Alerting: Adapting PerfBench’s resource generation approach for ongoing synthetic tests in production, ensuring real environments consistently meet performance SLOs. This will also open up alerting and monitoring scenarios across the production fleet.
- Expanding Trigger Coverage and Variations: Exploring additional triggers like Storage queues or Event Hub triggers to broaden test coverage. Testing different settings within the same scenario (e.g., larger payloads, concurrency changes).
Conclusion
PerfBench underscores our commitment to high-performance Azure Functions. By automating test app creation (via meta files and Bicep), orchestrating load (via ALT and SBLoadGen), and collecting data in ADX, we maintain a continuous pulse on throughput. This approach has already proven invaluable for Flex Consumption, and we’re excited to expand scenarios and triggers in the future.
For more details on Flex Consumption and other hosting plans, check out the Azure Functions Documentation. We hope the insights shared here spark ideas for your own large-scale performance testing needs — whether on Azure Functions or any other distributed cloud services.
Acknowledgements
We’d like to acknowledge the entire Functions Platform and Tooling teams for their foundational work in enabling this testing infrastructure. Special thanks to the Azure Load Testing (ALT) team for their continued support and collaboration. And finally, sincere appreciation to our leadership for making performance a first-class engineering priority across the stack.
Further Reading
- Azure Functions
- Azure Functions Flex Consumption Plan
- Azure Durable Functions
- Azure Functions Python Developer Reference Guide
- Azure Functions Performance Optimizer
- Example case study: Github and Azure Functions
- Azure Load Testing Overview
- Azure Data Explorer Dashboards
If you have any questions or want to share your own performance testing experiences, feel free to reach out in the comments!