July 18, 2025
In our journey building and operating Azure PaaS solutions under real-world pressure, we’ve encountered concurrency bugs, messaging quirks, and performance bottlenecks that only show up at scale.
When cloud applications scale, they stretch the limits and sometimes break in unexpected ways. This post distills those hard-earned lessons into practical guidance for developers and architects.
Each section pairs a real business scenario with a technical deep dive and Python-specific implementation tips, so you can not only understand the problem but also apply the fix.
Whether you’re handling resource leases (on blobs and other storage resources), tuning Service Bus throughput, or navigating async pitfalls in Azure Functions, these insights will help you build systems that stay reliable even when the load spikes.
1. Blob Lease Management: Avoiding Stuck Concurrency Locks (Applies to Other Resources Too)
The issue: Under heavy load, it’s possible for multiple function instances to attempt processing the same blob nearly simultaneously. This can happen due to duplicate events or timing quirks.
To serialize access, a common approach is to use a blob lease (essentially a lock on the blob), so that once a function instance acquires the lease, others will know not to process that blob.
However, improper handling of these leases can lead to a blob getting stuck in a locked state.
For example, if a function crashes or times out while holding an exclusive lease, and never releases it, no other instance can process that blob until the lease expires or is broken manually. In one real scenario, a bug left a blob with an infinite lease held, causing all subsequent attempts to process it to log “blob is already leased, skipping” and ultimately time out without ever freeing the blob.
Best practices for blob leases:
- Use finite lease durations: Avoid taking an infinite lease (lease_duration=-1). Instead, specify a reasonable lease duration, such as 15, 30, or 60 seconds. Finite leases automatically expire after that time if not renewed, so a stuck function will eventually release its lock. For instance, a 30-second lease provides a safety net – if a function dies or hangs, in 30 seconds another instance can automatically acquire the lease and retry. You can always renew the lease if your processing is still ongoing and needs more time, but never rely on an indefinite lock.
- Always release the lease in a finally block: Make absolutely sure that your code relinquishes the blob lease even if an error occurs. In Python, you might do:
lease = blob_client.acquire_lease(lease_duration=30)
try:
    ...  # ... process the blob data ...
finally:
    lease.release()
If using the asynchronous SDK (BlobClient.acquire_lease as an async call), the same idea applies with await and an async with context or try/finally. The key is that any path through your function (success, error, cancellation) will execute the finally clause to release the lock. In scenarios where code was releasing the lease only in normal cases, a function that hit an exception or timeout skipped the release step – leading to an orphaned lock.
A finally block guarantees execution.
- Handle cancellation and timeouts: Azure Functions on a Consumption Plan will cancel a function invocation if it runs longer than the configured timeout (the default is 5 minutes on Consumption, extendable to 10; Premium and Dedicated plans default to 30 minutes, and Dedicated plans can raise or disable the limit). When a cancellation happens, your function might be interrupted (in Python this could manifest as an asyncio.CancelledError or even a GeneratorExit in a generator). To be safe, catch cancellation exceptions around your lease logic. For example:
lease = None
try:
    lease = await blob_client.acquire_lease(lease_duration=30)  # acquire lease
    # do work
except asyncio.CancelledError:
    logging.error("Function was cancelled; releasing blob lease before exit.")
    # (The finally block will still run to release the lease)
    raise
finally:
    if lease is not None:
        await lease.release()
Even with the finally, this pattern logs that a cancellation happened and ensures any other necessary cleanup can occur. The raise re-throws the cancellation so that the function truly stops. This way, if the host stops your function (say it hit a timeout), you still let go of the lease promptly.
- Break orphaned leases if needed: In a healthy design, using finite leases and always releasing them should prevent any perpetual locks. It’s good to implement monitoring for peace of mind. For example, you could have an admin script or a monitoring alert for blobs that remain leased for longer than a certain threshold (meaning something went wrong). Azure Storage allows you to break a lease manually if needed (e.g., via Azure Portal, CLI, or SDK). Breaking a lease immediately frees the blob, at the cost that any process holding the lease will get an exception if it tries to use it further. This is a last resort, but useful if you ever encounter a stuck lease that isn’t clearing on its own.
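For illustration, here is a minimal sketch of such an admin/monitoring script using the azure-storage-blob SDK; the connection string, container, and blob name are placeholders:
from azure.storage.blob import BlobClient, BlobLeaseClient

# Placeholder names: adjust connection string, container, and blob to your setup.
blob_client = BlobClient.from_connection_string(conn_str, "results", "job-42.json")
props = blob_client.get_blob_properties()
if props.lease.state == "leased":
    # A break period of 0 frees the blob immediately; the previous holder
    # will get an error if it tries to keep using its old lease.
    BlobLeaseClient(blob_client).break_lease(lease_break_period=0)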
- Consider optimistic concurrency as an alternative: Blob leases are a pessimistic concurrency control – they prevent others from processing to ensure only one does the work. Another approach is optimistic concurrency, which doesn’t lock upfront but detects conflicts. For example, you could let multiple functions try to process the blob and use the blob’s ETag to allow only one to perform the final write or delete. In practice, one function would succeed in deleting the blob and the others would get a “precondition failed” error (or find the blob missing) and then skip further action. This avoids explicit locking altogether. The downside is those other functions might do redundant work (reading the blob) before discovering someone else already handled it. In high-volume systems, that extra churn can be wasteful, so leases (pessimistic) are fine as long as we manage them correctly. If your system is simpler or the cost of reprocessing is low, optimistic concurrency (via conditional delete or an atomic flag in a database) can be a simpler implementation that inherently handles duplicates. Either way, idempotency is crucial: design the processing so running it twice doesn’t harm (more on that later).
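As a rough sketch of the optimistic approach (the connection details and the process() step are hypothetical), a conditional delete on the ETag lets exactly one instance “win”:
import logging
from azure.core import MatchConditions
from azure.core.exceptions import ResourceModifiedError, ResourceNotFoundError
from azure.storage.blob import BlobClient

blob_client = BlobClient.from_connection_string(conn_str, "results", "job-42.json")
downloader = blob_client.download_blob()
etag = downloader.properties.etag          # the version we read
process(downloader.readall())              # hypothetical processing step

try:
    # Delete only if nobody modified or removed the blob since we read it.
    blob_client.delete_blob(etag=etag, match_condition=MatchConditions.IfNotModified)
except (ResourceModifiedError, ResourceNotFoundError):
    logging.info("Blob already handled by another instance; skipping.")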
Summary: Use blob leases to protect against double-processing, but make them time-bound and failure-safe. A 30-second lease and a reliable release pattern will prevent the “stuck locked blob” scenario. Test your function’s behavior by simulating exceptions or slow execution to ensure the lease is released even then. By doing so, you can trust that even if two messages arrive for the same blob, one will get the lease and the other will safely back off and eventually no-op, without any long-term blockage.
2. Asynchronous Function Patterns and Pitfalls
Python Azure Functions can be written with async def for improved performance on I/O-bound work, but doing so introduces some complexity in ensuring everything runs smoothly under heavy load. Let’s address a few subtopics: proper logging and exception handling in async code, avoiding common mistakes like unawaited coroutines, and managing external service clients in an async context. We’ll also discuss an advanced technique of running a separate event loop on another thread – useful in certain scenarios to avoid blocking issues.
2.1 Robust Logging and Exception Handling for Async Functions
When running many asynchronous operations, it’s easy to lose track of errors if they aren’t handled properly. In our high-load scenario, some functions would fail or time out and later we’d see warnings in the logs such as:
- “Task was destroyed but it is pending!” – Meaning an asyncio.Task was cancelled or garbage-collected without finishing, often due to the event loop shutting down while it was still pending.
- “Unclosed client session” or “Unclosed connector” – Indicating that an async HTTP session (maybe from Azure SDK internals) wasn’t closed, likely because the function exited before cleanup.
- “RuntimeError: coroutine ignored GeneratorExit” – This can happen if a coroutine didn’t handle the generator close properly when the function was torn down (often a side effect of abrupt cancellation).
These messages are confusing because they sometimes appear after the function that caused them has ended. For example, a function might time out and you see only a generic “Function timed out” error for that invocation, but then one tick later the worker logs a “Task was destroyed…” warning – which is essentially the runtime telling you “hey, there was still a coroutine running which I had to kill.” To avoid such scenarios:
- Always await your coroutines: This sounds obvious, but ensure that every async function you call is either awaited or scheduled to run to completion. Do not fire-and-forget tasks in an Azure Function unless you manage their lifetime separately (which is advanced and not typical). If you need parallelism within a function, use await asyncio.gather(task1, task2, …) so that the function scope waits for all subtasks to complete. Unawaited coroutines will either never run or run in background without your knowledge, leading to unpredictable behavior.
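For example, a minimal sketch of fanning out two I/O-bound coroutines (fetch_metadata and upload_result are hypothetical) while keeping the invocation alive until both finish:
import asyncio

async def process_item(item):
    # gather() awaits both subtasks; nothing is left running when the function returns
    metadata, upload_info = await asyncio.gather(
        fetch_metadata(item),   # hypothetical coroutine
        upload_result(item),    # hypothetical coroutine
    )
    return metadata, upload_info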
- Use try/except around the main logic to catch errors: If an exception is thrown in your async function and not handled, the Azure Functions host will log it and mark the execution as failed. But you may want to add your own logging or cleanup. For example:
async def main_function_logic(msg):
    try:
        ...  # ... process message ...
    except Exception as e:
        logging.exception(f"Error processing message {msg.id}: {e}")
        raise  # rethrow to ensure the function is marked failed
- This way you log the error with context (such as the message ID or blob name) before letting it propagate. Do not swallow exceptions without rethrowing – if you suppress the error, the runtime considers the invocation successful and the message won’t be retried. Catch exceptions mainly to log or perform specific actions, and then either handle them (if you can recover or deliberately treat the item as done) or rethrow to trigger a retry.
- Be mindful of function timeouts: If you are on a Dedicated plan, your functions might be configured with no timeout or a very high timeout. But a function hanging indefinitely is not good. If you know a particular operation might hang or take too long, consider adding your own timeout logic. Python’s asyncio.wait_for(coro, timeout) can wrap an await and throw a TimeoutError if it exceeds the limit. This can prevent your function from just freezing without feedback. For instance, if calling an external service, you might do await asyncio.wait_for(call_service(), 10) to limit it to 10 seconds. If it times out, handle that exception (log and raise, or maybe treat it as a failed item). It’s better to fail fast and let the message retry on a fresh instance than to tie up a worker for an extended period.
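A small sketch of that pattern, assuming a hypothetical call_service() coroutine:
import asyncio
import logging

async def call_with_deadline():
    try:
        # Cap the external call at 10 seconds so a hung dependency fails fast.
        return await asyncio.wait_for(call_service(), timeout=10)
    except asyncio.TimeoutError:
        logging.error("External service call exceeded 10s; failing fast so the message can retry.")
        raise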
- Log cancellations and closures: As noted, catch asyncio.CancelledError in important places so you know if a function was cancelled by the host. This helps distinguish between “the logic threw an error” vs “the platform decided to stop execution due to timeout or shutdown.” For example:
try:
    await do_work(...)
except asyncio.CancelledError:
    logging.warning("Function execution cancelled, likely due to timeout.")
    raise
That log will appear before the function ends, giving you a breadcrumb in case you see weird half-finished logs.
- Structured logging and correlation IDs: In a distributed, asynchronous system, it’s extremely helpful to include identifiers in your logs to trace a single transaction. Use a correlation ID if one is available (for instance, the Event Grid event ID, or the Service Bus message ID). You can pass it through to any downstream processing. Additionally, including the blob name or job ID in every log message related to that item means later you can filter logs easily. Azure Functions automatically logs an invocation ID per function call; you might integrate that if using Application Insights. If you are using an external logging system (like Elastic Stack, DataDog, etc.), attach those IDs as log fields. This way, even if an error surfaces later (“Task was destroyed”), you could check that it correlates with a prior function that timed out by matching an ID or timestamp.
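One lightweight way to do this with the standard logging module is a LoggerAdapter that prefixes every line with the correlation ID (the ID value below is illustrative):
import logging

class CorrelationAdapter(logging.LoggerAdapter):
    # Prefix every message with the correlation ID so one transaction is easy to filter.
    def process(self, msg, kwargs):
        return f"[{self.extra['correlation_id']}] {msg}", kwargs

log = CorrelationAdapter(logging.getLogger("pipeline"), {"correlation_id": "evt-123"})
log.info("Downloaded blob results/job-42.json")   # logs: [evt-123] Downloaded blob ...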
- Use asyncio debug in non-production testing: Python’s asyncio library has a debug mode (set PYTHONASYNCIODEBUG=1 environment variable) which makes it log warnings for common mistakes, like forgetting to await a coroutine or not closing an asyncio.Stream. You wouldn’t enable that in production due to verbosity, but in a staging environment you could run a stress test with it on. It might pinpoint, for example, where an “unclosed session” was created (it often prints a source traceback for the creation of the object that wasn’t closed). This can lead you to the culprit (maybe an HTTP client or Azure SDK client that wasn’t closed properly).
- Close async clients/sessions in finally: Similar to blob leases, if you open any network connections – e.g., an aiohttp.ClientSession or Azure SDK client – close it when done. The Azure SDK often provides an async context manager for clients or a close method. For example, if you use EventHubProducerClient in an async function, call await client.close() in a finally block after sending messages. If you use it in a sync context, use it in a with block. Unclosed connections not only cause warnings but can exhaust resources (sockets) over time. In our scenario, making sure the Blob storage client and Event Hub clients were closed on every invocation was important to avoid connection leaks or hitting connection limits.
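As an illustration (the namespace and hub name are placeholders), scoping an async Event Hubs producer to the invocation with context managers ensures the AMQP connection is closed even on error; note that async clients pair with the async credential from azure.identity.aio:
from azure.eventhub import EventData
from azure.eventhub.aio import EventHubProducerClient
from azure.identity.aio import DefaultAzureCredential

async def publish(payload: bytes):
    credential = DefaultAzureCredential()
    producer = EventHubProducerClient(
        fully_qualified_namespace="myns.servicebus.windows.net",  # placeholder
        eventhub_name="ml-results",                               # placeholder
        credential=credential,
    )
    async with producer:                      # closes the client even if the send fails
        batch = await producer.create_batch()
        batch.add(EventData(payload))
        await producer.send_batch(batch)
    await credential.close()                  # async credentials should be closed too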
- Design for retries (idempotency): Since any failure triggers a retry from the Service Bus queue, your function will likely run again for the same item. It’s essential that repeating the work doesn’t produce invalid results or errors. This is called idempotent processing. For instance, if the first attempt already wrote an entry to the database, the second attempt might try to write a duplicate. To handle this, you could check if the entry exists before writing or use an “upsert” operation. Or if the first attempt deleted the blob, the second attempt will find no blob – that shouldn’t cause a crash, it should be handled gracefully (e.g., log “blob already processed, skipping”). We want the retry to effectively become a no-op if the previous attempt actually succeeded in the critical parts. In practice, implement checks like “if blob not found, skip” or “if record already in DB, skip writing” to make the function robust against multiple executions. Idempotency combined with reliable retries ensures eventual success without harm from duplicates.
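A rough sketch of such checks (parse_result, the Mongo names, and the key field are hypothetical): a missing blob is treated as “already processed”, and an upsert keyed on the job ID makes a retried write overwrite rather than duplicate:
import logging
from azure.core.exceptions import ResourceNotFoundError
from pymongo import MongoClient

def process_result(blob_client, job_id, mongo_uri):
    try:
        payload = blob_client.download_blob().readall()
    except ResourceNotFoundError:
        logging.info("Blob for job %s already processed; skipping.", job_id)
        return
    doc = parse_result(payload)   # hypothetical parsing step
    collection = MongoClient(mongo_uri)["ml"]["results"]
    collection.update_one({"job_id": job_id}, {"$set": doc}, upsert=True)  # idempotent write
    blob_client.delete_blob()     # safe to repeat: a retry hits the ResourceNotFoundError path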
2.2 Managing Azure SDK Clients and Connections
Another question that arises in Azure Functions development is how to manage connections to external services such as Azure Storage, Event Hubs, Service Bus, or databases. In a high-throughput scenario, you might call these services very frequently. Should you reuse client instances between function calls to save overhead, or create new ones each time to avoid concurrency issues? In .NET, the advice is usually to reuse, but in Python things differ due to the runtime model.
Per-invocation client instances vs. singletons:
- It is generally safer in Python to create new clients on each function invocation, rather than sharing a global client, especially for asynchronous clients. The reason is that Python’s Azure SDK clients (like BlobServiceClient, EventHubProducerClient, etc.) are not explicitly documented as thread-safe in all cases. Azure Functions might run multiple Python function invocations concurrently (different processes, or even threads in some scenarios). A shared client could end up being used simultaneously from two invocations, which can lead to race conditions or overlapping I/O. For example, one invocation might close the client while another is in the middle of sending data.
- The overhead of creating a new client object in Python is usually not huge – and can be mitigated. Connection pooling: Many Azure clients use connection pooling under the hood. If you create a new client but the underlying HTTP connection (for storage) or AMQP connection (for Event Hub/Service Bus) was recently opened, it might reuse the existing socket from the OS pool or keep-alive. That said, opening an Event Hub connection repeatedly for each message is not ideal. But if each function handles a batch of events (as you likely batch multiple events into one Event Hub send), the cost is amortized.
- Reuse credentials to avoid re-authenticating: One optimization is to instantiate your credential (like DefaultAzureCredential) once and reuse it. The credential object will cache tokens for its lifetime. For instance, at the global scope of your function file, do:
from azure.identity import DefaultAzureCredential

CREDENTIAL = DefaultAzureCredential()
- Then inside your function, use this CREDENTIAL when creating clients, instead of making a new DefaultAzureCredential() each time. This way, you don’t perform a fresh Azure AD authentication handshake on every invocation. The token can be reused across invocations if valid, or the credential will handle refreshing it. This strikes a good balance: new service client, but shared token source.
- Example (Service Bus): If your function is triggered by Service Bus, you might still need a client to send messages out or to another queue. You could do:
from azure.servicebus.aio import ServiceBusClient

# ... inside the function:
# note: the aio client expects the async credential variant
# (azure.identity.aio.DefaultAzureCredential)
client = ServiceBusClient(namespace, credential=CREDENTIAL)
async with client:
    sender = client.get_queue_sender("myqueue")
    async with sender:
        await sender.send_messages(messages)
# (Using context managers ensures the connection is closed after use)
- This creates a short-lived client for sending. Alternatively, you could keep a global ServiceBusClient and reuse it, but you must ensure it’s not used concurrently from multiple tasks. Given that Azure Functions may have multiple instances or processes, a global might end up being one per process anyway (which might be okay). It’s a bit complex to guarantee only one usage at a time.
- What about performance? You might worry that making new clients is slower. In practice, if your function is doing network I/O, the difference is usually negligible compared to overall work. If you ever find in profiling that a significant fraction of time is spent creating clients or establishing connections, you can adjust by either increasing the amount of work per function invocation (process multiple items in one call if possible) or tuning the environment (e.g., using a premium plan which might keep instances warm and reuse connections at OS level). But correctness comes first: eliminate any chance of thread contention or leftover state by isolating invocations.
- Don’t use module-level singletons unless you manage concurrency: A compromise advanced approach would be to have a global client and use a lock to ensure only one function uses it at a time. However, this effectively serializes those calls, negating much of the benefit of parallel functions. And if functions are running in separate processes (common for Python, where the Functions host might spawn multiple worker processes for parallelism), each process will have its own global anyway. So you gain little, except perhaps not having to reinitialize some state for each invocation.
- Compare to .NET Guidance: In .NET, e.g., a static HttpClient or a static Cosmos DB client is recommended to avoid socket exhaustion and improve performance. In Python, the default behavior of the SDK and interpreter (using separate processes for heavy load) means new objects per invocation are fine and often simpler. The Azure Python SDK team has indicated that their clients can be thread-safe in many cases, but due to Python’s GIL and multi-process scaling, the typical pattern isn’t to share a client widely as you might in .NET.
Bottom line: It’s acceptable and often advisable to create and dispose of SDK clients on each function call in Python. Keep usage of each client short-lived and wrap them in context managers or close them after use. The code is simpler to reason about (each function call is self-contained) and you avoid potential thread safety issues. If you notice any startup overhead, use techniques like caching the credential, or even caching things like a serialized configuration, but don’t cache live network connections without a clear need. If in the future the Python Functions runtime allows a better way to share state across invocations (like a persistent warm state or an injected singleton), you can reevaluate. Until then, favor simplicity and correctness.
2.3 Running Async Code in a Separate Thread (Advanced Scenario)
In rare cases, you might encounter a situation where you want to run an asynchronous workflow without blocking the main event loop of the Azure Functions host or to isolate it from the function’s own cancellation. One way to do this is to start a new event loop in a background thread. This is an advanced technique and most Azure Functions won’t need it, but it can be useful if you have a long-running async task that you don’t want to tie up the main function’s loop (or if you need to work around limitations in the Python Functions runtime’s event loop).
Consider an example: you have an Azure Function that, when triggered, needs to perform some asynchronous steps that could run independently. We can create a separate thread and run an event loop there to execute those steps, while the main thread waits for them to finish. Here’s a code snippet illustrating this pattern, along with logging to show which thread is doing what:
import azure.functions as func
import asyncio, threading, logging

# An example async task we want to run in isolation
async def async_task():
    await asyncio.sleep(2)
    return f"Async task completed in thread {threading.get_ident()}"

# Function to launch an event loop in a new thread and run the async_task
def run_in_separate_loop(result_holder):
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)  # set the new loop as current in this thread
    logging.info(f"[Child Thread] Event loop running in thread {threading.get_ident()}")
    result_holder["result"] = loop.run_until_complete(async_task())
    loop.close()

def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info(f"[Main Thread] Function running in thread {threading.get_ident()}")
    result_holder = {}
    # Start the child thread to run the async task
    thread = threading.Thread(target=run_in_separate_loop, args=(result_holder,))
    thread.start()
    thread.join()  # wait for the thread to finish
    return func.HttpResponse(f"Result: {result_holder['result']}")
When this function runs, the logs will show something like:
[Main Thread] Function running in thread 139956947121920
[Child Thread] Event loop running in thread 139956938729216
Notice that the thread IDs differ, proving the async task ran on the new thread. The main thread spawned the child, waited for it to finish (join), then continued. Essentially, we offloaded the async_task to a separate event loop so that even if the main thread’s event loop is doing other things, this work can happen in parallel (or at least independently).
When would you use this? One use case might be if the Azure Functions host’s event loop has some limitations or if you want to guarantee an async cleanup runs to completion even if the main function is cancelled. By running it in a separate thread, even if the main thread is terminated, the child thread might still get a chance to finish (though in Functions, if the process exits, threads will terminate too). Another scenario is integrating with code that expects to be run in the main thread’s event loop – rather than blocking it, you isolate an operation. This pattern is somewhat low-level; a simpler approach for long work is often to break it into smaller messages or use Durable Functions.
Caution: If you choose to do this, be very mindful of thread safety. You should pass any data needed into the thread (here we used a result_holder dict to get the result out). Don’t try to directly manipulate Azure Function’s runtime objects across threads. Also, spawning many threads for many concurrent executions could overwhelm the system – Azure Functions has a concurrency model managed by the host, and if you circumvent it by making threads, you could end up with too many threads. In short: use this technique sparingly and only if truly needed.
For most cases, sticking to the single event loop (and awaiting tasks properly) is sufficient and simpler. We included this example because it was part of our exploration for handling certain async tasks reliably. It demonstrates Python’s flexibility in Azure Functions to handle advanced concurrency patterns when required.
3. Event Grid and Service Bus: Configuring Reliable Messaging
The messaging backbone of our scenario is Azure Event Grid (which emits events from the ML jobs) and Azure Service Bus (which queues the events for function processing). These components are designed to be reliable and scalable, but they have knobs and behaviors to be aware of when operating at scale.
3.1 Handling Duplicate Events with Idempotency and Deduplication
At-least-once delivery: Azure Event Grid guarantees that it will deliver events to subscribers at least once. In rare cases, this can result in the same event being delivered twice or more. For example, out of 3,000 events, you might observe a handful of duplicates delivered. This typically happens if the subscriber (here, the Service Bus queue) doesn’t confirm receipt quickly enough or there’s a transient issue; Event Grid will retry the delivery and the subscriber ends up with a duplicate message.
In practice, this meant our blob-processing function occasionally got triggered twice for the same blob. If you have no safeguards, this could cause double-processing of data (which might corrupt results or at least waste effort).
Best practices for duplicates:
- Enable duplicate detection on Service Bus: Service Bus queues have a feature called duplicate detection which can automatically discard messages with duplicate IDs that arrive within a time window. To use this, you assign a unique Message ID to each message when it’s sent to the queue. In our case, if we ensure the Event Grid event’s unique ID is used as the Service Bus message ID, then any duplicate event (with the same ID) that Event Grid redelivers will be caught by Service Bus and dropped – the function will only see one message. Duplicate detection is configurable on the queue (you specify a time window, e.g., 5 minutes or 10 minutes, which is the span in which duplicate IDs are checked). This is a server-side de-duplication that greatly simplifies things. When setting up the Event Grid -> Service Bus integration, check if you can map the event ID to the message ID. If you’re using Infrastructure-as-Code (ARM, Bicep, Terraform), there are properties on the Event Grid subscription for Service Bus topics/queues to map certain fields.
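If your own code (rather than the built-in Event Grid subscription delivery) forwards events to the queue, the mapping is just a matter of setting message_id when sending; this sketch reuses the shared CREDENTIAL from earlier and uses placeholder names:
import json
from azure.servicebus import ServiceBusClient, ServiceBusMessage

def forward_event(event: dict, namespace: str):
    msg = ServiceBusMessage(
        body=json.dumps(event["data"]),
        message_id=event["id"],   # Event Grid event id becomes the Service Bus MessageId
    )
    with ServiceBusClient(namespace, credential=CREDENTIAL) as client:
        with client.get_queue_sender("ml-results") as sender:   # placeholder queue name
            sender.send_messages(msg)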
- Make the function idempotent (again): Even with de-duplication, never assume duplicates won’t happen. It’s wise to design the function to handle a repeat message gracefully. We discussed idempotency above: if the blob’s already processed, the second attempt should detect that (maybe the blob is gone or a status flag is set) and simply exit without doing the whole work again. In our blob example, since the first run likely deletes or moves the blob at the end, the second run can check at start “does the blob exist?” – if not, log a message and short-circuit. Idempotency also covers scenarios beyond exact duplicates, such as retry after a partial failure.
- Contextual deduping: Sometimes, events may be similar but not byte-for-byte duplicates (like two events for the same blob from different triggers). Pure ID-based de-dup might not catch those. Depending on your scenario, you might implement a custom deduping logic, like keeping track of “seen” items in a short-term store (Cache, Redis, etc.). However, this is usually overkill if you have a handle on the main sources of duplicates (Event Grid retries). We mention it for completeness – in extreme cases of complex event flows, a custom dedupe layer might be needed.
Summary: Rely on Service Bus duplicate detection to automatically eliminate most duplicate events, and build your functions to be tolerant of processing the same item more than once. The combination ensures that a glitch in event delivery doesn’t result in incorrect outcomes or wasted resources.
3.2 Tuning Service Bus for High Throughput and Reliability
Azure Service Bus is a fully managed message queue and is central to connecting your workflow steps. Here are some best practices and considerations for using Service Bus in this context:
- Prefetch messages to boost throughput: By default, a Service Bus trigger in Azure Functions will fetch and process messages one by one. If your processing is very quick and you want to increase message throughput per function instance, you can enable prefetch so that each function instance retrieves a batch of messages in advance. This reduces latency between finishing one message and starting the next. In your function’s host.json you can configure the prefetchCount for the Service Bus extension. For example, setting prefetchCount to 10 means the SDK will grab up to 10 messages and keep them in memory, feeding your function one after the other without going back to the server each time.
Be cautious: if your function can actually process messages in parallel (e.g., you spawn tasks for them), you might want to handle multiple at once, but Python functions typically process one at a time unless you explicitly write them to do otherwise. Prefetching is still useful to keep the pipeline busy.
- Max concurrent calls per instance: By design, the Python Functions worker processes messages one at a time (to avoid GIL limitations). If you want more parallelism on a single machine, the Functions runtime allows multiple worker processes (set FUNCTIONS_WORKER_PROCESS_COUNT) – e.g., 4 processes on a 4-core machine, which could handle 4 messages at once, each in its own process. This is effectively what scaling out to multiple instances does, but within one VM. You can experiment with this if you find that single-process throughput is not enough and scale-out isn’t happening quickly. However, increasing worker processes will also multiply memory usage accordingly.
- Adjusting Max Delivery Count: We mentioned that the queue retries messages a certain number of times (the MaxDeliveryCount property, default 10). If a message fails that many times, it goes to the Dead Letter Queue (DLQ). Make sure this value is appropriate – 10 is fine in many cases. If failures are rarely transient and usually indicate a bad message, you might lower it to reduce repeated attempts. Conversely, if failures are often due to temporary issues (like a database throttle) and usually succeed on, say, the 8th try, you could leave it or even raise it. Monitoring your DLQ will inform this; if you see messages landing in DLQ that could have succeeded with more retries, consider adjusting or implementing a secondary retry mechanism for DLQ items.
- Dead-letter handling: Any message that ends up in the DLQ implies it was never successfully processed. You should have a plan for DLQ messages. This could be an automated Function that monitors the DLQ and alerts someone, or even reprocesses them after a delay (though careful: if it failed 10 times, automatic reprocessing without changes might just repeat the failure). Often, DLQ items need manual inspection to determine why they failed (e.g., data format issues, unhandled exceptions). Make it easy to examine them – for instance, you can write a script or use Azure Service Bus Explorer tools to view DLQ messages. At minimum, set up an alert in Azure Monitor if the DLQ length grows above 0 or above a small threshold, so you know something got stuck.
- Complete vs. Abandon messages: The Service Bus trigger will by default auto-complete the message if your function finishes without errors. If an error is thrown, it will automatically abandon (release) the message, so it becomes visible again for retry. This is typically what you want. If you need more fine-grained control (say you want to defer or dead-letter a message manually under certain conditions), you can use the Service Bus SDK to do so explicitly. For example, you could catch a particular exception and dead-letter the message through the receiver (in the current Python SDK, receiver.dead_letter_message(msg, reason=…, error_description=…)) to send it directly to the DLQ without further retries, if you know it’s a poison message that will never succeed, as sketched below. Use such patterns carefully. In most cases, letting it retry and then DLQ is simplest. But for scenario-specific errors (like “schema invalid”), immediate dead-lettering might spare the system extra load.
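A hedged sketch of explicit settlement with the azure-servicebus v7 receiver (SchemaError, process(), and the names are illustrative); messages that can never succeed go straight to the DLQ instead of burning retries:
from azure.servicebus import ServiceBusClient

with ServiceBusClient(namespace, credential=CREDENTIAL) as client:
    with client.get_queue_receiver("ml-results") as receiver:
        for msg in receiver.receive_messages(max_message_count=10, max_wait_time=5):
            try:
                process(msg)                      # hypothetical handler
                receiver.complete_message(msg)
            except SchemaError:                   # hypothetical "poison message" error
                receiver.dead_letter_message(
                    msg,
                    reason="schema-invalid",
                    error_description="Payload failed validation; not retryable.",
                )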
- Ordering and sessions: By default, messages in a Service Bus queue may be processed out of order relative to how they were sent, especially with multiple consumers. If ordering is important, Service Bus offers Sessions – each session ID defines an ordered FIFO stream. The function configuration can ensure only one session is processed at a time per instance. In our scenario, if each machine learning job result is independent, ordering isn’t a concern. However, if there was a sequence (job start -> job end) and they came as separate messages, you might want them in order. You could then, for example, use the job ID as a Session ID, and Service Bus guarantees that for each job ID the messages are processed sequentially. This is an advanced use of Service Bus but worth knowing. It can simplify handling of stateful sequences at the cost of a bit of throughput (since it serializes those sequences).
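For example (a sketch, with job_id and the payloads as placeholders), stamping related messages with the same session_id gives per-job FIFO ordering, provided the queue and the trigger have sessions enabled:
from azure.servicebus import ServiceBusMessage

start_msg = ServiceBusMessage(start_payload, session_id=job_id)
end_msg = ServiceBusMessage(end_payload, session_id=job_id)
sender.send_messages([start_msg, end_msg])   # delivered in order within this session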
- Scaling out the function app: Ensure your Azure Function app’s plan is capable of scaling to meet the queue demand. If using a Dedicated App Service Plan, configure autoscale rules (based on CPU or queue length) to add more instances during the nightly peak. If using Consumption or Premium plan, the platform scales automatically, but Premium allows you to have a minimum pre-warmed instance count which can help with spikes. Monitor the “Queue Length” and “Active Message Count” on your Service Bus. Ideally, after the batch of jobs completes, the queue should drain to zero reasonably fast. If it lags behind significantly, that’s a sign you need either more function instances or each function needs to handle messages faster (or possibly the bottleneck is downstream like the database). Scale-out is usually the solution – add instances until the throughput matches incoming rate. The nice part of the queue is it will just buffer until consumers catch up, without losing messages.
- Timeouts in processing vs. abandon: If your function takes too long (exceeding the Service Bus message’s lock timeout), Service Bus might give the message to another consumer even while the first is still working – leading to parallel double-processing. By default, the Functions runtime renews the lock periodically as long as the function is running, so this normally doesn’t happen unless something blocks the renewal. But be aware: if you have extremely long processing per message, you might need to increase the message lock duration setting on the queue, or use automatic lock renewal (the SDK can do that). In our design, we tried to keep each message processing fairly quick (a few seconds). Only a misbehaving function would hit the default one-minute lock and require renewal. Ensure that the Service Bus trigger is configured to automatically renew locks (Functions typically does this for you if processing goes beyond a certain threshold). If you do manual receives with the SDK, consider receiver.renew_message_lock or an AutoLockRenewer, as in the sketch below.
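A sketch of manual receiving with an AutoLockRenewer keeping the lock alive while a long handler runs (process() and the names are illustrative):
from azure.servicebus import AutoLockRenewer, ServiceBusClient

renewer = AutoLockRenewer()
with ServiceBusClient(namespace, credential=CREDENTIAL) as client:
    with client.get_queue_receiver("ml-results") as receiver:
        for msg in receiver.receive_messages(max_message_count=1, max_wait_time=5):
            # Keep renewing this message's lock for up to 10 minutes while we work.
            renewer.register(receiver, msg, max_lock_renewal_duration=600)
            process(msg)                     # hypothetical long-running handler
            receiver.complete_message(msg)
renewer.close()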
In summary, Service Bus is your buffer and control point – use its features to make your system resilient:
- Turn on duplicate detection to filter out repeats.
- Let it retry messages on transient errors, and monitor the dead-letter queue for ones that never succeed.
- Scale out your processing power to handle peaks in queue length.
- Tweak concurrency and prefetch if you need to increase throughput on each instance (but test carefully, especially with Python’s concurrency model).
- Use sessions or multiple queues if you need to shard or order work.
- Keep an eye on any timeouts or lock renewal issues for long processes (and adjust configurations accordingly).
These settings will help ensure your functions can keep up with the event firehose without dropping messages or processing things twice.
4. Architectural and Operational Improvements for Scale
Beyond code and configuration tweaks, it’s important to consider the overall architecture and how the components work together under stress. Here we discuss some architectural best practices and operational considerations to ensure the solution scales and remains reliable long-term.
4.1 Scaling the Azure Functions Environment
- Choose the right hosting plan: Azure Functions offers multiple hosting options. For sporadic workloads, the Consumption Plan is cost-effective and auto-scales, but it has startup latency (cold starts) and a maximum execution time of 10 minutes per invocation (Premium and Dedicated plans allow much longer, or no limit at all). For our scenario with a daily heavy batch, a Premium Functions Plan or a dedicated App Service Plan is more suitable. These avoid cold start issues by keeping function instances warm, and they can be configured for longer or no timeouts. We used a dedicated App Service Plan, which essentially runs the functions on reserved VMs – good for consistent high volume. The Premium plan is an alternative that can scale out more quickly, though it keeps at least one instance warm rather than scaling to zero like Consumption. The key is to ensure the plan has enough memory and CPU. High memory is useful if each function uses a lot (e.g., loading ML models into memory). A high CPU count is useful if you plan to increase concurrency with multiple processes or threads. Monitor your metrics: if memory is maxing out, consider a higher SKU; if CPU is maxing out, scale out or up.
- Pre-scale for known events: If you know the spike happens at 3:00 AM daily, you can proactively scale or at least set rules to scale out at 2:50 AM. In an App Service Plan, you can use scheduled autoscale: e.g., “at 2:45 AM, set instance count to 5” then “at 4:00 AM, scale back down to 2”. This ensures capacity is ready when the avalanche of events hits. If using Premium, you can set a minimum instance count around that time or even programmatically trigger a warm-up (some use a dummy trigger to activate instances before load). If using Consumption, there’s less control, but Microsoft does automatically scale – just sometimes there’s a lag of a minute or two to realize “oh, need more instances.”
- Partition workloads if necessary: If one function or one queue becomes a bottleneck, consider splitting the work. For example, if you had two distinct types of jobs or datasets, you might use two separate queues and functions, allowing them to scale independently. This reduces contention and makes each pipeline narrower. In our scenario, if an average of 3000 events is fine on one queue, we don’t need to split. But if it grew to 30,000, we might allocate, say, 3 queues each handling 10,000 (maybe partitioned by region or by job type). This is more of a sharding strategy to scale horizontally beyond a single queue’s throughput or a single function’s processing rate.
- Use multiple function apps for isolation: There’s nothing wrong with putting multiple related functions in the same Function App – it’s convenient and they can share resources. However, sometimes isolating critical pieces into separate Function Apps is beneficial. For instance, the function that triggers on Event Grid (if any) could be in one app, the function that processes from Service Bus in another, and maybe any timer-triggered cleanup function in yet another. Separate apps mean one can be restarted, deployed, or scaled without affecting the others. It also means they don’t compete for the same instance’s CPU/memory. On the flip side, separate apps can’t share in-memory caches, and deployment is a bit more complex (coordinating versions). Evaluate your scenario: if the functions are tightly related and share code, keep them together; if not, or if one is causing issues for others, splitting can improve reliability.
- Durable Functions vs. event choreography: We have an event-driven (choreography) architecture: each component reacts to an event and triggers the next. Another approach would be to use Azure Durable Functions to orchestrate the workflow in code (or Logic Apps for a low-code approach). Durable Functions (DF) can manage the fan-out of tasks and their results using an orchestrator function. For example, when an ML job is initiated, a DF orchestration could wait for a completion event (using an external event trigger) then call an activity to process the result, etc., all within one logical workflow instance. DFs provide automatic retry, state management, and a unified view of the process. However, they also have overhead and some limits on scale (managing 3000 parallel orchestrations is possible but requires careful planning for storage and throughput in DF’s backend). In our case, we chose a simpler decoupled approach – which is perfectly valid and often more transparent (each queue and function does one thing). There’s no single “right” answer; just know that if coordination logic becomes complex, DFs or Logic Apps are tools to consider. If what you have now works and is maintainable, that’s great.
- Keep functions focused and fast: A general scaling guideline is keep each function’s execution time and scope minimal. It’s better to have 100 short-lived function executions than 10 that each do 10 times more work – mainly because shorter tasks parallelize and recover from failure more gracefully. If you find a function is doing too much (e.g., processing 100 messages in one go and taking 10 minutes), consider breaking the work (maybe let each message be processed individually, or break the blob into smaller pieces processed by separate events). This might increase total function count but it improves throughput and fault tolerance (one failure affects a smaller chunk of work).
4.2 Ensuring Data Consistency and Integrity
Scaling isn’t just about speed; it’s also about making sure data remains correct and consistent in all the moving parts:
- Don’t delete your only copy too soon: If your pipeline moves data from blob to database, think about what happens if the final step fails. In our scenario, the function reads a blob (result), then perhaps sends it to another service or inserts into MongoDB, and then deletes the blob. If the MongoDB insert fails after the blob is gone, you’ve lost that data (unless you have backups elsewhere). A safer design is to only delete the blob after confirming the data is safely in the next store. One way is to perform the database insert first, then delete the blob at the very end of a successful run. But if the message is retried, that might insert again (unless you handle duplicates on DB side). Another approach is to soften deletion: e.g., add a metadata tag “processed=true” or move the blob to a “processed” folder rather than hard delete. Then have a separate cleanup job that deletes all blobs marked processed after X days. This way, if something goes wrong, you can reprocess from the blob. This is essentially making your system replayable. Of course, it depends on how critical the data is and how much storage overhead you can handle.
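A small sketch of the “soften deletion” variant: after the database write succeeds, tag the blob instead of deleting it, and let a scheduled cleanup purge old tagged blobs later:
from datetime import datetime, timezone

blob_client.set_blob_metadata({
    "processed": "true",
    "processed_at": datetime.now(timezone.utc).isoformat(),
})
# A separate timer-triggered function can list blobs and delete those whose
# metadata shows they were processed more than X days ago.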
- Consistency between steps: Ensure each event or message carries enough information to perform the step independently. This avoids tight coupling. For example, the Service Bus message could contain the blob URL and maybe a job ID, but the function shouldn’t need to, say, query some external state to know what to do – it should be fully contained. This appears to be the case in our design (the message includes the needed identifiers). Such self-contained messages help consistency because they reduce the chance of disagreements between components about what’s being processed.
- Transactional updates when possible: When writing to multiple outputs (perhaps writing to database and sending another event), consider what happens if one succeeds and the other fails. If the system cannot easily roll back, you might have partial results. In these cases, you might implement a two-phase approach: e.g., write to the database first, then in the same function invocation send an event to confirm success which triggers a cleanup. Or use the database write itself as a trigger for the next stage (so you don’t send an event until DB is done). It might not be feasible to get perfect ACID transactions across all parts, but aim for at least once with idempotency as the general principle, which we’ve done.
- Monitor and reconcile: In a complex pipeline, it can be very useful to have a reconciliation process that double-checks things. For example, after the nightly run, run a small job that counts how many results were expected vs. how many ended up in the database. If there’s a mismatch, at least raise an alert (or automatically try to find which ones are missing and reprocess them if possible). This kind of backstop catches any silent failures that might slip through the cracks. It’s akin to an accounting system making sure debits equal credits in the end.
- Preserve order when needed: As mentioned, if certain events must be processed in sequence (like a workflow that has stages), you might need to enforce ordering. Strategies include using a single consumer (which we did with leases for a blob – effectively ensuring one at a time), using Service Bus sessions (if using SB for ordering), or simply building logic to ignore out-of-order events (like if an “end” event arrives before “start”, hold it until start is done). In our case, each event was independent so we didn’t need ordering.
4.3 Monitoring, Alerting, and Security
Monitoring and alerting:
A scalable system isn’t complete without good monitoring. Azure provides Application Insights for rich telemetry. By integrating App Insights (or another logging/monitoring tool of your choice), you can track metrics like function execution count, failure rate, average processing time, memory/CPU usage of the function app, queue lengths, etc. Set up alerts for key conditions:
- Function failures: e.g., alert if more than X functions fail in an hour, or if any single failure has an exception type that’s unexpected.
- Queue backlog: alert if the Service Bus queue length stays above, say, 1000 for more than 10 minutes (meaning the system is falling behind).
- Dead letter: alert if any message lands in the dead-letter queue.
- Event Grid delivery failures: Event Grid has metrics for failed deliveries (if it can’t deliver to Service Bus at all). Rare, but set an alert for it if possible, so you know if events aren’t flowing.
- Infrastructure metrics: CPU, memory on the function app instances – if consistently high, maybe time to scale out or up.
Use dashboards to visualize these during the run. For example, a dashboard showing “incoming events vs processed events vs queue length over time” can immediately tell you if the system is coping or lagging.
Security considerations:
Scalability shouldn’t come at the cost of security. Some best practices:
- Managed identities and least privilege: Use an Azure Managed Identity for the Function App to access Azure resources (like Storage, Service Bus). This avoids putting connection strings or keys in config. It looks like we already use DefaultAzureCredential, which likely uses a managed identity to access those services. Ensure the identity has just the needed roles: e.g., read/write to the specific blob container, send/receive on the specific Service Bus, etc. If using a third-party DB (MongoDB), store its credentials securely (Key Vault or Azure Config with encryption) and not in plain text in code.
- Network security: If possible, enable Virtual Network integration for the function app and use private endpoints for Storage, Service Bus, etc. This means the function communicates with those services over Azure’s internal network, not via public endpoints, reducing exposure. If that’s too complex, at least ensure your service keys are not exposed and consider using firewall rules to restrict who can access the Storage or Service Bus (e.g., limit to Azure datacenter IPs or your VNet).
- Data encryption: Azure Storage and Service Bus encrypt data at rest by default. If you have specific encryption requirements (like customer-managed keys), implement them. For MongoDB, use TLS for the connection. If you store sensitive data in blobs, consider an additional encryption at the application level (though probably not needed for this case of ML results).
- Secret management: Use Azure Key Vault for any secrets. If your function uses, say, a connection string for MongoDB, pull it from Key Vault at startup using the managed identity. This way, rotations of secrets don’t require code changes, and you minimize the risk of leakage.
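For example, a sketch of pulling the MongoDB connection string at startup with the managed identity (the vault URL and secret name are placeholders):
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()
secrets = SecretClient(vault_url="https://my-pipeline-kv.vault.azure.net", credential=credential)
MONGO_CONNECTION_STRING = secrets.get_secret("mongo-connection-string").value  # placeholder name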
- Principle of least privilege: Ensure that any identity or key used cannot do more than necessary. E.g., the Service Bus connection used by Event Grid to put messages should only have send rights on that queue, not manage rights on the whole namespace. The function’s identity should not be an owner of the subscription – it only needs specific roles.
Compliance and readiness:
- If your scenario involves personal data or sensitive info, be mindful of compliance (GDPR etc.), e.g., implement deletion of data after some retention period, secure data in transit (which Azure does by default via HTTPS and AMQP over TLS).
- Ensure you have a backup or export of critical data (if the final results are in a database, back that up regularly).
- Test your disaster recovery: what if the whole region has an outage? Can you rerun the jobs in another region? This might be beyond scope, but just keep in mind depending on how critical the process is.
Finally, keep your team prepared: Document the system, including how to deploy it, how to monitor it, and how to recover from failures. With lots of moving parts (ML jobs, Event Grid, Service Bus, Functions, DB), having a runbook for troubleshooting is invaluable. For example, “If processing is slow, check X, Y, Z metrics; if duplicate events are too many, check Event Grid delivery logs; if the database is the bottleneck, consider scaling it or throttling the input,” and so on.
Conclusion
Designing an Azure Functions-based pipeline for high throughput requires attention to detail in concurrency control, error handling, and system configuration. By applying these best practices, you can significantly improve the reliability and scalability of your serverless application:
- Concurrency management: Use blob leases or other locking only when needed, and do so safely (finite auto-expiring leases, always release in finally). This prevents stuck resources and ensures only one function handles a particular item at a time when required.
- Async best practices: Fully embrace Python async by awaiting all tasks, handling exceptions and cancellations, and cleaning up resources. Avoid common pitfalls like unclosed sessions or tasks running beyond function scope. Good logging and correlation IDs will save you hours of debugging when something goes wrong at scale.
- Messaging resilience: Configure Event Grid and Service Bus for at-least-once delivery by handling duplicates and using duplicate detection. Make your processing idempotent so retries (which will happen) do not cause adverse effects. Utilize dead-letter queues and monitoring to catch any messages that can’t be processed and need attention.
- Scalability and performance: Scale out your function app to meet demand, using autoscaling or manual rules. Keep functions short and focused for better parallelism. If necessary, partition the workload or use multiple function apps to avoid any single bottleneck. Consider advanced patterns (like Durable Functions) if workflow coordination becomes complex, but otherwise keep the event-driven design which is naturally scalable.
- Operational excellence: Set up proper monitoring, alerting, and logging from end to end. Know your system’s typical behavior and have alerts for when it deviates. Test failure scenarios so you’re prepared (e.g., simulate a downstream outage and see that messages indeed queue up and recover).
- Security and configuration: Use managed identities and secure your connections. Limit access rights and network exposure of each component. Keep secrets out of code. Ensure compliance with data handling and retention policies.
By following these guidelines, you build a solution that not only handles the current load but can gracefully scale for more, and most importantly, won’t break in hard-to-debug ways under pressure. In our journey, implementing these changes (like fixing the blob lease usage and adding duplicate filtering) turned a fragile process into a robust one that can process thousands of events consistently every time. Your Azure Functions can achieve the same level of resilience, making you confident to rely on them for critical workloads. Happy optimizing!