The transformation of Digital Imaging and Communications in Medicine (DICOM®) data is a crucial capability in healthcare data solutions (HDS). This feature allows healthcare providers to bring their DICOM® data into Fabric OneLake, enabling the ingestion, storage, and analysis of imaging metadata from various modalities such as X-rays, Computerized Tomography (CT) scans, and Magnetic Resonance Imaging (MRI) scans. By leveraging this capability, healthcare organizations can enhance their digital infrastructure, improve patient care, and streamline their data management processes.
Ingestion patterns
The capability provides several ingestion mechanisms for processing medical images, depending on the use case. For smaller datasets of up to 10 million medical images at once, customers can either upload their data to Fabric's OneLake or connect an external storage account to Fabric.
Let's try to understand the rationale behind the 10 million image limit. Both of the ingestion options mentioned above set up Spark Structured Streaming on the input DICOM® files. During file listing, one of the steps that runs before actual processing starts, Spark gathers metadata about the input files, such as file paths and their associated timestamps. When dealing with millions of files, the file listing process itself is split across multiple executor nodes. Once the executor nodes have fetched the metadata, it is collected back at the driver node to build the execution plan.
The memory allocated for storing file listing results is controlled by the Spark property spark.driver.maxResultSize. Its default value varies by platform; Spark in Microsoft Fabric defaults it to 4 GB. Users can estimate the result size collected at the driver for a given dataset by examining the input folder structure (which determines file path lengths) and keeping some buffer for overhead. They need to make sure the file listing result does not exceed the allocated space (4 GB) to avoid Out of Memory (OOM) errors at the driver. In our experiments, a limit of 10 million input files gave a reliable, successful run within that memory budget.
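As a back-of-the-envelope illustration of that estimate, the sketch below multiplies the file count by an assumed average path length plus a per-file metadata buffer; all figures are hypothetical placeholders, not HDS or Fabric measurements.

```python
# Rough driver-side footprint estimate for the file listing result.
# All figures below are illustrative assumptions, not measured values.

num_files = 10_000_000           # DICOM(R) files in the input dataset
avg_path_bytes = 250             # assumed average UTF-8 length of a file path
per_file_overhead_bytes = 100    # assumed buffer for timestamps, sizes, object headers

estimated_gb = num_files * (avg_path_bytes + per_file_overhead_bytes) / 1024**3
max_result_size_gb = 4           # Fabric's default spark.driver.maxResultSize

print(f"Estimated file listing result: {estimated_gb:.2f} GB")
if estimated_gb >= max_result_size_gb:
    print("Likely OOM at the driver: chunk the dataset or raise spark.driver.maxResultSize")
```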
Since spark.driver.maxResultSize is a Spark configuration, it can be set to a higher value to increase the space allocated for collecting file listing results. However, Spark splits driver memory into multiple regions serving different purposes, so users should be careful when increasing this property: setting it too high can starve other parts of the driver and destabilize the job. Refer to the table below for guidance on tuning the property appropriately in Microsoft Fabric.
Note: The reference table below illustrates how various node sizes and configurations affect file listing capacity and result sizes. It presents rough estimates based on a test dataset deployed in the default unified folder structure in HDS. These values can serve as an initial reference but may vary depending on dataset characteristics; users should not expect identical numbers for their own datasets or use cases.
| Node size | Available memory per node (GB) | Driver node vCores | Executor node vCores | spark.driver.maxResultSize (GB) | File paths size (GB) | Files listed (millions) |
|-----------|--------------------------------|--------------------|----------------------|---------------------------------|----------------------|-------------------------|
| Medium    | 56                             | 8                  | 8                    | 4                               | 3.38                 | 10.8                    |
| Large     | 112                            | 8                  | 16                   | 8                               | 6.75                 | 21.6                    |
| XL        | 224                            | 8                  | 32                   | 16                              | 12.81                | 41                      |
| XXL       | 400                            | 16                 | 64                   | 24                              | 15.94                | 51                      |
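For reference, one way to raise the limit is at the notebook session level via the %%configure magic available in Fabric notebooks; the 8g value below is only an example to be sized against the table above, and the same property can also be set in an Environment's Spark settings. Since this is a driver-level setting, it needs to be in place when the session starts.

```
%%configure -f
{
    "conf": {
        "spark.driver.maxResultSize": "8g"
    }
}
```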
Emergence of inventory-based ingestion
Microsoft Fabric provides a variety of compute node sizes for different use cases. The highest configuration is XX-Large, with 512 GB of memory and 64 vCores. Even on such a node, spark.driver.maxResultSize can only be raised so far, which restricts the dataset size that can be ingested in a single run.
One way to tackle this problem is to segregate the dataset into smaller chunks, so that the file listing result for any single chunk stays within the allocated memory. This is exactly the purpose of the unified folder structure in HDS, where data is segmented by default into date folders. However, it may not always be feasible to reorganize data at the source. This is where HDS inventory-based ingestion comes into play, enabling scalable ingestion of DICOM® imaging files into Fabric.
Inventory-based ingestion is built on the idea of separating the file listing step from the core processing logic. Given the files' metadata in the form of inventory files, which are analogous to a file listing result, users don't need to set up Spark streaming directly on the folder containing DICOM® files; instead, they can consume the metadata from the inventory files and initiate the core processing logic. This avoids the OOM issues that arise from file listing.
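A minimal PySpark sketch of the pattern: stream over the (much smaller) inventory parquet files and hand each micro-batch of file references to the processing logic. The paths, schema, and process_batch function are hypothetical illustrations, not the HDS implementation.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()  # already available in Fabric notebooks

# Hypothetical inventory schema: one row of metadata per DICOM(R) file.
# The actual HDS inventory contract may differ.
inventory_schema = StructType([
    StructField("file_path", StringType()),
    StructField("file_size", LongType()),
])

# Stream over the small inventory parquet files instead of listing the
# DICOM(R) files themselves, so the driver never collects millions of paths.
inventory_stream = (
    spark.readStream
         .format("parquet")
         .schema(inventory_schema)
         .load("Files/imaging/inventory/")   # hypothetical OneLake folder
)

def process_batch(batch_df, batch_id):
    # Each micro-batch carries file references; the core processing logic
    # would open and transform the referenced DICOM(R) files here.
    pass

query = (
    inventory_stream.writeStream
        .option("checkpointLocation", "Files/imaging/checkpoints/")  # hypothetical
        .foreachBatch(process_batch)
        .start()
)
```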
If your data resides in an Azure Data Lake Storage Gen2 account, the out-of-the-box Azure Storage blob inventory feature can generate inventory files in parquet format. Inventory-based ingestion also supports other storage services, but users then need to provide the inventory files in the required format and make some minimal configuration changes.
Capability configurations
This capability exposes several configuration levers, set by updating the deployment parameters config, that tune the ingestion process for better throughput.
- max_files_per_trigger – an interface for Spark Structured Streaming's maxFilesPerTrigger option. It defaults to 100K. For inventory-based ingestion, it is advisable to lower this to 1 or 2, based on the number of records contained in each inventory parquet file.
- max_bytes_per_trigger – an interface for Spark Structured Streaming's maxBytesPerTrigger option. This option does not work with all input file sources, but it does work with parquet sources and therefore becomes relevant for inventory-based ingestion. It defaults to 1 GB.
- rows_per_partition – designed specifically for inventory-based ingestion, where the default number of partitions may not fully utilize the available resources. In a given execution batch, the number of input files is divided by this value to repartition the DataFrame. The default is 250, so a batch of 10 million files yields 10,000,000 / 250 = 40,000 partitions, which translates to 40,000 Spark tasks (see the sketch after this list).
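To make the interplay concrete, here is a hedged sketch of how these levers might combine in a micro-batch. The option names mirror the Spark settings the levers wrap, the repartitioning mirrors the rows_per_partition arithmetic above, and `spark` and `inventory_schema` are reused from the earlier sketch; none of this is the HDS implementation itself.

```python
# Illustrative only: maps the three levers onto a streaming read and shows
# how rows_per_partition drives repartitioning; not the HDS implementation.

rows_per_partition = 250  # deployment parameter default

stream = (
    spark.readStream
         .format("parquet")
         .schema(inventory_schema)
         .option("maxFilesPerTrigger", 2)      # max_files_per_trigger, lowered for inventory files
         .option("maxBytesPerTrigger", "1g")   # max_bytes_per_trigger default
         .load("Files/imaging/inventory/")     # hypothetical OneLake folder
)

def process_batch(batch_df, batch_id):
    batch_size = batch_df.count()
    # rows_per_partition arithmetic: a 10M-record batch yields
    # 10,000,000 / 250 = 40,000 partitions, i.e. 40,000 Spark tasks.
    num_partitions = max(1, batch_size // rows_per_partition)
    repartitioned = batch_df.repartition(num_partitions)
    # ...core DICOM(R) processing on each partition would follow here...
```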
DICOM® is the registered trademark of the National Electrical Manufacturers Association (NEMA) for its Standards publications relating to digital communications of medical information.
Medical device disclaimer: Microsoft products and services (1) are not designed, intended or made available as a medical device, and (2) are not designed or intended to be a substitute for professional medical advice, diagnosis, treatment, or judgment and should not be used to replace or as a substitute for professional medical advice, diagnosis, treatment, or judgment. Customers/partners are responsible for ensuring solutions comply with applicable laws and regulations.