The transformation of Digital Imaging and Communications in Medicine (DICOM®) data is a crucial capability in healthcare data solutions (HDS). This feature allows healthcare providers to bring their DICOM® data into Fabric OneLake, enabling the ingestion, storage, and analysis of imaging metadata from various modalities such as X-rays, Computerized Tomography (CT) scans, and Magnetic Resonance Imaging (MRI) scans. By leveraging this capability, healthcare organizations can enhance their digital infrastructure, improve patient care, and streamline their data management processes.
Ingestion patterns
The capability provides several ingestion mechanisms for processing medical images, depending on the use case. For smaller datasets of up to 10 million medical images at once, customers can either upload their data to Fabric's OneLake or connect an external storage account to Fabric.
Let's try to understand the rationale behind the 10 million image limit. Both of the ingestion options mentioned above set up Spark Structured Streaming on the input DICOM® files. During file listing, one of the steps that runs before actual processing starts, Spark gathers metadata about the input files, such as file paths and their associated timestamps. When dealing with millions of files, the file listing process itself is split across multiple executor nodes. Once the executor nodes have fetched the metadata, it is collected back at the driver node to build the execution plan.
The memory allocated for storing file listing results is controlled by the Spark property spark.driver.maxResultSize. Its default value varies by platform; Spark in Microsoft Fabric defaults it to 4 GB. Users can estimate the result size collected at the driver for a given dataset by examining the input folder structure (which determines file path lengths) and keeping some buffer for overhead. They need to make sure the file listing result does not exceed the allocated space (4 GB) to avoid Out of Memory (OOM) errors at the driver. In our experiments, a limit of 10 million input files gave a reliable, successful run within that memory budget.
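As a back-of-the-envelope illustration of that estimate, the sketch below multiplies the file count by an assumed average path length plus a per-file metadata buffer; all figures are hypothetical placeholders, not HDS or Fabric measurements.

```python
# Rough driver-side footprint estimate for the file listing result.
# All figures below are illustrative assumptions, not measured values.

num_files = 10_000_000           # DICOM(R) files in the input dataset
avg_path_bytes = 250             # assumed average UTF-8 length of a file path
per_file_overhead_bytes = 100    # assumed buffer for timestamps, sizes, object headers

estimated_gb = num_files * (avg_path_bytes + per_file_overhead_bytes) / 1024**3
max_result_size_gb = 4           # Fabric's default spark.driver.maxResultSize

print(f"Estimated file listing result: {estimated_gb:.2f} GB")
if estimated_gb >= max_result_size_gb:
    print("Likely OOM at the driver: chunk the dataset or raise spark.driver.maxResultSize")
```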
Since spark.driver.maxResultSize is a Spark configuration, it can be set to a higher value to increase the space allocated for collecting file listing results. However, Spark splits driver memory into multiple regions serving different purposes, so users should be careful when increasing this property: setting it too high can starve other parts of the driver and destabilize the job. Refer to the table below for guidance on tuning the property appropriately in Microsoft Fabric.
Note: The reference table below illustrates how various node sizes and configurations affect file listing capacity and result sizes. It presents rough estimates based on a test dataset deployed in the default unified folder structure in HDS. These values can serve as an initial reference but may vary depending on dataset characteristics; users should not expect identical numbers for their own datasets or use cases.
| Node size | Available memory per node (GB) | Driver node vCores | Executor node vCores | spark.driver.maxResultSize (GB) | File paths size (GB) | Files listed (millions) |
|-----------|--------------------------------|--------------------|----------------------|---------------------------------|----------------------|-------------------------|
| Medium    | 56                             | 8                  | 8                    | 4                               | 3.38                 | 10.8                    |
| Large     | 112                            | 8                  | 16                   | 8                               | 6.75                 | 21.6                    |
| XL        | 224                            | 8                  | 32                   | 16                              | 12.81                | 41                      |
| XXL       | 400                            | 16                 | 64                   | 24                              | 15.94                | 51                      |
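For reference, one way to raise the limit is at the notebook session level via the %%configure magic available in Fabric notebooks; the 8g value below is only an example to be sized against the table above, and the same property can also be set in an Environment's Spark settings. Since this is a driver-level setting, it needs to be in place when the session starts.

```
%%configure -f
{
    "conf": {
        "spark.driver.maxResultSize": "8g"
    }
}
```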
Emergence of inventory-based ingestion
Microsoft Fabric provides a variety of compute node sizes for different use cases. The highest configuration is XX-Large, with 512 GB of memory and 64 vCores. Even on such a node, spark.driver.maxResultSize can only be raised so far, which restricts the dataset size that can be ingested in a single run.
One way to tackle this problem is to segregate the dataset into smaller chunks, so that the file listing result for any single chunk stays within the allocated memory. This is exactly the purpose of the unified folder structure in HDS, where data is segmented by default into date folders. However, it may not always be feasible to reorganize data at the source. This is where HDS inventory-based ingestion comes into play, enabling scalable ingestion of DICOM® imaging files into Fabric.
Inventory-based ingestion is built on the idea of separating the file listing step from the core processing logic. Given the files' metadata in the form of inventory files, which are analogous to a file listing result, users don't need to set up Spark streaming directly on the folder containing DICOM® files; instead, they can consume the metadata from the inventory files and initiate the core processing logic. This avoids the OOM issues that arise from file listing.
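A minimal PySpark sketch of the pattern: stream over the (much smaller) inventory parquet files and hand each micro-batch of file references to the processing logic. The paths, schema, and process_batch function are hypothetical illustrations, not the HDS implementation.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()  # already available in Fabric notebooks

# Hypothetical inventory schema: one row of metadata per DICOM(R) file.
# The actual HDS inventory contract may differ.
inventory_schema = StructType([
    StructField("file_path", StringType()),
    StructField("file_size", LongType()),
])

# Stream over the small inventory parquet files instead of listing the
# DICOM(R) files themselves, so the driver never collects millions of paths.
inventory_stream = (
    spark.readStream
         .format("parquet")
         .schema(inventory_schema)
         .load("Files/imaging/inventory/")   # hypothetical OneLake folder
)

def process_batch(batch_df, batch_id):
    # Each micro-batch carries file references; the core processing logic
    # would open and transform the referenced DICOM(R) files here.
    pass

query = (
    inventory_stream.writeStream
        .option("checkpointLocation", "Files/imaging/checkpoints/")  # hypothetical
        .foreachBatch(process_batch)
        .start()
)
```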
If your data resides in an Azure Data Lake Storage Gen2 account, the out-of-the-box Azure Storage blob inventory feature can generate inventory files in parquet format. Inventory-based ingestion also supports other storage services, but users then need to provide the inventory files in the required format and make some minimal configuration changes.
Capability configurations
This capability exposes several configuration levers, set by updating the deployment parameters config, that tune the ingestion process for better throughput.
- max_files_per_trigger – an interface for Spark Structured Streaming's maxFilesPerTrigger option. It defaults to 100K. For inventory-based ingestion, it is advisable to lower this to 1 or 2, based on the number of records contained in each inventory parquet file.
- max_bytes_per_trigger – an interface for Spark Structured Streaming's maxBytesPerTrigger option. This option does not work with all input file sources, but it does work with parquet sources and therefore becomes relevant for inventory-based ingestion. It defaults to 1 GB.
- rows_per_partition – designed specifically for inventory-based ingestion, where the default number of partitions may not fully utilize the available resources. In a given execution batch, the number of input files is divided by this value to repartition the DataFrame. The default is 250, so a batch of 10 million files yields 10,000,000 / 250 = 40,000 partitions, which translates to 40,000 Spark tasks (see the sketch after this list).
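To make the interplay concrete, here is a hedged sketch of how these levers might combine in a micro-batch. The option names mirror the Spark settings the levers wrap, the repartitioning mirrors the rows_per_partition arithmetic above, and `spark` and `inventory_schema` are reused from the earlier sketch; none of this is the HDS implementation itself.

```python
# Illustrative only: maps the three levers onto a streaming read and shows
# how rows_per_partition drives repartitioning; not the HDS implementation.

rows_per_partition = 250  # deployment parameter default

stream = (
    spark.readStream
         .format("parquet")
         .schema(inventory_schema)
         .option("maxFilesPerTrigger", 2)      # max_files_per_trigger, lowered for inventory files
         .option("maxBytesPerTrigger", "1g")   # max_bytes_per_trigger default
         .load("Files/imaging/inventory/")     # hypothetical OneLake folder
)

def process_batch(batch_df, batch_id):
    batch_size = batch_df.count()
    # rows_per_partition arithmetic: a 10M-record batch yields
    # 10,000,000 / 250 = 40,000 partitions, i.e. 40,000 Spark tasks.
    num_partitions = max(1, batch_size // rows_per_partition)
    repartitioned = batch_df.repartition(num_partitions)
    # ...core DICOM(R) processing on each partition would follow here...
```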
DICOM® is the registered trademark of the National Electrical Manufacturers Association (NEMA) for its Standards publications relating to digital communications of medical information.
Medical device disclaimer: Microsoft products and services (1) are not designed, intended or made available as a medical device, and (2) are not designed or intended to be a substitute for professional medical advice, diagnosis, treatment, or judgment and should not be used to replace or as a substitute for professional medical advice, diagnosis, treatment, or judgment. Customers/partners are responsible for ensuring solutions comply with applicable laws and regulations.