Authors:
- Jared Erwin, Senior Software Engineer, HLS Nursing AI and Data Platform, Faculty UW School of Medicine
- Manoj Kumar, Director, HLS – Data & AI HLS Frontiers AI
- Alberto Santamaria-Pang, Principal Applied Data Scientist, HLS Frontiers AI and Adjunct Faculty, Johns Hopkins Medicine
Overview
In Part 1 of this series, we showed how natural language could be used to define medical imaging cohorts and retrieve relevant studies in seconds instead of months. That proof of concept demonstrated the value of the idea, but not how to make it repeatable or production-ready.
This post focuses on how we turned that prototype into a production-oriented Azure Machine Learning pipeline that scales execution and produces clear, versioned artifacts to drive an interactive cohort exploration UI.
If you’re building ML pipelines for medical imaging, or any domain where data is large, messy, and locked behind access controls, we hope our experience saves you time.
From scripts to a pipeline: Why Azure ML components?
The original hackathon implementation consisted of notebooks and scripts that required careful manual execution. To make the system repeatable and auditable, we standardized it using Azure ML pipelines.
Azure ML pipelines gave us:
- Componentized execution — each processing step is a self-contained unit with defined inputs, outputs, and dependencies
- Parallel branches — steps that don’t depend on each other run concurrently
- Reproducibility — every run is versioned and logged with full lineage
- Compute flexibility — run on CPU for metadata extraction, GPU for model inference, without manual orchestration
The pipeline architecture
The pipeline consists of five Python components arranged in a DAG with two parallel branches:
- [0] DICOM Metadata Extractor scans a DICOM directory and extracts metadata from headers: study/series UIDs, modality, body part, slice counts.
- [1] Anatomy Classification classifies each series by anatomy and orientation using a multi-tier strategy (more on this below).
- [2] Text Template Generator and [3] FAISS Index Builder form the search pipeline: anatomy labels are converted to natural language text templates, then encoded with BiomedCLIP and stored in a FAISS vector index.
- [4] Visualization Enrichment generates 2D UMAP coordinates from the embeddings for the interactive scatter plot visualization in the UI.
The image depicts a flowchart detailing the process of DICOM metadata extraction, anatomy classification, visualization enrichment, and text template generation, followed by the creation of a FAISS vector index.
Components 2 and 4 run in parallel after component 1 completes, saving roughly 10-15% of total execution time. It’s a modest gain for a single run, but it adds up when iterating on pipeline parameters.
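To make the branching concrete, here is a minimal sketch of the dependency graph using Python's standard-library `graphlib`. The component names and edges are our reading of the description above (component 2 consuming outputs of 0 and 1, component 4 consuming output of 1), so treat them as assumptions rather than the actual pipeline definition.

```python
from graphlib import TopologicalSorter

# Each component maps to the components it depends on.
# Edges are inferred from the prose; the real pipeline YAML may differ.
DEPENDENCIES = {
    "0_dicom_metadata_extractor": set(),
    "1_anatomy_classification": {"0_dicom_metadata_extractor"},
    "2_text_template_generator": {
        "0_dicom_metadata_extractor",
        "1_anatomy_classification",
    },
    "3_faiss_index_builder": {"2_text_template_generator"},
    "4_visualization_enrichment": {"1_anatomy_classification"},
}


def execution_stages(dependencies):
    """Group components into stages; components in one stage can run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are complete
        stages.append(ready)
        sorter.done(*ready)
    return stages


STAGES = execution_stages(DEPENDENCIES)
for number, stage in enumerate(STAGES):
    print(f"stage {number}: {', '.join(stage)}")
```

Grouping into stages is a simplification: a real scheduler starts component 3 as soon as component 2 finishes, without waiting for component 4.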
[1] Anatomy classification, integrating MedImageInsight
The Anatomy classification component in the pipeline relies on MedImageInsight (MI2). MedImageInsight is Microsoft’s foundation model for medical image understanding, available through the Azure AI Foundry model catalog. Unlike generative models, MedImageInsight is an embedding model — it maps medical images and text into a shared 1024-dimensional vector space, enabling tasks like classification and similarity search by comparing image embeddings against text label embeddings.
Given a DICOM image, we compare its embedding against candidate labels (e.g., “Brain”, “Chest”, “Abdomen”) to determine the body part, scan orientation, and other imaging characteristics through zero-shot classification.
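The zero-shot step can be illustrated with a small sketch in plain Python. It assumes the image and label embeddings have already been obtained from the model; the three-dimensional toy vectors and the function names are hypothetical stand-ins for the real 1024-dimensional embeddings and for whatever client code calls the endpoint.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return (best_label, scores), where scores maps each label to its similarity."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy vectors for illustration only; real embeddings come from the model.
LABEL_EMBEDDINGS = {
    "Brain": [0.9, 0.1, 0.0],
    "Chest": [0.1, 0.9, 0.1],
    "Abdomen": [0.0, 0.2, 0.9],
}
IMAGE_EMBEDDING = [0.8, 0.3, 0.1]

BEST_LABEL, SCORES = zero_shot_classify(IMAGE_EMBEDDING, LABEL_EMBEDDINGS)
print(BEST_LABEL, {label: round(score, 3) for label, score in SCORES.items()})
```

The same comparison can be repeated with a different candidate label set (for example orientations) to classify other imaging characteristics.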
Anatomy may also be annotated directly in the DICOM headers, in which case it arrives from component 0, the DICOM metadata extractor. We combine both signals to build the final search index.
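One way to combine the two signals is a simple fallback rule, sketched below. The tier order (header first, model second), the `HEADER_TO_LABEL` mapping, and the `resolve_anatomy` helper are assumptions for illustration; the actual multi-tier strategy may weigh the sources differently.

```python
from typing import Optional

# Hypothetical mapping from DICOM BodyPartExamined values to classifier labels.
# Real headers are far less tidy than this.
HEADER_TO_LABEL = {
    "BRAIN": "Brain",
    "HEAD": "Brain",
    "CHEST": "Chest",
    "ABDOMEN": "Abdomen",
}


def resolve_anatomy(header_value: Optional[str], zero_shot_label: str) -> dict:
    """Prefer a recognized header annotation; otherwise fall back to the model label."""
    normalized = (header_value or "").strip().upper()
    if normalized in HEADER_TO_LABEL:
        return {"anatomy": HEADER_TO_LABEL[normalized], "source": "dicom_header"}
    return {"anatomy": zero_shot_label, "source": "zero_shot"}


print(resolve_anatomy(" chest ", "Brain"))  # header wins when it is recognized
print(resolve_anatomy(None, "Brain"))       # missing header falls back to the model
```

Recording the `source` alongside the label keeps the final index auditable, since a reviewer can tell which series were labeled by the header and which by the model.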
[2] [3] FAISS index construction
As input to the FAISS index, we first run component 2, the text template generator. This component takes the metadata and anatomy information from components 0 and 1 and feeds them into five different agents, each with different instructions on how to describe the DICOM study. The result is a set of textual descriptions with some variation, referred to as text templates, which are indexed in the next component.
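To show what a set of text templates might look like, here is a deliberately simplified stand-in that uses fixed format strings instead of agents. The phrasings, field names, and the `generate_text_templates` helper are all hypothetical; the real component produces its variation through agent instructions.

```python
# Hypothetical phrasings; the pipeline itself uses five agents, not format strings.
TEMPLATES = [
    "{modality} scan of the {anatomy} in {orientation} orientation",
    "{orientation} {modality} series showing the {anatomy}",
    "{anatomy} {modality}, {orientation} view, {slices} slices",
    "Medical image: {modality} of the {anatomy} ({orientation})",
    "{slices}-slice {orientation} {modality} study of the {anatomy}",
]


def generate_text_templates(series: dict) -> list:
    """Render one description per template for a single DICOM series."""
    fields = {
        "modality": series["modality"],
        "anatomy": series["anatomy"].lower(),
        "orientation": series["orientation"].lower(),
        "slices": series["slice_count"],
    }
    return [template.format(**fields) for template in TEMPLATES]


EXAMPLE_SERIES = {
    "modality": "MR",
    "anatomy": "Brain",
    "orientation": "Axial",
    "slice_count": 24,
}
DESCRIPTIONS = generate_text_templates(EXAMPLE_SERIES)
for description in DESCRIPTIONS:
    print(description)
```

Indexing several phrasings per series gives a natural language query more chances to land near a matching description in embedding space.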
The FAISS index builder (component 3) uses BiomedCLIP to encode all text templates into 512-dimensional vectors:
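```python
from typing import List

import numpy as np
import torch
import torch.nn.functional as F

MODEL_NAME = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"

# Method of the encoder class; self.model, self.tokenizer, and self.device
# are set up elsewhere when the BiomedCLIP model is loaded.
@torch.no_grad()
def encode(self, texts: List[str], batch_size: int = 256) -> np.ndarray:
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        tokens = self.tokenizer(batch).to(self.device)
        batch_embeddings = self.model.encode_text(tokens)
        batch_embeddings = F.normalize(batch_embeddings, dim=-1)  # L2 normalize
        embeddings.append(batch_embeddings.cpu().numpy())
    # Stack per-batch arrays into one (num_texts, 512) matrix
    return np.vstack(embeddings)
```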
We L2-normalize all vectors and use faiss.IndexFlatIP (inner product), which is equivalent to cosine similarity on normalized vectors. For our current dataset sizes (thousands of series), flat indexing is fast enough. For hospital-scale datasets with millions of images, we might switch to IndexIVFFlat or IndexHNSWFlat for approximate nearest neighbor search.
In the cohort explorer app, a user will enter a natural language query, which is then converted to embeddings using the same BiomedCLIP model. This allows a search using the FAISS index to find relevant DICOM studies.
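The query path can be sketched in plain Python as an exhaustive inner-product search over normalized vectors, which is conceptually what a flat inner-product index does. This is not the FAISS API: the helper names and two-dimensional toy vectors are hypothetical, and the query embedding is assumed to come from the same text encoder as the indexed templates.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query_embedding, index_vectors, k=2):
    """Score every indexed vector and return the top-k (position, score) pairs."""
    query = l2_normalize(query_embedding)
    scored = [
        (position, inner_product(query, vector))
        for position, vector in enumerate(index_vectors)
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy "text template" embeddings, normalized once at index-build time.
RAW_VECTORS = [[3.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
INDEX_VECTORS = [l2_normalize(vector) for vector in RAW_VECTORS]

RESULTS = search([2.0, 0.5], INDEX_VECTORS, k=2)
for position, score in RESULTS:
    print(position, round(score, 3))
```

Each returned position would map back to a text template and, through it, to the DICOM series it describes.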
[4] Visualization: making embeddings explorable
The scatter plot in the UI is often the first thing users interact with. It needs to show meaningful clusters without requiring users to understand dimensionality reduction.
Component 4 takes the embeddings from component 1 and projects them to 2D with UMAP:
```python
from umap import UMAP

# `features` is the (n_series, embedding_dim) array of embeddings from component 1.
umap = UMAP(
    n_components=2,
    n_neighbors=10,     # Balances local vs. global structure
    min_dist=0.5,       # Prevents over-clustering
    metric='cosine',    # Matches our embedding similarity metric
    random_state=42,    # Reproducible layouts
)
coordinates_2d = umap.fit_transform(features)
```
Each point in the scatter plot corresponds to a single DICOM series produced by the pipeline, with color, grouping, and hover metadata derived directly from the JSON artifacts emitted by components 1 and 4.
Each pipeline run produces a small set of well-defined artifacts — metadata tables, embedding vectors, UMAP coordinates, and the FAISS index — which are consumed directly by the cohort exploration UI. The cohort explorer application can reload or switch between datasets.
The diagram is a screen capture of an Azure ML pipeline. It includes 5 pipeline components along with connecting arrows showing incoming and outgoing data, including the final outputs of the pipeline.
Pipeline execution: time, cost, and what we learned
Here’s what a typical pipeline run looks like for a dataset of ~4,500 DICOM series:
| Component | Task | Approximate Time (CPU) | Approximate Time (GPU) |
| --- | --- | --- | --- |
| 0 – DICOM Metadata Extractor | Scan files, extract headers | 5-10 min | 5-10 min |
| 1 – Anatomy Classification | Classify anatomy/orientation | 90-120 min | 5-10 min |
| 2 – Text Template Generator | Generate 5 templates per series | 5-10 min | 5-10 min |
| 3 – FAISS Index Builder | BiomedCLIP encoding + FAISS build | 60-90 min | 10-15 min |
| 4 – Visualization Enrichment | UMAP + color assignment | 20-40 min | 5-10 min |
| Azure ML overhead | Compute provisioning, env setup | 5-10 min | 5-10 min |
| Total | | ~200-300 min | ~30-50 min |
Key observations:
- Azure ML overhead is significant when doing quick iteration and testing. Compute provisioning, conda environment builds, and data mounting add several minutes before any component code runs. We first built each component as Python code that runs locally, and debugged it there before our first Azure ML run. This let us iterate quickly and avoid cloud costs until we were ready.
- BiomedCLIP encoding is a major cost on CPU. Component 3 takes 60-90 min on CPU, second only to anatomy classification. Moving this component to GPU compute cuts it to 10-15 min in the run above, but GPU clusters cost more per hour. For a pipeline you run occasionally, CPU is fine. For frequent re-indexing, GPU pays for itself.
- Batch size tuning matters. Our default batch size of 256 for BiomedCLIP encoding balances memory and throughput. On GPU, you can push to 512. On CPU with limited RAM, drop to 128.
At Scale: 120,000 Images, CPU vs. GPU
We ran the full pipeline against a larger dataset of ~120,000 images to understand how compute choice affects end-to-end time and cost:
| | CPU Pipeline | GPU Pipeline |
| --- | --- | --- |
| Pipeline compute time | 4 days, 12 hours (108 hrs) | 15 hours |
| Pipeline compute cost | ~$0.25/hr × 108 hrs = ~$27 | ~$3.00/hr × 15 hrs = ~$45 |
| MedImageInsight endpoint (MaaP on Standard_NC4as_T4_v3) | ~$151 | ~$21 |
| Total estimated cost | ~$178 | ~$66 |
Both pipeline runs make the same ~120,000 classification calls to the MedImageInsight endpoint, but those calls are spread over different time periods depending on how quickly the pipeline can issue them. The MedImageInsight endpoint on a Standard_NC4as_T4_v3 VM costs ~$1.40/hr, and at that rate the full duration of each run (108 hrs and 15 hrs) yields the MedImageInsight estimates in the table above.
GPU compute was roughly 7× faster at about 0.37× the total cost once endpoint costs are included. This was a key learning: the higher hourly rate of GPU compute is more than offset by the shorter run time.
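The arithmetic behind these figures can be reproduced with a short sketch. It uses only the rates and durations quoted above and assumes, as the table implies, that the endpoint is billed for the full duration of each run; the `estimate_run` helper is hypothetical.

```python
ENDPOINT_RATE = 1.40  # USD/hr for the MedImageInsight endpoint VM


def estimate_run(compute_rate, hours, endpoint_rate=ENDPOINT_RATE):
    """Estimate costs for one run, assuming the endpoint stays up for the whole run."""
    compute_cost = compute_rate * hours
    endpoint_cost = endpoint_rate * hours
    return {
        "hours": hours,
        "compute_cost": compute_cost,
        "endpoint_cost": endpoint_cost,
        "total_cost": compute_cost + endpoint_cost,
    }


CPU_RUN = estimate_run(compute_rate=0.25, hours=108)
GPU_RUN = estimate_run(compute_rate=3.00, hours=15)

SPEEDUP = CPU_RUN["hours"] / GPU_RUN["hours"]
COST_RATIO = GPU_RUN["total_cost"] / CPU_RUN["total_cost"]

print(f"CPU total: ${CPU_RUN['total_cost']:.0f}, GPU total: ${GPU_RUN['total_cost']:.0f}")
print(f"speedup: {SPEEDUP:.1f}x, cost ratio: {COST_RATIO:.2f}")
```

Because the endpoint bills by the hour, the endpoint line item shrinks in proportion to run time, which is why the faster GPU run ends up cheaper overall despite its higher compute rate.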
MedImageInsight can be deployed in two ways, depending on dataset size and operational needs.
For smaller or infrequently processed datasets, we deploy MedImageInsight as a managed Azure ML online endpoint and invoke it from the pipeline. This keeps the pipeline simpler and avoids managing the MedImageInsight compute directly, while offering comparable performance at modest scale.
For larger batch workloads, an alternative approach is to load MedImageInsight directly on the Azure ML pipeline’s GPU-backed compute. In this model, the pipeline handles both model loading and classification, eliminating per-request network round trips and the fixed cost of hosting a persistent endpoint.
While this approach requires slightly longer pipeline run time, it becomes more cost‑effective at scale by avoiding endpoint overhead and improving throughput during bulk processing.
Possible future enhancements
- Additional modalities: Extending the pipeline and classification to CT, X-ray, and ultrasound imaging, and building on the pattern for pathology images
- Image embeddings fusion: Combining MedImageInsight image embeddings with text embeddings for hybrid search
- Condition-aware search: Enabling queries about findings and conditions, not just imaging parameters
The gap between a hackathon demo and a production system is where the real engineering happens. We hope sharing our journey helps others building similar systems.
If you're interested in partnering with us to work toward this goal, or need access to the GitHub repo with the pipeline and UI code, contact the authors through your Microsoft account team or reach out to the Microsoft HLS AI Frontiers team.
The healthcare AI models in Microsoft Foundry are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.