Table of Contents
Enterprise RAG: Challenges and Requirements
Adapting the Blueprint for Azure Deployment
Azure NetApp Files: Powering High-Performance RAG Workloads
Why Azure NetApp Files works well for RAG
Service Levels for RAG Workloads
Dynamic Service Level Adjustments
Snapshot Capabilities for ML Versioning
Azure Reference Architecture for Enterprise RAG
Implementation Guide: Building the Pipeline
Evaluating Your Enterprise RAG Pipeline
Enterprise Use Cases and Real-World Applications
Regulatory Compliance and Document Processing
Legal and Professional Services
Abstract
This blog post shows how to build an enterprise-ready Retrieval-Augmented Generation (RAG) pipeline through a collaboration between NetApp, NVIDIA, and Microsoft. It provides a step-by-step guide and reference architecture for handling large-scale, multimodal enterprise content with fast, accurate responses powered by NVIDIA’s AI Blueprint for RAG and high-performance Azure NetApp Files storage. This architecture supports diverse applications including enterprise search, customer support, and compliance tools, offering a scalable and secure foundation for production generative AI implementations.
Co-authors:
- Rajeev Chawla, Sr. Director Product Management, Azure NetApp Files
- Joseph Wu, Senior Solutions Architect, NVIDIA
- Alexander Zeltov, Senior Solutions Architect, NVIDIA
- Abhishek Sawarkar, Product Manager, NVIDIA
- Kyle Radder, Technical Marketing Engineer, Azure NetApp Files
- Asutosh Panda, Technical Marketing Engineer, Azure NetApp Files
- Larry Kuhn, Director, Partner Technology Strategist, Microsoft
- Raj Nemani, Director, Partner Technology Strategist, Microsoft
- Rajesh Vasireddy, Director, Partner Technology Strategist, Microsoft
Introduction
In today’s data-driven world, enterprises are sitting on a goldmine of information locked inside millions of documents, diagrams, and other formats. The challenge is turning that content into actionable insight. Retrieval-Augmented Generation (RAG) offers a powerful way to do that by combining large language models (LLMs) with the ability to pull context directly from your own enterprise data.
In this blog post, we’ll walk through a reference architecture for deploying RAG at enterprise scale using Microsoft Azure, NVIDIA’s AI Blueprint for RAG, and Azure NetApp Files. This setup supports production workloads involving multimodal content and high volumes of data. It’s built to scale, perform, and meet enterprise-grade requirements for compliance, security, and governance.
Enterprise RAG: Challenges and Requirements
Scaling RAG across an enterprise means facing some real challenges. You’re dealing with massive volumes of content, tight latency requirements, and the need for accurate results—all without compromising compliance or performance. To make RAG work at enterprise scale, you’ll need to solve challenges like these:
- Handling multimodal content: Enterprise data isn’t just plain text. It includes images, tables, diagrams, and complex document layouts. Your RAG system needs to handle all of it.
- Managing large document volumes: Many organizations are working with millions of files. You need a pipeline that can scale ingestion, processing, and storage without breaking down.
- Keeping latency low: For applications like chatbots and interactive search tools, users expect near-instant answers. That means fast retrieval, efficient search, and minimal delay.
- Delivering relevant answers: An LLM is only as good as the context it receives. You need strong embedding models, smart chunking, and sometimes reranking to ensure relevance.
- Data Access Control and Permission Management: Users should only receive results from documents they are authorized to access. This requires capturing file system permissions during ingestion from Azure NetApp Files and filtering search results based on user identity and authorization levels. We’ll explore strategies for integrating file system ACLs and permissions with vector retrieval in an upcoming blog post focused on secure enterprise RAG implementations.
- Meeting enterprise standards: RAG systems must also support security, compliance, availability, and cost-efficiency.
NVIDIA AI Blueprint for RAG
NVIDIA’s AI Blueprint for RAG provides a flexible and powerful foundation for building high-performance RAG pipelines.
It includes:
NeMo Retriever Models: Pretrained models for extracting text, tables, and visual elements from PDFs and other complex formats:
- NeMo Retriever for page elements extraction
- NeMo Retriever for table structure recognition
- NeMo Retriever for graphic elements detection
NVIDIA NIM Microservices: Prebuilt, highly optimized inference microservices for deploying embedding models, re-rankers, and LLMs on NVIDIA GPUs.
GPU Acceleration for Vector Operations: The Blueprint leverages NVIDIA GPUs to accelerate operations, including:
- Vector embedding generation
- Similarity search computations
- Document processing tasks
This acceleration delivers the performance needed for enterprise-scale implementations, significantly reducing processing time and improving query response times.
Adapting the Blueprint for Azure Deployment
To run this architecture on Azure, you map components as follows:
- Infrastructure Mapping: Map the Blueprint components to appropriate Azure services:
  - Azure Kubernetes Service (AKS) for container orchestration
  - Azure NetApp Files for high-performance storage
- Networking Configuration: Set up proper networking with VNet peering between AKS and Azure NetApp Files to ensure high-throughput data access (see the peering sketch below).
- Resource Sizing: Configure appropriate GPU-enabled virtual machines and storage tiers to meet performance requirements.
This gives you the power of NVIDIA’s reference stack with the manageability and scale of Azure.
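If your AKS cluster and the subnet delegated to Azure NetApp Files live in different virtual networks, the peering mentioned above can be created with the Azure CLI. The following is a minimal sketch only; the resource group, VNet names, and VNet IDs ($AKS_VNET_NAME, $ANF_VNET_NAME, $AKS_VNET_ID, $ANF_VNET_ID) are placeholders you would substitute for your environment:
$ # Peer the AKS VNet to the VNet hosting the Azure NetApp Files delegated subnet, and vice versa
$ az network vnet peering create --resource-group $RESOURCE_GROUP \
    --name aks-to-anf --vnet-name $AKS_VNET_NAME \
    --remote-vnet $ANF_VNET_ID --allow-vnet-access
$ az network vnet peering create --resource-group $RESOURCE_GROUP \
    --name anf-to-aks --vnet-name $ANF_VNET_NAME \
    --remote-vnet $AKS_VNET_ID --allow-vnet-access
Peering must be created in both directions before NFS traffic can flow between the cluster nodes and the volume. If both resources share a single VNet, no peering is needed.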
Azure NetApp Files: Powering High-Performance RAG Workloads
Azure NetApp Files is a fully managed, enterprise-grade file storage service that brings NetApp’s proven data management capabilities to the Azure ecosystem. It delivers the performance, reliability, and manageability you need for enterprise-grade RAG pipelines.
Why Azure NetApp Files works well for RAG
Azure NetApp Files provides a robust foundation for enterprise RAG workloads through its comprehensive feature set:
- Enterprise-grade reliability with 99.99% availability SLA
- Seamless integration with Azure services and existing applications
- Comprehensive security with encryption at rest and in transit
- Simplified management through the Azure portal and APIs
These capabilities ensure that RAG pipelines built on Azure NetApp Files meet enterprise requirements for reliability, security, and manageability.
Service Levels for RAG Workloads
Azure NetApp Files offers multiple service levels that can be aligned with different phases of the RAG pipeline:
- Standard: Delivers 16 MiB/s per terabyte of allocated capacity, suitable for initial data ingestion and document storage.
- Premium: Provides 64 MiB/s per terabyte, ideal for active document processing and embedding generation.
- Ultra: Offers 128 MiB/s per terabyte, ensuring maximum performance for high-throughput retrieval operations.
- Flexible: Allows independent provisioning of capacity and throughput for custom capacity/performance workload profiles.
This approach allows organizations to match storage performance to workload requirements, optimizing both performance and cost.
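For example, a 4 TiB volume at the Premium level is entitled to roughly 4 × 64 = 256 MiB/s of throughput, while the same capacity at Ultra provides 4 × 128 = 512 MiB/s; with the Flexible service level, throughput is provisioned independently of the 4 TiB capacity.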
Dynamic Service Level Adjustments
One of the most powerful features of Azure NetApp Files for RAG workloads is the ability to change service levels dynamically without data migration:
- Volumes can be moved between tiers without moving data
- No application reconfiguration required
- No disruption to data access
- Changes take effect immediately
This capability enables organizations to adapt storage performance to changing workload demands, such as scaling up during intensive embedding generation phases and scaling down for steady-state operations.
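As an illustration, a service level change is performed by moving the volume to a capacity pool of the target tier; one way to do this with the Azure CLI is sketched below (the variables, including $ULTRA_POOL_ID, are placeholders for your own resources):
$ # Move the volume to a capacity pool at the Ultra service level before a heavy embedding run
$ az netappfiles volume pool-change --resource-group $RESOURCE_GROUP \
    --account-name $ANF_ACCOUNT_NAME --pool-name $POOL_NAME \
    --name $VOLUME_NAME --new-pool-resource-id $ULTRA_POOL_ID
The data stays in place; only the volume’s service level, and therefore its throughput entitlement, changes.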
Snapshot Capabilities for ML Versioning
Azure NetApp Files provides built-in snapshot functionality that supports critical versioning needs in ML pipelines:
- Create instant, space-efficient snapshots of data volumes
- Capture point-in-time versions of datasets, embeddings, and models
- Enable rapid rollback to previous states for reproducibility
- Support A/B testing of different model versions
- Integrate with orchestration frameworks through APIs
These capabilities ensure that the data and model states are properly versioned, supporting reproducibility and auditability requirements in enterprise AI systems.
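As a small illustration, capturing a point-in-time version of an embeddings volume is a single CLI call; a sketch, assuming the account, pool, and volume variables used later in this post:
$ # Snapshot the volume holding the current embedding set before re-indexing
$ az netappfiles snapshot create --resource-group $RESOURCE_GROUP \
    --account-name $ANF_ACCOUNT_NAME --pool-name $POOL_NAME \
    --volume-name $VOLUME_NAME --name embeddings-v1 --location $LOCATION
Snapshots are space-efficient and can later be restored or cloned to a new volume for rollback or A/B testing.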
Cost Optimization Strategies
Azure NetApp Files offers several approaches to optimize storage costs in RAG implementations:
- Right-sizing volumes based on actual data requirements
- Leveraging appropriate storage tiers for different workload phases
- Using cool access tiering for infrequently accessed data
- Employing space-efficient snapshots instead of full copies
- Utilizing reserved capacity options for predictable workloads
By implementing these strategies, organizations can achieve significant cost savings while maintaining the performance needed for effective RAG pipelines.
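For instance, right-sizing an over-provisioned volume is a single CLI call; a minimal sketch, assuming the variables used elsewhere in this post:
$ # Shrink the volume quota to 2 TiB (2048 GiB) once the bulk ingestion phase is complete
$ az netappfiles volume update --resource-group $RESOURCE_GROUP \
    --account-name $ANF_ACCOUNT_NAME --pool-name $POOL_NAME \
    --name $VOLUME_NAME --usage-threshold 2048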
Azure Reference Architecture for Enterprise RAG
Let’s break down the main building blocks of the architecture and how they work together.
Azure Kubernetes Service (AKS) with NVIDIA GPU Nodes: AKS provides the orchestration layer for running containerized AI services. It supports GPU acceleration and scales to meet dynamic workloads. The AKS environment includes the following key elements:
- System nodes handle pipeline orchestration, logging, and operational control.
- GPU-enabled worker nodes (e.g., NC-series VMs with NVIDIA H100 or A100) run AI model workloads:
  - NVIDIA NIM for embedding generation and LLM-based response generation.
  - NVIDIA NeMo Retriever models for extracting content from documents, including text, tables, and visuals.
- NVIDIA Container Runtime enables GPU acceleration within Kubernetes pods.
- Kubernetes operators manage deployment, scaling, and lifecycle of containerized services.
Azure NetApp Files: provides high-performance storage for document repositories, vector embeddings, and model artifacts. Key capabilities include:
- Flexible, Premium, or Ultra storage volumes for active workloads
- Flexible or Standard volumes for less performance-intensive data
- Cross-region or cross-zone replication for high availability
- Snapshot management for versioning
Milvus Vector Database: Milvus is an open-source vector database, accelerated by NVIDIA cuVS, that provides efficient vector storage, similarity search, and metadata management for embeddings. In this architecture, it runs within the AKS cluster to support scalable and high-performance retrieval operations. While this implementation uses default storage configuration, we will show in a future blog post how to configure Milvus to leverage Azure NetApp Files for all storage requirements – using Persistent Volumes (PVs) backed by Azure NetApp Files NFS volumes for Milvus cluster components. This configuration provides significant performance benefits including high-throughput data access, dynamic performance scaling, and enterprise-grade reliability for vector operations.
Azure Networking Components: Azure networking ensures secure, high-throughput communication between services deployed in the architecture. The key components include:
- VNet peering between AKS and Azure NetApp Files
- Network Security Groups for traffic filtering
- Azure Private DNS for name resolution
End-to-End Workflow
Here’s how the end-to-end RAG pipeline workflow plays out:
- Document Ingestion: Enterprise documents are ingested from Azure NetApp Files volumes into the pipeline.
- Document Processing: AKS-hosted microservices process documents, extracting text, tables, and visual elements using NVIDIA NeMo Retriever models.
- Embedding Generation: The extracted content is transformed into high-dimensional vector embeddings using NVIDIA NIM microservices, which are GPU-accelerated for high throughput.
- Vector Storage: These embeddings, along with associated metadata, are stored in the Milvus vector database, where they are indexed for fast similarity-based retrieval.
- Query Processing: When users submit queries, the system first encodes the query into a vector using the same embedding models.
- Retrieval: Milvus performs a similarity search to retrieve the most relevant content chunks. These results may optionally be reranked for improved relevance.
- Response Generation: The retrieved context is injected into a prompt to the LLM, which then generates a coherent, context-aware response grounded in enterprise data.
- Result Delivery: The final response is returned to the user via the application interface, completing the RAG cycle.
This workflow ensures low-latency, accurate responses grounded in your proprietary enterprise data and supports scalable, multimodal processing suitable for enterprise use cases.
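To make the query-side half of this workflow concrete, here is a minimal, illustrative Python sketch of the query processing, retrieval, and response generation steps. It is not the Blueprint’s rag-server implementation; it assumes an OpenAI-compatible NIM embedding endpoint and LLM endpoint have been port-forwarded to localhost, that a Milvus collection named enterprise_docs with a text field exists, and that requests and pymilvus are installed. The URLs, ports, model names, collection name, and field names are all assumptions you would adapt to your deployment:
import requests
from pymilvus import MilvusClient

EMBED_URL = "http://localhost:8000/v1/embeddings"       # assumed NIM embedding endpoint
LLM_URL = "http://localhost:8001/v1/chat/completions"   # assumed NIM LLM endpoint
MILVUS_URI = "http://localhost:19530"

def embed(text: str) -> list[float]:
    # Encode the query with the same embedding model used at ingestion time
    resp = requests.post(EMBED_URL, json={
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",    # assumed model name
        "input": [text],
        "input_type": "query",
    })
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

def answer(question: str) -> str:
    # Query processing: encode the query into a vector
    query_vec = embed(question)

    # Retrieval: similarity search in Milvus for the most relevant chunks
    milvus = MilvusClient(uri=MILVUS_URI)
    hits = milvus.search(
        collection_name="enterprise_docs",               # assumed collection name
        data=[query_vec],
        limit=5,
        output_fields=["text"],
    )
    context = "\n\n".join(h["entity"]["text"] for h in hits[0])

    # Response generation: inject the retrieved context into the LLM prompt
    resp = requests.post(LLM_URL, json={
        "model": "meta/llama-3.1-70b-instruct",          # assumed model name
        "messages": [
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(answer("What does our travel policy say about international flights?"))
In the deployed Blueprint these steps are handled by the rag-server and frontend services; the sketch is only meant to show how the pieces fit together.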
Implementation Guide: Building the Pipeline
Set up your Bash shell
# install az cli
$ curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
#
# install aks-preview extension
$ az extension add --name aks-preview
#
# install kubectl
$ curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
$ sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
#
# install helm
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh
Don’t be concerned about the warning message from the aks-preview extension. The standard aks extension provides the default CUDA version, while aks-preview extension offers the latest CUDA version. That’s why we chose to use the aks-preview extension.
Set up your Azure account
Make sure the region you intend to work in has at least 200 vCores of NCadsH100v5 or 120 vCores of NCADS_A100_v4 quota. In the Azure portal’s Quotas page, click the pencil icon if you need to request more quota.
Set environment variables
$ export RESOURCE_GROUP=rag-rg
$ export CLUSTER_NAME=rag-aks
$ export LOCATION=westus2
$ export SLICING_GPUNP=gpunp-slicing
$ export NOSLICING_GPUNP=gpunp
$ export NGC_API_KEY="your_ngc_api_key"
$ export NVIDIA_API_KEY="nvapi-*"
Create a regular AKS cluster in your desired region.
$ az login --use-device-code
$ az aks create -g $RESOURCE_GROUP -n $CLUSTER_NAME --location $LOCATION
Create 2 GPU nodepools: The RAG Blueprint requires 9 GPUs. We can use time-slicing to reduce the requirement from 9 to 5. However, since not all pods are compatible with time-slicing, we need two GPU node pools: one with time-slicing enabled and one without.
Create 1st GPU nodepool (with time-slicing)
$ az aks nodepool add --resource-group $RESOURCE_GROUP --cluster-name $CLUSTER_NAME --name $SLICING_GPUNP --node-count 1 --skip-gpu-driver-install --node-vm-size Standard_NC24ads_A100_v4 --node-osdisk-size 1024 --max-pods 110
Set up time-slicing
$ cat <<EOF > time-slicing-config-fine.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-fine
data:
  a100-80gb: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 6
EOF
$ kubectl create -n gpu-operator -f time-slicing-config-fine.yaml
$ kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-fine"}}}}'
$ kubectl label node \
    --selector=nvidia.com/gpu.product=NVIDIA-A100-PCIe-80GB \
    nvidia.com/device-plugin.config=a100-80gb
Create 2nd GPU nodepool (no time-slicing)
$ az aks nodepool add --resource-group $RESOURCE_GROUP --cluster-name $CLUSTER_NAME --name $NOSLICING_GPUNP --node-count 2 --skip-gpu-driver-install --node-vm-size Standard_NC48ads_A100_v4 --node-osdisk-size 1024 --max-pods 110
- Install local path storage:
$ kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.26/deploy/local-path-storage.yaml
$ kubectl get pods -n local-path-storage
$ kubectl get storageclass
$ kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
- Install GPU Operator
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
$ helm repo update
$ helm install --create-namespace --namespace gpu-operator nvidia/gpu-operator --wait --generate-name
- Prepare Helm Chart: download and modify for time-slicing.
Download helm chart:
$ helm repo add nvidia-nim https://helm.ngc.nvidia.com/nim/nvidia/ --username='$oauthtoken' --password=$NGC_API_KEY
$ helm repo add nim https://helm.ngc.nvidia.com/nim/ --username='$oauthtoken' --password=$NGC_API_KEY
$ helm repo add nemo-microservices https://helm.ngc.nvidia.com/nvidia/nemo-microservices --username='$oauthtoken' --password=$NGC_API_KEY
$ helm repo add baidu-nim https://helm.ngc.nvidia.com/nim/baidu --username='$oauthtoken' --password=$NGC_API_KEY
$ helm repo update
$ git clone https://gitlab.com/NVIDIA-AI-Blueprints/rag.git
$ cd rag/deploy/helm
$ helm dependency update rag-server/charts/ingestor-server
$ helm dependency update rag-server
Modify helm chart:
# unzip the helm chart for modification
$ cd rag-server/charts
$ rm -rf ingestor-server
$ tar -xvf ingestor-server-v2.0.0.tgz
$ rm ingestor-server-v2.0.0.tgz
$ tar -xvf nim-llm-1.3.0.tgz
$ rm nim-llm-1.3.0.tgz
$ tar -xvf text-reranking-nim-1.3.0.tgz
$ rm text-reranking-nim-1.3.0.tgz
$ tar -xvf nvidia-nim-llama-32-nv-embedqa-1b-v2-1.5.0.tgz
$ rm nvidia-nim-llama-32-nv-embedqa-1b-v2-1.5.0.tgz
# Modify nodeSelector for nv-ingest
$ cd ingestor-server/charts/nv-ingest/charts
#
# for the following files
#
# nvidia-nim-paddleocr/values.yaml
# nvidia-nim-nemoretriever-graphic-elements-v1/values.yaml
# nvidia-nim-nemoretriever-page-elements-v2/values.yaml
# nvidia-nim-nemoretriever-table-structure-v1/values.yaml
# milvus/values.yaml
#
# change
# nodeSelector: {} # likely best to set this to `nvidia.com/gpu.present: "true"` depending on cluster setup
# into
# nodeSelector:
#   nvidia.com/gpu.sharing-strategy: "time-slicing"
$ cd ../../../..
# Modify nodeSelector for frontend
$ kubectl get node
# write down the full node name of GPU nodepool without time slicing
#
# for the following files
#
# nvidia-nim-llama-32-nv-embedqa-1b-v2/values.yaml
# text-reranking-nim/values.yaml
#
# change
# nodeSelector: {} # likely best to set this to `nvidia.com/gpu.present: "true"` depending on cluster setup
# into
# nodeSelector:
#   kubernetes.io/hostname: <full-node-name-of-the-nodepool-without-time-slicing>
#
# for the following files
#
# nim-llm/values.yaml
#
# change
# nodeSelector: {} # likely best to set this to `nvidia.com/gpu.present: "true"` depending on cluster setup
# into
# nodeSelector:
#   kubernetes.io/hostname: <full-node-name-of-the-nodepool-without-time-slicing>
- Deploy Helm Chart:
$ kubectl create namespace rag
$ helm install rag -n rag rag-server/ \
    --set imagePullSecret.password=$NVIDIA_API_KEY \
    --set ngcApiSecret.password=$NVIDIA_API_KEY
The desired result is that all extraction pods land on the same node and share a single GPU, the nim-llm pod uses one node with 2 GPUs, and the remaining two pods in the retrieval pipeline share one node. All pods should be up and running without errors; you can verify this with the commands below.
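For example (node names will differ in your cluster):
$ kubectl get pods -n rag -o wide        # the NODE column shows which nodepool each pod landed on
$ kubectl describe node <gpu-node-name> | grep -A8 "Allocated resources"   # confirm GPU requests per node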
- Launch RAG:
# Port forwarding for UI
$ kubectl port-forward -n rag service/rag-frontend 3000:3000 --address 0.0.0.0
- If you are using port forwarding from your local Ubuntu or Mac machine, open http://localhost:3000 in your browser.
- If you are using port forwarding from WSL, open http://<WSL-IP-address>:3000 in a browser on Windows. You can use ifconfig in WSL to find the IP address.
- Create a work pod: a pod with an Azure NetApp Files NFS volume significantly speeds up batch PDF ingestion.
- Create an Azure NetApp Files NFS volume:
$ az provider register --namespace Microsoft.NetApp --wait
$ az netappfiles account create --resource-group $RESOURCE_GROUP --location $LOCATION --account-name $ANF_ACCOUNT_NAME
$ az netappfiles pool create --resource-group $RESOURCE_GROUP --location $LOCATION --account-name $ANF_ACCOUNT_NAME --pool-name $POOL_NAME --size $SIZE --service-level $SERVICE_LEVEL
$ az network vnet subnet create --resource-group $RESOURCE_GROUP --vnet-name $VNET_NAME --name $SUBNET_NAME --delegations "Microsoft.Netapp/volumes" --address-prefixes $ADDRESS_PREFIX
$ az netappfiles volume create --resource-group $RESOURCE_GROUP --location $LOCATION --account-name $ANF_ACCOUNT_NAME --pool-name $POOL_NAME --name "$VOLUME_NAME" --service-level $SERVICE_LEVEL --vnet $VNET_ID --subnet $SUBNET_ID --usage-threshold $VOLUME_SIZE_GIB --file-path $UNIQUE_FILE_PATH --protocol-types NFSv3
Note: Make sure that your AKS cluster’s virtual network (VNet) and subnet match the volume’s VNet and subnet.
- Create a persistent volume claim for your Azure NetApp Files volume:
$ cat <<EOF > pv-nfs.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nfs
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  mountOptions:
    - vers=3
  nfs:
    server: 10.0.0.4     # Go to the Azure portal to check the mount target of
    path: /myfilepath2   # your Azure NetApp Files volume
EOF
$ kubectl apply -f pv-nfs.yaml
$ cat <<EOF > pvc-nfs.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-nfs
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 100Gi
EOF
$ kubectl apply -f pvc-nfs.yaml
- Create and mount the Azure NetApp Files NFS volume with the work pod:
$ cat <<'EOF' > nv-ingest-test.yaml
kind: Pod
apiVersion: v1
metadata:
  name: nv-ingest-test
spec:
  containers:
    - image: ubuntu
      name: nv-ingest-test
      command:
        - "/bin/sh"
        - "-c"
        - while true; do echo $(date) >> /mnt/azure/outfile; sleep 1; done
      volumeMounts:
        - name: disk01
          mountPath: /mnt/azure
  volumes:
    - name: disk01
      persistentVolumeClaim:
        claimName: pvc-nfs
EOF
$ kubectl apply -f nv-ingest-test.yaml
- Set up the work pod:
$ kubectl exec -it nv-ingest-test -- bash
root@nv-ingest-test:/# apt-get update
root@nv-ingest-test:/# apt-get install python3-pip
root@nv-ingest-test:/# pip install nv-ingest-client --break-system-packages
root@nv-ingest-test:/# apt-get install python3-pypdf2
root@nv-ingest-test:/# cd ~
root@nv-ingest-test:~# cat <<'EOF' > ingestion.py
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This script is used to ingest a corpus of documents into a vector database.

Requirements:
```
pip install -r requirements.txt
python3 -c "import nltk; nltk.download('punkt')"
```

Usage:
    python ingestion.py [options]

Options:
    --nv_ingest_host HOST       Host where nv-ingest-ms-runtime is running (default: localhost)
    --nv_ingest_port PORT       REST port for NV Ingest (default: 7670)
    --milvus_uri URI            Milvus URI for external ingestion (default: http://localhost:19530)
    --minio_endpoint ENDPOINT   MinIO endpoint for external ingestion (default: localhost:9010)
    --collection_name NAME      Name of the collection (default: bo767_test)
    --folder_path PATH          Path to the data files (default: "/path/to/bo767/corpus/")

Example:
    python ingestion.py --nv_ingest_host localhost --nv_ingest_port 7670 --milvus_uri http://localhost:19530 --minio_endpoint localhost:9010 --collection_name bo767_test --folder_path "/path/to/bo767/corpus/"
"""

import os
import time
import argparse

import PyPDF2
from tqdm import tqdm
from nv_ingest_client.client import Ingestor, NvIngestClient


def parse_args():
    """
    Parse the arguments for the ingestion script.
    """
    parser = argparse.ArgumentParser(
        description='Ingest a corpus of documents into a vector database',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog='For more information, see the example usage above.'
    )
    parser.add_argument('--nv_ingest_host', type=str, default="localhost",
                        help='Host where nv-ingest-ms-runtime is running')
    parser.add_argument('--nv_ingest_port', type=int, default=7670,
                        help='REST port for NV Ingest')
    parser.add_argument('--milvus_uri', type=str, default="http://localhost:19530",
                        help='Milvus URI for external ingestion')
    parser.add_argument('--minio_endpoint', type=str, default="localhost:9010",
                        help='MinIO endpoint for external ingestion')
    parser.add_argument('--collection_name', type=str, default="bo767_test",
                        help='Name of the collection')
    parser.add_argument('--folder_path', type=str, default="/path/to/bo767/corpus/",
                        help='Path to the data files')
    parser.add_argument('--skip_vdb_upload', action='store_true',
                        help='Skip the vector database upload')
    return parser.parse_args()


# Parse the arguments
args = parse_args()

# Print the configuration
print("\n" + "="*80)
print("PERFORMING INGESTION WITH THE FOLLOWING CONFIGURATION:")
print("="*80)
print(f"NV-Ingest Host: {args.nv_ingest_host}")
print(f"NV-Ingest Port: {args.nv_ingest_port}")
print(f"Milvus URI: {args.milvus_uri}")
print(f"MinIO Endpoint: {args.minio_endpoint}")
print(f"Collection Name: {args.collection_name}")
print(f"Folder Path: {args.folder_path}")
print("-"*80 + "\n")

# Compute the total number of pages of all the PDFs in the folder
print("\nComputing the total number of pages of all the PDFs in the folder...")
total_pages = 0
pdf_count = 0
for file in tqdm(os.listdir(args.folder_path)):
    try:
        if file.endswith(".pdf") or file.endswith(".txt"):
            pdf_count += 1
            pdf_path = os.path.join(args.folder_path, file)
            with open(pdf_path, "rb") as pdf_file:
                reader = PyPDF2.PdfReader(pdf_file)
                total_pages += len(reader.pages)
    except Exception as e:
        print(f"Error processing {file}: {e}")
print(f"Total PDF files: {pdf_count}")
print(f"Total pages in all PDFs: {total_pages}")

# Server Mode
client = NvIngestClient(
    # host.docker.internal (from inside docker)
    message_client_hostname=args.nv_ingest_host,  # Host where nv-ingest-ms-runtime is running
    message_client_port=args.nv_ingest_port       # REST port, defaults to 7670
)

# Create the ingestor instance
ingestor = Ingestor(client=client)

# Add the files to the ingestor
ingestor = ingestor.files(os.path.join(args.folder_path, "*"))

# Extract the text, tables, charts, images from the files
ingestor = ingestor.extract(
    extract_text=True,
    extract_tables=True,
    extract_charts=True,
    extract_images=False,
    text_depth="page",
    paddle_output_format="markdown"
)

# Split the text into chunks
ingestor = ingestor.split(
    tokenizer="intfloat/e5-large-unsupervised",
    chunk_size=512,
    chunk_overlap=150,
    params={"split_source_types": ["PDF", "text"]}
)

# Embed the chunks
ingestor = ingestor.embed()

# Upload the chunks to the vector database
if not args.skip_vdb_upload:
    print("\nAdding task to upload the chunks to the vector database...")
    ingestor = ingestor.vdb_upload(
        collection_name=args.collection_name,
        milvus_uri=args.milvus_uri,
        minio_endpoint=args.minio_endpoint,
        sparse=False,
        enable_images=False,
        recreate=False,
        dense_dim=2048
    )
else:
    print("\nSkipping vector database upload...")

print("\nStarting ingestion...")

# Ingest the chunks
start = time.time()
# results blob is directly inspectable
results = ingestor.ingest(show_progress=True)
total_ingestion_time = time.time() - start

# Get count of result elements
print("\nCounting the number of result elements...")
result_elements_count = 0
for result in tqdm(results):
    for result_element in result:
        result_elements_count += 1

print("\n" + "="*80)
print("INGESTION PERFORMANCE METRICS:")
print("="*80)
print(f"Total ingestion time: {total_ingestion_time:.2f} seconds")
print(f"Total pages ingested: {total_pages}")
print(f"Pages per second: {total_pages / total_ingestion_time:.2f}")
print(f"Total result files: {len(results)}")
print(f"Total result elements/chunks: {result_elements_count}")
print("-" * 80)
EOF
- Use the work pod to ingest PDF files:
There are two ways to ingest PDF files: using the nv-ingest CLI client or the nv-ingest Python client. With the CLI client, the output is saved to a local output directory, and you’ll need to manually push the results to the vector database. In contrast, the Python client automatically pushes the results to the vector database for you. The following code snippet demonstrates both methods.
# Download dataset.zip from https://www.kaggle.com/datasets/manisha717/dataset-of-pdf-files
$ kubectl get svc -n rag
#
# Write down ip of nv-ingest service
#
$ kubectl cp dataset.zip nv-ingest-test:/tmp
$ kubectl exec -it nv-ingest-test -- bash
root@nv-ingest-test:/# cp /tmp/dataset.zip /mnt/azure
root@nv-ingest-test:/# cd /mnt/azure
root@nv-ingest-test:/mnt/azure# unzip dataset.zip
root@nv-ingest-test:/mnt/azure# cd Pdf
#
# CLI Client method
#
root@nv-ingest-test:/mnt/azure/Pdf# nv-ingest-cli --doc "./*.pdf" --output_directory ./processed_docs --task='extract:{"document_type": "pdf", "extract_text": true, "extract_images": true, "extract_tables": true, "extract_charts": true}' --client_host=<nv-ingest-service-IP> --client_port=7670
# It is normal for some files to fail. The run takes about 30 minutes to finish.
#
# Python Client method
#
$ kubectl get svc -n rag   # run from your workstation to look up the service addresses
# Host where nv-ingest-ms-runtime is running (default: localhost)
# REST port for NV Ingest (default: 7670)
# Milvus URI for external ingestion (default: http://localhost:19530)
# MinIO endpoint for external ingestion (default: localhost:9010)
# Name of the collection (default: bo767_test)
# Path to the data files (default: “/path/to/bo767/corpus/”)
root@nv-ingest-test:~# python3 ingestion.py \
    --nv_ingest_host HOST \
    --nv_ingest_port PORT \
    --milvus_uri URI \
    --minio_endpoint ENDPOINT \
    --collection_name NAME \
    --folder_path PATH
Evaluating Your Enterprise RAG Pipeline
Before scaling a RAG solution across the enterprise, it’s important to know how well it performs. That means going beyond intuition and using a structured evaluation framework that covers accuracy, speed, efficiency, cost, and operational reliability.
In this post, we’ll introduce the key dimensions of that framework. In future blog posts, we’ll go deeper into each one through benchmarking, performance tuning, and real-world testing.
Retrieval Accuracy
Let’s start with the heart of any RAG system: how accurately it retrieves relevant content.
Metrics like Precision@K, Recall@K, Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (nDCG) help quantify both the relevance and the order of results returned. These numbers tell us how well the system finds useful content and whether it ranks it in a way that helps the model generate accurate responses.
We’ll be benchmarking these metrics using real-world, human-annotated datasets across domains like finance, healthcare, and legal to see how different retrieval strategies stack up.
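As a reference point for what these metrics compute, here is a small, self-contained Python sketch of Precision@K, Recall@K, MRR, and nDCG over a toy set of ranked results; it is illustrative only, not the benchmarking harness we will use:
import math

def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k retrieved chunks that are relevant
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    # Fraction of all relevant chunks that appear in the top-k
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

def mrr(ranked, relevant):
    # Reciprocal rank of the first relevant result
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    # Discounted cumulative gain of the top-k, normalized by the ideal ordering
    dcg = sum(1.0 / math.log2(i + 1) for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# Toy example: ranked result IDs from the retriever vs. human-annotated relevant IDs
ranked_ids = ["doc7", "doc2", "doc9", "doc1", "doc4"]
relevant_ids = {"doc2", "doc4", "doc5"}

print(f"Precision@5: {precision_at_k(ranked_ids, relevant_ids, 5):.2f}")   # 0.40
print(f"Recall@5:    {recall_at_k(ranked_ids, relevant_ids, 5):.2f}")      # 0.67
print(f"MRR:         {mrr(ranked_ids, relevant_ids):.2f}")                 # 0.50
print(f"nDCG@5:      {ndcg_at_k(ranked_ids, relevant_ids, 5):.2f}")        # 0.50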
Latency and Throughput
Performance isn’t just about correctness. It’s also about speed, especially for interactive use cases like chat or search.
We’ll measure end-to-end latency, analyze how much time is spent in each stage of the pipeline (embedding, retrieval, generation), and test how well the system handles load. Metrics like query throughput and concurrent user support will help determine how scalable the system is under pressure.
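A simple starting point is to time repeated queries end to end and report percentiles. The sketch below assumes some query_fn callable that runs one query (for example, the answer() function in the earlier sketch); concurrent-load testing needs a proper load generator, which we'll cover when we benchmark the pipeline:
import time
import statistics

def measure_latency(query_fn, questions, warmup=3):
    # Warm up model instances and caches so cold starts don't skew the numbers
    for q in questions[:warmup]:
        query_fn(q)

    latencies = []
    for q in questions:
        start = time.perf_counter()
        query_fn(q)
        latencies.append(time.perf_counter() - start)

    latencies.sort()
    p50 = statistics.median(latencies)
    p95 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.95))]  # simple nearest-rank p95
    qps = len(latencies) / sum(latencies)
    print(f"p50: {p50 * 1000:.0f} ms   p95: {p95 * 1000:.0f} ms   single-stream throughput: {qps:.2f} queries/s")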
GPU Utilization
GPUs are powerful, but they’re also expensive. To keep costs manageable, we need to measure how effectively they’re being used.
We’ll track metrics like average GPU utilization, memory usage, batching efficiency, and multi-GPU scaling performance. The goal is to identify areas where tuning the pipeline can improve performance without overprovisioning hardware.
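One low-effort way to sample these numbers is nvidia-smi's query mode, run on a GPU node or inside any pod that has the NVIDIA driver utilities available; the DCGM exporter deployed by the GPU Operator is the more complete, Prometheus-friendly option:
$ # Sample GPU and memory utilization every 5 seconds in CSV form
$ nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 5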
Cost Analysis
Cost plays a huge role in determining whether a solution is viable in the long run. We’ll look at cost per document processed, cost per query, and compare different storage and compute configurations. This helps make informed trade-offs between performance and cost.
What’s Coming Next
This is just the start. In upcoming posts, we’ll take a closer look at benchmarking Milvus retrieval performance, tuning embedding pipelines, and evaluating domain-specific RAG use cases in sectors like healthcare.
We’ll also explore cost optimization strategies, GPU scaling techniques, and ways to improve user experience with lower latency and more accurate responses.
Our goal is to give you the tools and data you need to build and refine a RAG system that performs well, scales with your needs, and stays within budget.
Enterprise Use Cases and Real-World Applications
The true impact of a RAG system lies in solving specific enterprise problems where traditional tools fall short. Below are examples of how this reference architecture addresses critical pain points across industries.
Enterprise Search
The problem: Employees waste hours searching across disconnected systems, scanning lengthy documents or using ineffective keyword-based search that misses important context.
The solution: A RAG-powered enterprise search system understands natural language queries, retrieves relevant content based on semantic similarity, and presents synthesized answers. It connects disparate data sources and surfaces precise insights in seconds, improving productivity and enabling faster decisions.
Customer Support
The problem: Support agents struggle to find answers hidden in complex documentation, while customers grow frustrated by limited self-service tools and slow response times.
The solution: RAG systems automatically retrieve relevant information from support databases, manuals, and historical tickets. Agents receive real-time recommendations, and customers can access AI-driven self-service portals that provide instant, accurate responses. This shortens resolution times, boosts first-call resolution, and enhances satisfaction.
Regulatory Compliance and Document Processing
The problem: Staying compliant in highly regulated industries requires constant review of policies, regulations, and documentation—an error-prone and time-consuming task.
The solution: A RAG pipeline can extract, analyze, and cross-reference content from regulatory documents, internal policies, and operational logs. It flags gaps, supports policy drafting, and ensures alignment with regulatory changes. This reduces risk, increases auditability, and lowers compliance costs.
Financial Services
The problem: Analysts spend significant time reviewing reports, filings, and market data scattered across siloed sources, delaying insights and increasing operational costs.
The solution: RAG pipelines accelerate investment research, automate due diligence, and provide risk insights by pulling from structured and unstructured financial documents. These systems support personalized wealth management and fraud detection workflows with greater efficiency and depth.
Healthcare
The problem: Physicians and researchers are overwhelmed by vast volumes of medical literature, patient records, and guidelines, making it difficult to access timely, relevant information.
The solution: RAG systems summarize patient histories, surface clinical guidelines, and retrieve medical literature that supports diagnosis and treatment. They also enhance research by synthesizing findings across studies. This reduces medical errors, improves outcomes, and accelerates innovation.
Legal and Professional Services
The problem: Legal teams deal with thousands of pages of case law, contracts, and compliance documents, requiring extensive time and effort to extract relevant details.
The solution: A RAG pipeline helps identify key clauses, precedents, and risks by analyzing legal documents semantically. It accelerates due diligence, reduces manual review, and enhances decision-making with contextual insights.
Wrapping Up
If you’re planning to build a Retrieval-Augmented Generation pipeline at enterprise scale, combining Microsoft Azure, NVIDIA’s RAG Blueprint, and Azure NetApp Files gives you a production-ready framework. NVIDIA delivers accelerated AI through its RAG Blueprint running in Azure, and Azure NetApp Files provides high-performance, enterprise-grade storage. This setup solves key challenges like extracting insights from multimodal enterprise content, delivering low-latency responses, and maintaining enterprise-grade security and compliance.
It connects powerful large language models to your company’s proprietary data and gives you the scale, performance, and reliability needed for production workloads.
Here’s what this solution delivers:
- It scales with your data and users
- It’s fast, thanks to accelerated computing and high-performance storage
- It’s flexible enough to handle different types of content
- It helps control costs by using the right resources at the right time
- And it’s built with security and compliance in mind
As generative AI becomes more mainstream, RAG is emerging as a practical way to get real value from it, especially when you need answers based on your own knowledge rather than only what the model was trained on.
So if you’re exploring how to bring AI into your business without giving up control over your data or your costs, this architecture gives you a solid, proven path forward.