Table of Contents
Enterprise RAG: Challenges and Requirements
Adapting the Blueprint for Azure Deployment
Azure NetApp Files: Powering High-Performance RAG Workloads
Why Azure NetApp Files works well for RAG
Service Levels for RAG Workloads
Dynamic Service Level Adjustments
Snapshot Capabilities for ML Versioning
Azure Reference Architecture for Enterprise RAG
Implementation Guide: Building the Pipeline
Evaluating Your Enterprise RAG Pipeline
Enterprise Use Cases and Real-World Applications
Regulatory Compliance and Document Processing
Legal and Professional Services
Abstract
This blog post shows how to build an enterprise-ready Retrieval-Augmented Generation (RAG) pipeline through a collaboration between NetApp, NVIDIA, and Microsoft. It provides a step-by-step guide and reference architecture for handling large-scale, multimodal enterprise content with fast, accurate responses powered by NVIDIA’s AI Blueprint for RAG and high-performance Azure NetApp Files storage. This architecture supports diverse applications including enterprise search, customer support, and compliance tools, offering a scalable and secure foundation for production generative AI implementations.
Co-authors:
- Rajeev Chawla, Sr. Director Product Management, Azure NetApp Files
- Joseph Wu, Senior Solutions Architect, NVIDIA
- Alexander Zeltov, Senior Solutions Architect, NVIDIA
- Abhishek Sawarkar, Product Manager, NVIDIA
- Kyle Radder, Technical Marketing Engineer, Azure NetApp Files
- Asutosh Panda, Technical Marketing Engineer, Azure NetApp Files
- Larry Kuhn, Director, Partner Technology Strategist, Microsoft
- Raj Nemani, Director, Partner Technology Strategist, Microsoft
- Rajesh Vasireddy, Director, Partner Technology Strategist, Microsoft
Introduction
In today’s data-driven world, enterprises are sitting on a goldmine of information locked inside millions of documents, diagrams, and other formats. The challenge is turning that content into actionable insight. Retrieval-Augmented Generation (RAG) offers a powerful way to do that by combining large language models (LLMs) with the ability to pull context directly from your own enterprise data.
In this blog post, we’ll walk through a reference architecture for deploying RAG at enterprise scale using Microsoft Azure, NVIDIA’s AI Blueprint for RAG, and Azure NetApp Files. This setup supports production workloads involving multimodal content and high volumes of data. It’s built to scale, perform, and meet enterprise-grade requirements for compliance, security, and governance.
Enterprise RAG: Challenges and Requirements
Scaling RAG across an enterprise means facing some real challenges. You’re dealing with massive volumes of content, tight latency requirements, and the need for accurate results—all without compromising compliance or performance. To make RAG work at enterprise scale, you’ll need to solve challenges like these:
- Handling multimodal content: Enterprise data isn’t just plain text. It includes images, tables, diagrams, and complex document layouts. Your RAG system needs to handle all of it.
- Managing large document volumes: Many organizations are working with millions of files. You need a pipeline that can scale ingestion, processing, and storage without breaking down.
- Keeping latency low: For applications like chatbots and interactive search tools, users expect near-instant answers. That means fast retrieval, efficient search, and minimal delay.
- Delivering relevant answers: An LLM is only as good as the context it receives. You need strong embedding models, smart chunking, and sometimes reranking to ensure relevance.
- Data Access Control and Permission Management: Users should only receive results from documents they are authorized to access. This requires capturing file system permissions during ingestion from Azure NetApp Files and filtering search results based on user identity and authorization levels. We’ll explore strategies for integrating file system ACLs and permissions with vector retrieval in an upcoming blog post focused on secure enterprise RAG implementations.
- Meeting enterprise standards: RAG systems must also support security, compliance, availability, and cost-efficiency.
NVIDIA AI Blueprint for RAG
NVIDIA’s AI Blueprint for RAG provides a flexible and powerful foundation for building high-performance RAG pipelines.
It includes:
NeMo Retriever Models: Pretrained models for extracting text, tables, and visual elements from PDFs and other complex formats:
- NeMo Retriever for page elements extraction
- NeMo Retriever for table structure recognition
- NeMo Retriever for graphic elements detection
NVIDIA NIM Microservices: Prebuilt, highly optimized inference microservices for deploying embedding models, re-rankers, and LLMs on NVIDIA GPUs.
GPU Acceleration for Vector Operations: The Blueprint leverages NVIDIA GPUs to accelerate operations, including:
- Vector embedding generation
- Similarity search computations
- Document processing tasks
This acceleration delivers the performance needed for enterprise-scale implementations, significantly reducing processing time and improving query response times.
Adapting the Blueprint for Azure Deployment
To run this architecture on Azure, you map components as follows:
- Infrastructure Mapping: Map the Blueprint components to appropriate Azure services:
  - Azure Kubernetes Service (AKS) for container orchestration
  - Azure NetApp Files for high-performance storage
- Networking Configuration: Set up proper networking with VNet peering between AKS and Azure NetApp Files to ensure high-throughput data access (see the peering sketch below).
- Resource Sizing: Configure appropriate GPU-enabled virtual machines and storage tiers to meet performance requirements.
This gives you the power of NVIDIA’s reference stack with the manageability and scale of Azure.
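If your AKS cluster and the subnet delegated to Azure NetApp Files live in different virtual networks, the peering mentioned above can be created with the Azure CLI. The following is a minimal sketch only; the resource group, VNet names, and VNet IDs ($AKS_VNET_NAME, $ANF_VNET_NAME, $AKS_VNET_ID, $ANF_VNET_ID) are placeholders you would substitute for your environment:
$ # Peer the AKS VNet to the VNet hosting the Azure NetApp Files delegated subnet, and vice versa
$ az network vnet peering create --resource-group $RESOURCE_GROUP \
    --name aks-to-anf --vnet-name $AKS_VNET_NAME \
    --remote-vnet $ANF_VNET_ID --allow-vnet-access
$ az network vnet peering create --resource-group $RESOURCE_GROUP \
    --name anf-to-aks --vnet-name $ANF_VNET_NAME \
    --remote-vnet $AKS_VNET_ID --allow-vnet-access
Peering must be created in both directions before NFS traffic can flow between the cluster nodes and the volume. If both resources share a single VNet, no peering is needed.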
Azure NetApp Files: Powering High-Performance RAG Workloads
Azure NetApp Files is a fully managed, enterprise-grade file storage service that brings NetApp’s proven data management capabilities to the Azure ecosystem. It delivers the performance, reliability, and manageability you need for enterprise-grade RAG pipelines.
Why Azure NetApp Files works well for RAG
Azure NetApp Files provides a robust foundation for enterprise RAG workloads through its comprehensive feature set:
- Enterprise-grade reliability with 99.99% availability SLA
- Seamless integration with Azure services and existing applications
- Comprehensive security with encryption at rest and in transit
- Simplified management through the Azure portal and APIs
These capabilities ensure that RAG pipelines built on Azure NetApp Files meet enterprise requirements for reliability, security, and manageability.
Service Levels for RAG Workloads
Azure NetApp Files offers multiple service levels that can be aligned with different phases of the RAG pipeline:
- Standard: Delivers 16 MiB/s per terabyte of allocated capacity, suitable for initial data ingestion and document storage.
- Premium: Provides 64 MiB/s per terabyte, ideal for active document processing and embedding generation.
- Ultra: Offers 128 MiB/s per terabyte, ensuring maximum performance for high-throughput retrieval operations.
- Flexible: Allows independent provisioning of capacity and throughput for custom capacity/performance workload profiles.
This approach allows organizations to match storage performance to workload requirements, optimizing both performance and cost.
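For example, a 4 TiB volume at the Premium level is entitled to roughly 4 × 64 = 256 MiB/s of throughput, while the same capacity at Ultra provides 4 × 128 = 512 MiB/s; with the Flexible service level, throughput is provisioned independently of the 4 TiB capacity.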
Dynamic Service Level Adjustments
One of the most powerful features of Azure NetApp Files for RAG workloads is the ability to change service levels dynamically without data migration:
- Volumes can be moved between tiers without moving data
- No application reconfiguration required
- No disruption to data access
- Changes take effect immediately
This capability enables organizations to adapt storage performance to changing workload demands, such as scaling up during intensive embedding generation phases and scaling down for steady-state operations.
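As an illustration, a service level change is performed by moving the volume to a capacity pool of the target tier; one way to do this with the Azure CLI is sketched below (the variables, including $ULTRA_POOL_ID, are placeholders for your own resources):
$ # Move the volume to a capacity pool at the Ultra service level before a heavy embedding run
$ az netappfiles volume pool-change --resource-group $RESOURCE_GROUP \
    --account-name $ANF_ACCOUNT_NAME --pool-name $POOL_NAME \
    --name $VOLUME_NAME --new-pool-resource-id $ULTRA_POOL_ID
The data stays in place; only the volume’s service level, and therefore its throughput entitlement, changes.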
Snapshot Capabilities for ML Versioning
Azure NetApp Files provides built-in snapshot functionality that supports critical versioning needs in ML pipelines:
- Create instant, space-efficient snapshots of data volumes
- Capture point-in-time versions of datasets, embeddings, and models
- Enable rapid rollback to previous states for reproducibility
- Support A/B testing of different model versions
- Integrate with orchestration frameworks through APIs
These capabilities ensure that the data and model states are properly versioned, supporting reproducibility and auditability requirements in enterprise AI systems.
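As a small illustration, capturing a point-in-time version of an embeddings volume is a single CLI call; a sketch, assuming the account, pool, and volume variables used later in this post:
$ # Snapshot the volume holding the current embedding set before re-indexing
$ az netappfiles snapshot create --resource-group $RESOURCE_GROUP \
    --account-name $ANF_ACCOUNT_NAME --pool-name $POOL_NAME \
    --volume-name $VOLUME_NAME --name embeddings-v1 --location $LOCATION
Snapshots are space-efficient and can later be restored or cloned to a new volume for rollback or A/B testing.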
Cost Optimization Strategies
Azure NetApp Files offers several approaches to optimize storage costs in RAG implementations:
- Right-sizing volumes based on actual data requirements
- Leveraging appropriate storage tiers for different workload phases
- Using cool access tiering for infrequently accessed data
- Employing space-efficient snapshots instead of full copies
- Utilizing reserved capacity options for predictable workloads
By implementing these strategies, organizations can achieve significant cost savings while maintaining the performance needed for effective RAG pipelines.
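For instance, right-sizing an over-provisioned volume is a single CLI call; a minimal sketch, assuming the variables used elsewhere in this post:
$ # Shrink the volume quota to 2 TiB (2048 GiB) once the bulk ingestion phase is complete
$ az netappfiles volume update --resource-group $RESOURCE_GROUP \
    --account-name $ANF_ACCOUNT_NAME --pool-name $POOL_NAME \
    --name $VOLUME_NAME --usage-threshold 2048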
Azure Reference Architecture for Enterprise RAG
Let’s break down the main building blocks of the architecture and how they work together.
Azure Kubernetes Service (AKS) with NVIDIA GPU Nodes: AKS provides the orchestration layer for running containerized AI services. It supports GPU acceleration and scales to meet dynamic workloads. The AKS environment includes the following key elements:
- System nodes handle pipeline orchestration, logging, and operational control.
- GPU-enabled worker nodes (e.g., NC-series VMs with NVIDIA H100 or A100) run AI model workloads:
  - NVIDIA NIM for embedding generation and LLM-based response generation.
  - NVIDIA NeMo Retriever models for extracting content from documents, including text, tables, and visuals.
- NVIDIA Container Runtime enables GPU acceleration within Kubernetes pods.
- Kubernetes operators manage deployment, scaling, and lifecycle of containerized services.
Azure NetApp Files: provides high-performance storage for document repositories, vector embeddings, and model artifacts. Key capabilities include:
- Flexible, Premium, or Ultra storage volumes for active workloads
- Flexible or Standard volumes for less performance-intensive data
- Cross-region or cross-zone replication for high availability
- Snapshot management for versioning
Milvus Vector Database: Milvus is an open-source vector database, accelerated by NVIDIA cuVS, that provides efficient vector storage, similarity search, and metadata management for embeddings. In this architecture, it runs within the AKS cluster to support scalable and high-performance retrieval operations. While this implementation uses default storage configuration, we will show in a future blog post how to configure Milvus to leverage Azure NetApp Files for all storage requirements – using Persistent Volumes (PVs) backed by Azure NetApp Files NFS volumes for Milvus cluster components. This configuration provides significant performance benefits including high-throughput data access, dynamic performance scaling, and enterprise-grade reliability for vector operations.
Azure Networking Components: Azure networking ensures secure, high-throughput communication between services deployed in the architecture. The key components include:
- VNet peering between AKS and Azure NetApp Files
- Network Security Groups for traffic filtering
- Azure Private DNS for name resolution
End-to-End Workflow
Here’s how the end-to-end RAG pipeline workflow plays out:
- Document Ingestion: Enterprise documents are ingested from Azure NetApp Files volumes into the pipeline.
- Document Processing: AKS-hosted microservices process documents, extracting text, tables, and visual elements using NVIDIA NeMo Retriever models.
- Embedding Generation: The extracted content is transformed into high-dimensional vector embeddings using NVIDIA NIM microservices, which are GPU-accelerated for high throughput.
- Vector Storage: These embeddings, along with associated metadata, are stored in the Milvus vector database, where they are indexed for fast similarity-based retrieval.
- Query Processing: When users submit queries, the system first encodes the query into a vector using the same embedding models.
- Retrieval: Milvus performs a similarity search to retrieve the most relevant content chunks. These results may optionally be reranked for improved relevance.
- Response Generation: The retrieved context is injected into a prompt to the LLM, which then generates a coherent, context-aware response grounded in enterprise data.
- Result Delivery: The final response is returned to the user via the application interface, completing the RAG cycle.
This workflow ensures low-latency, accurate responses grounded in your proprietary enterprise data and supports scalable, multimodal processing suitable for enterprise use cases.
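To make the query-side half of this workflow concrete, here is a minimal, illustrative Python sketch of the query processing, retrieval, and response generation steps. It is not the Blueprint’s rag-server implementation; it assumes an OpenAI-compatible NIM embedding endpoint and LLM endpoint have been port-forwarded to localhost, that a Milvus collection named enterprise_docs with a text field exists, and that requests and pymilvus are installed. The URLs, ports, model names, collection name, and field names are all assumptions you would adapt to your deployment:
import requests
from pymilvus import MilvusClient

EMBED_URL = "http://localhost:8000/v1/embeddings"       # assumed NIM embedding endpoint
LLM_URL = "http://localhost:8001/v1/chat/completions"   # assumed NIM LLM endpoint
MILVUS_URI = "http://localhost:19530"

def embed(text: str) -> list[float]:
    # Encode the query with the same embedding model used at ingestion time
    resp = requests.post(EMBED_URL, json={
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",    # assumed model name
        "input": [text],
        "input_type": "query",
    })
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

def answer(question: str) -> str:
    # Query processing: encode the query into a vector
    query_vec = embed(question)

    # Retrieval: similarity search in Milvus for the most relevant chunks
    milvus = MilvusClient(uri=MILVUS_URI)
    hits = milvus.search(
        collection_name="enterprise_docs",               # assumed collection name
        data=[query_vec],
        limit=5,
        output_fields=["text"],
    )
    context = "\n\n".join(h["entity"]["text"] for h in hits[0])

    # Response generation: inject the retrieved context into the LLM prompt
    resp = requests.post(LLM_URL, json={
        "model": "meta/llama-3.1-70b-instruct",          # assumed model name
        "messages": [
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(answer("What does our travel policy say about international flights?"))
In the deployed Blueprint these steps are handled by the rag-server and frontend services; the sketch is only meant to show how the pieces fit together.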
Implementation Guide: Building the Pipeline
Set up your Bash shell
# install az cli
$ curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
#
# install aks-preview extension
$ az extension add --name aks-preview
#
# install kubectl
$ curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
$ sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
#
# install helm
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh
Don’t be concerned about the warning message from the aks-preview extension. The standard aks extension provides the default CUDA version, while aks-preview extension offers the latest CUDA version. That’s why we chose to use the aks-preview extension.
Set up your Azure account
Make sure the region you intend to work in has at least 200 vCores of NCadsH100v5 or 120 vCores of NCADS_A100_v4 quota. In the Azure portal’s Quotas page, click the pencil icon if you need to request more quota.
Set environment variables
$ export RESOURCE_GROUP=rag-rg
$ export CLUSTER_NAME=rag-aks
$ export LOCATION=westus2
$ export SLICING_GPUNP=gpunp-slicing
$ export NOSLICING_GPUNP=gpunp
$ export NGC_API_KEY="your_ngc_api_key"
$ export NVIDIA_API_KEY="nvapi-*"
Create a regular AKS cluster in your desired region.
$ az login --use-device-code
$ az aks create -g $RESOURCE_GROUP -n $CLUSTER_NAME --location $LOCATION
Create 2 GPU nodepools: The RAG Blueprint requires 9 GPUs. We can use time-slicing to reduce the requirement from 9 to 5. However, since not all pods are compatible with time-slicing, we need two GPU node pools: one with time-slicing enabled and one without.
Create 1st GPU nodepool (with time-slicing)
$ az aks nodepool add --resource-group $RESOURCE_GROUP --cluster-name $CLUSTER_NAME --name $SLICING_GPUNP --node-count 1 --skip-gpu-driver-install --node-vm-size Standard_NC24ads_A100_v4 --node-osdisk-size 1024 --max-pods 110
Set up time-slicing
$ cat <<EOF > time-slicing-config-fine.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-fine
data:
  a100-80gb: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 6
EOF
$ kubectl create -n gpu-operator -f time-slicing-config-fine.yaml
$ kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-fine"}}}}'
$ kubectl label node \
    --selector=nvidia.com/gpu.product=NVIDIA-A100-PCIe-80GB \
    nvidia.com/device-plugin.config=a100-80gb
Create 2nd GPU nodepool (no time-slicing)
$ az aks nodepool add --resource-group $RESOURCE_GROUP --cluster-name $CLUSTER_NAME --name $NOSLICING_GPUNP --node-count 2 --skip-gpu-driver-install --node-vm-size Standard_NC48ads_A100_v4 --node-osdisk-size 1024 --max-pods 110
- Install local path storage:
$ kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.26/deploy/local-path-storage.yaml
$ kubectl get pods -n local-path-storage
$ kubectl get storageclass
$ kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
- Install GPU Operator
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
$ helm repo update
$ helm install --create-namespace --namespace gpu-operator nvidia/gpu-operator --wait --generate-name
- Prepare Helm Chart: download and modify for time-slicing.
Download helm chart:
$ helm repo add nvidia-nim https://helm.ngc.nvidia.com/nim/nvidia/ --username='$oauthtoken' --password=$NGC_API_KEY
$ helm repo add nim https://helm.ngc.nvidia.com/nim/ --username='$oauthtoken' --password=$NGC_API_KEY
$ helm repo add nemo-microservices https://helm.ngc.nvidia.com/nvidia/nemo-microservices --username='$oauthtoken' --password=$NGC_API_KEY
$ helm repo add baidu-nim https://helm.ngc.nvidia.com/nim/baidu --username='$oauthtoken' --password=$NGC_API_KEY
$ helm repo update
$ git clone https://gitlab.com/NVIDIA-AI-Blueprints/rag.git
$ cd rag/deploy/helm
$ helm dependency update rag-server/charts/ingestor-server
$ helm dependency update rag-server
Modify helm chart:
# unzip the helm chart for modification
$ cd rag-server/charts
$ rm -rf ingestor-server
$ tar -xvf ingestor-server-v2.0.0.tgz
$ rm ingestor-server-v2.0.0.tgz
$ tar -xvf nim-llm-1.3.0.tgz
$ rm nim-llm-1.3.0.tgz
$ tar -xvf text-reranking-nim-1.3.0.tgz
$ rm text-reranking-nim-1.3.0.tgz
$ tar -xvf nvidia-nim-llama-32-nv-embedqa-1b-v2-1.5.0.tgz
$ rm nvidia-nim-llama-32-nv-embedqa-1b-v2-1.5.0.tgz
# Modify nodeSelector for nv-ingest
$ cd ingestor-server/charts/nv-ingest/charts
#
# for the following files
#
# nvidia-nim-paddleocr/values.yaml
# nvidia-nim-nemoretriever-graphic-elements-v1/values.yaml
# nvidia-nim-nemoretriever-page-elements-v2/values.yaml
# nvidia-nim-nemoretriever-table-structure-v1/values.yaml
# milvus/values.yaml
#
# change
# nodeSelector: {} # likely best to set this to `nvidia.com/gpu.present: "true"` depending on cluster setup
# into
# nodeSelector:
#   nvidia.com/gpu.sharing-strategy: "time-slicing"
$ cd ../../../..
# Modify nodeSelector for frontend
$ kubectl get node
# write down the full node name of GPU nodepool without time slicing
#
# for the following files
#
# nvidia-nim-llama-32-nv-embedqa-1b-v2/values.yaml
# text-reranking-nim/values.yaml
#
# change
# nodeSelector: {} # likely best to set this to `nvidia.com/gpu.present: "true"` depending on cluster setup
# into
# nodeSelector:
#   kubernetes.io/hostname: <full-node-name-of-the-nodepool-without-time-slicing>
#
# for the following files
#
# nim-llm/values.yaml
#
# change
# nodeSelector: {} # likely best to set this to `nvidia.com/gpu.present: "true"` depending on cluster setup
# into
# nodeSelector:
#   kubernetes.io/hostname: <full-node-name-of-the-nodepool-without-time-slicing>
- Deploy Helm Chart:
$ kubectl create namespace rag
$ helm install rag -n rag rag-server/ \
    --set imagePullSecret.password=$NVIDIA_API_KEY \
    --set ngcApiSecret.password=$NVIDIA_API_KEY
The desired result is that all extraction pods land on the same node and share a single GPU, the nim-llm pod uses one node with 2 GPUs, and the remaining two pods in the retrieval pipeline share one node. All pods should be up and running without errors; you can verify this with the commands below.
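For example (node names will differ in your cluster):
$ kubectl get pods -n rag -o wide        # the NODE column shows which nodepool each pod landed on
$ kubectl describe node <gpu-node-name> | grep -A8 "Allocated resources"   # confirm GPU requests per node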
- Launch RAG:
# Port forwarding for UI
$ kubectl port-forward -n rag service/rag-frontend 3000:3000 --address 0.0.0.0
- If you are using port forwarding from your local Ubuntu or Mac machine, open http://localhost:3000 in your browser.
- If you are using port forwarding from WSL, open http://<WSL-IP-address>:3000 in a browser on Windows. You can use ifconfig in WSL to find the IP address.
- Create a work pod: a pod with an Azure NetApp Files NFS volume significantly speeds up batch PDF ingestion.
- Create an Azure NetApp Files NFS volume:
$ az provider register --namespace Microsoft.NetApp --wait
$ az netappfiles account create --resource-group $RESOURCE_GROUP --location $LOCATION --account-name $ANF_ACCOUNT_NAME
$ az netappfiles pool create --resource-group $RESOURCE_GROUP --location $LOCATION --account-name $ANF_ACCOUNT_NAME --pool-name $POOL_NAME --size $SIZE --service-level $SERVICE_LEVEL
$ az network vnet subnet create --resource-group $RESOURCE_GROUP --vnet-name $VNET_NAME --name $SUBNET_NAME --delegations "Microsoft.Netapp/volumes" --address-prefixes $ADDRESS_PREFIX
$ az netappfiles volume create --resource-group $RESOURCE_GROUP --location $LOCATION --account-name $ANF_ACCOUNT_NAME --pool-name $POOL_NAME --name "$VOLUME_NAME" --service-level $SERVICE_LEVEL --vnet $VNET_ID --subnet $SUBNET_ID --usage-threshold $VOLUME_SIZE_GIB --file-path $UNIQUE_FILE_PATH --protocol-types NFSv3
Note: Make sure that your AKS cluster’s virtual network (VNet) and subnet match the volume’s VNet and subnet.
- Create a persistent volume claim for your Azure NetApp Files volume:
$ cat <<EOF > pv-nfs.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nfs
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  mountOptions:
    - vers=3
  nfs:
    server: 10.0.0.4     # Go to the Azure portal to check the mount target of
    path: /myfilepath2   # your Azure NetApp Files volume
EOF
$ kubectl apply -f pv-nfs.yaml
$ cat <<EOF > pvc-nfs.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-nfs
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 100Gi
EOF
$ kubectl apply -f pvc-nfs.yaml
- Create and mount the Azure NetApp Files NFS volume with the work pod:
$ cat <<'EOF' > nv-ingest-test.yaml
kind: Pod
apiVersion: v1
metadata:
  name: nv-ingest-test
spec:
  containers:
    - image: ubuntu
      name: nv-ingest-test
      command:
        - "/bin/sh"
        - "-c"
        - while true; do echo $(date) >> /mnt/azure/outfile; sleep 1; done
      volumeMounts:
        - name: disk01
          mountPath: /mnt/azure
  volumes:
    - name: disk01
      persistentVolumeClaim:
        claimName: pvc-nfs
EOF
$ kubectl apply -f nv-ingest-test.yaml
- Set up the work pod:
$ kubectl exec -it nv-ingest-test -- bash
root@nv-ingest-test:/# apt-get update
root@nv-ingest-test:/# apt-get install python3-pip
root@nv-ingest-test:/# pip install nv-ingest-client --break-system-packages
root@nv-ingest-test:/# apt-get install python3-pypdf2
root@nv-ingest-test:/# cd ~
root@nv-ingest-test:~# cat <<'EOF' > ingestion.py
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This script is used to ingest a corpus of documents into a vector database.

Requirements:
```
pip install -r requirements.txt
python3 -c "import nltk; nltk.download('punkt')"
```

Usage:
    python ingestion.py [options]

Options:
    --nv_ingest_host HOST       Host where nv-ingest-ms-runtime is running (default: localhost)
    --nv_ingest_port PORT       REST port for NV Ingest (default: 7670)
    --milvus_uri URI            Milvus URI for external ingestion (default: http://localhost:19530)
    --minio_endpoint ENDPOINT   MinIO endpoint for external ingestion (default: localhost:9010)
    --collection_name NAME      Name of the collection (default: bo767_test)
    --folder_path PATH          Path to the data files (default: "/path/to/bo767/corpus/")

Example:
    python ingestion.py --nv_ingest_host localhost --nv_ingest_port 7670 --milvus_uri http://localhost:19530 --minio_endpoint localhost:9010 --collection_name bo767_test --folder_path "/path/to/bo767/corpus/"
"""

import os
import time
import argparse

import PyPDF2
from tqdm import tqdm
from nv_ingest_client.client import Ingestor, NvIngestClient


def parse_args():
    """
    Parse the arguments for the ingestion script.
    """
    parser = argparse.ArgumentParser(
        description='Ingest a corpus of documents into a vector database',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog='For more information, see the example usage above.'
    )
    parser.add_argument('--nv_ingest_host', type=str, default="localhost",
                        help='Host where nv-ingest-ms-runtime is running')
    parser.add_argument('--nv_ingest_port', type=int, default=7670,
                        help='REST port for NV Ingest')
    parser.add_argument('--milvus_uri', type=str, default="http://localhost:19530",
                        help='Milvus URI for external ingestion')
    parser.add_argument('--minio_endpoint', type=str, default="localhost:9010",
                        help='MinIO endpoint for external ingestion')
    parser.add_argument('--collection_name', type=str, default="bo767_test",
                        help='Name of the collection')
    parser.add_argument('--folder_path', type=str, default="/path/to/bo767/corpus/",
                        help='Path to the data files')
    parser.add_argument('--skip_vdb_upload', action='store_true',
                        help='Skip the vector database upload')
    return parser.parse_args()


# Parse the arguments
args = parse_args()

# Print the configuration
print("\n" + "="*80)
print("PERFORMING INGESTION WITH THE FOLLOWING CONFIGURATION:")
print("="*80)
print(f"NV-Ingest Host: {args.nv_ingest_host}")
print(f"NV-Ingest Port: {args.nv_ingest_port}")
print(f"Milvus URI: {args.milvus_uri}")
print(f"MinIO Endpoint: {args.minio_endpoint}")
print(f"Collection Name: {args.collection_name}")
print(f"Folder Path: {args.folder_path}")
print("-"*80 + "\n")

# Compute the total number of pages of all the PDFs in the folder
print("\nComputing the total number of pages of all the PDFs in the folder...")
total_pages = 0
pdf_count = 0
for file in tqdm(os.listdir(args.folder_path)):
    try:
        if file.endswith(".pdf") or file.endswith(".txt"):
            pdf_count += 1
            pdf_path = os.path.join(args.folder_path, file)
            with open(pdf_path, "rb") as pdf_file:
                reader = PyPDF2.PdfReader(pdf_file)
                total_pages += len(reader.pages)
    except Exception as e:
        print(f"Error processing {file}: {e}")
print(f"Total PDF files: {pdf_count}")
print(f"Total pages in all PDFs: {total_pages}")

# Server Mode
client = NvIngestClient(
    # host.docker.internal (from inside docker)
    message_client_hostname=args.nv_ingest_host,  # Host where nv-ingest-ms-runtime is running
    message_client_port=args.nv_ingest_port       # REST port, defaults to 7670
)

# Create the ingestor instance
ingestor = Ingestor(client=client)

# Add the files to the ingestor
ingestor = ingestor.files(os.path.join(args.folder_path, "*"))

# Extract the text, tables, charts, images from the files
ingestor = ingestor.extract(
    extract_text=True,
    extract_tables=True,
    extract_charts=True,
    extract_images=False,
    text_depth="page",
    paddle_output_format="markdown"
)

# Split the text into chunks
ingestor = ingestor.split(
    tokenizer="intfloat/e5-large-unsupervised",
    chunk_size=512,
    chunk_overlap=150,
    params={"split_source_types": ["PDF", "text"]}
)

# Embed the chunks
ingestor = ingestor.embed()

# Upload the chunks to the vector database
if not args.skip_vdb_upload:
    print("\nAdding task to upload the chunks to the vector database...")
    ingestor = ingestor.vdb_upload(
        collection_name=args.collection_name,
        milvus_uri=args.milvus_uri,
        minio_endpoint=args.minio_endpoint,
        sparse=False,
        enable_images=False,
        recreate=False,
        dense_dim=2048
    )
else:
    print("\nSkipping vector database upload...")

print("\nStarting ingestion...")

# Ingest the chunks
start = time.time()
# results blob is directly inspectable
results = ingestor.ingest(show_progress=True)
total_ingestion_time = time.time() - start

# Get count of result elements
print("\nCounting the number of result elements...")
result_elements_count = 0
for result in tqdm(results):
    for result_element in result:
        result_elements_count += 1

print("\n" + "="*80)
print("INGESTION PERFORMANCE METRICS:")
print("="*80)
print(f"Total ingestion time: {total_ingestion_time:.2f} seconds")
print(f"Total pages ingested: {total_pages}")
print(f"Pages per second: {total_pages / total_ingestion_time:.2f}")
print(f"Total result files: {len(results)}")
print(f"Total result elements/chunks: {result_elements_count}")
print("-" * 80)
EOF
- Use the work pod to ingest PDF files:
There are two ways to ingest PDF files: using the nv-ingest CLI client or the nv-ingest Python client. With the CLI client, the output is saved to a local output directory, and you’ll need to manually push the results to the vector database. In contrast, the Python client automatically pushes the results to the vector database for you. The following code snippet demonstrates both methods.
# Download dataset.zip from https://www.kaggle.com/datasets/manisha717/dataset-of-pdf-files
$ kubectl get svc -n rag
#
# Write down ip of nv-ingest service
#
$ kubectl cp dataset.zip nv-ingest-test:/tmp
$ kubectl exec -it nv-ingest-test -- bash
root@nv-ingest-test:/# cp /tmp/dataset.zip /mnt/azure
root@nv-ingest-test:/# cd /mnt/azure
root@nv-ingest-test:/mnt/azure# unzip dataset.zip
root@nv-ingest-test:/mnt/azure# cd Pdf
#
# CLI Client method
#
root@nv-ingest-test:/mnt/azure/Pdf# nv-ingest-cli --doc "./*.pdf" --output_directory ./processed_docs --task='extract:{"document_type": "pdf", "extract_text": true, "extract_images": true, "extract_tables": true, "extract_charts": true}' --client_host=<nv-ingest-service-IP> --client_port=7670
# It is normal for some files to fail. The run takes about 30 minutes to finish.
#
# Python Client method
#
$ kubectl get svc -n rag   # run from your workstation to look up the service addresses
# Host where nv-ingest-ms-runtime is running (default: localhost)
# REST port for NV Ingest (default: 7670)
# Milvus URI for external ingestion (default: http://localhost:19530)
# MinIO endpoint for external ingestion (default: localhost:9010)
# Name of the collection (default: bo767_test)
# Path to the data files (default: “/path/to/bo767/corpus/”)
root@nv-ingest-test:~# python3 ingestion.py \
    --nv_ingest_host HOST \
    --nv_ingest_port PORT \
    --milvus_uri URI \
    --minio_endpoint ENDPOINT \
    --collection_name NAME \
    --folder_path PATH
Evaluating Your Enterprise RAG Pipeline
Before scaling a RAG solution across the enterprise, it’s important to know how well it performs. That means going beyond intuition and using a structured evaluation framework that covers accuracy, speed, efficiency, cost, and operational reliability.
In this post, we’ll introduce the key dimensions of that framework. In future blog posts, we’ll go deeper into each one through benchmarking, performance tuning, and real-world testing.
Retrieval Accuracy
Let’s start with the heart of any RAG system: how accurately it retrieves relevant content.
Metrics like Precision@K, Recall@K, Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (nDCG) help quantify both the relevance and the order of results returned. These numbers tell us how well the system finds useful content and whether it ranks it in a way that helps the model generate accurate responses.
We’ll be benchmarking these metrics using real-world, human-annotated datasets across domains like finance, healthcare, and legal to see how different retrieval strategies stack up.
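As a reference point for what these metrics compute, here is a small, self-contained Python sketch of Precision@K, Recall@K, MRR, and nDCG over a toy set of ranked results; it is illustrative only, not the benchmarking harness we will use:
import math

def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k retrieved chunks that are relevant
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    # Fraction of all relevant chunks that appear in the top-k
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

def mrr(ranked, relevant):
    # Reciprocal rank of the first relevant result
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    # Discounted cumulative gain of the top-k, normalized by the ideal ordering
    dcg = sum(1.0 / math.log2(i + 1) for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# Toy example: ranked result IDs from the retriever vs. human-annotated relevant IDs
ranked_ids = ["doc7", "doc2", "doc9", "doc1", "doc4"]
relevant_ids = {"doc2", "doc4", "doc5"}

print(f"Precision@5: {precision_at_k(ranked_ids, relevant_ids, 5):.2f}")   # 0.40
print(f"Recall@5:    {recall_at_k(ranked_ids, relevant_ids, 5):.2f}")      # 0.67
print(f"MRR:         {mrr(ranked_ids, relevant_ids):.2f}")                 # 0.50
print(f"nDCG@5:      {ndcg_at_k(ranked_ids, relevant_ids, 5):.2f}")        # 0.50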
Latency and Throughput
Performance isn’t just about correctness. It’s also about speed, especially for interactive use cases like chat or search.
We’ll measure end-to-end latency, analyze how much time is spent in each stage of the pipeline (embedding, retrieval, generation), and test how well the system handles load. Metrics like query throughput and concurrent user support will help determine how scalable the system is under pressure.
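A simple starting point is to time repeated queries end to end and report percentiles. The sketch below assumes some query_fn callable that runs one query (for example, the answer() function in the earlier sketch); concurrent-load testing needs a proper load generator, which we'll cover when we benchmark the pipeline:
import time
import statistics

def measure_latency(query_fn, questions, warmup=3):
    # Warm up model instances and caches so cold starts don't skew the numbers
    for q in questions[:warmup]:
        query_fn(q)

    latencies = []
    for q in questions:
        start = time.perf_counter()
        query_fn(q)
        latencies.append(time.perf_counter() - start)

    latencies.sort()
    p50 = statistics.median(latencies)
    p95 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.95))]  # simple nearest-rank p95
    qps = len(latencies) / sum(latencies)
    print(f"p50: {p50 * 1000:.0f} ms   p95: {p95 * 1000:.0f} ms   single-stream throughput: {qps:.2f} queries/s")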
GPU Utilization
GPUs are powerful, but they’re also expensive. To keep costs manageable, we need to measure how effectively they’re being used.
We’ll track metrics like average GPU utilization, memory usage, batching efficiency, and multi-GPU scaling performance. The goal is to identify areas where tuning the pipeline can improve performance without overprovisioning hardware.
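One low-effort way to sample these numbers is nvidia-smi's query mode, run on a GPU node or inside any pod that has the NVIDIA driver utilities available; the DCGM exporter deployed by the GPU Operator is the more complete, Prometheus-friendly option:
$ # Sample GPU and memory utilization every 5 seconds in CSV form
$ nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 5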
Cost Analysis
Cost plays a huge role in determining whether a solution is viable in the long run. We’ll look at cost per document processed, cost per query, and compare different storage and compute configurations. This helps make informed trade-offs between performance and cost.
What’s Coming Next
This is just the start. In upcoming posts, we’ll take a closer look at benchmarking Milvus retrieval performance, tuning embedding pipelines, and evaluating domain-specific RAG use cases in sectors like healthcare.
We’ll also explore cost optimization strategies, GPU scaling techniques, and ways to improve user experience with lower latency and more accurate responses.
Our goal is to give you the tools and data you need to build and refine a RAG system that performs well, scales with your needs, and stays within budget.
Enterprise Use Cases and Real-World Applications
The true impact of a RAG system lies in solving specific enterprise problems where traditional tools fall short. Below are examples of how this reference architecture addresses critical pain points across industries.
Enterprise Search
The problem: Employees waste hours searching across disconnected systems, scanning lengthy documents or using ineffective keyword-based search that misses important context.
The solution: A RAG-powered enterprise search system understands natural language queries, retrieves relevant content based on semantic similarity, and presents synthesized answers. It connects disparate data sources and surfaces precise insights in seconds, improving productivity and enabling faster decisions.
Customer Support
The problem: Support agents struggle to find answers hidden in complex documentation, while customers grow frustrated by limited self-service tools and slow response times.
The solution: RAG systems automatically retrieve relevant information from support databases, manuals, and historical tickets. Agents receive real-time recommendations, and customers can access AI-driven self-service portals that provide instant, accurate responses. This shortens resolution times, boosts first-call resolution, and enhances satisfaction.
Regulatory Compliance and Document Processing
The problem: Staying compliant in highly regulated industries requires constant review of policies, regulations, and documentation—an error-prone and time-consuming task.
The solution: A RAG pipeline can extract, analyze, and cross-reference content from regulatory documents, internal policies, and operational logs. It flags gaps, supports policy drafting, and ensures alignment with regulatory changes. This reduces risk, increases auditability, and lowers compliance costs.
Financial Services
The problem: Analysts spend significant time reviewing reports, filings, and market data scattered across siloed sources, delaying insights and increasing operational costs.
The solution: RAG pipelines accelerate investment research, automate due diligence, and provide risk insights by pulling from structured and unstructured financial documents. These systems support personalized wealth management and fraud detection workflows with greater efficiency and depth.
Healthcare
The problem: Physicians and researchers are overwhelmed by vast volumes of medical literature, patient records, and guidelines, making it difficult to access timely, relevant information.
The solution: RAG systems summarize patient histories, surface clinical guidelines, and retrieve medical literature that supports diagnosis and treatment. They also enhance research by synthesizing findings across studies. This reduces medical errors, improves outcomes, and accelerates innovation.
Legal and Professional Services
The problem: Legal teams deal with thousands of pages of case law, contracts, and compliance documents, requiring extensive time and effort to extract relevant details.
The solution: A RAG pipeline helps identify key clauses, precedents, and risks by analyzing legal documents semantically. It accelerates due diligence, reduces manual review, and enhances decision-making with contextual insights.
Wrapping Up
If you’re planning to build a Retrieval-Augmented Generation pipeline at enterprise scale, combining Microsoft Azure, NVIDIA’s RAG Blueprint, and Azure NetApp Files gives you a production-ready framework. NVIDIA delivers accelerated AI through its RAG Blueprint running in Azure, and Azure NetApp Files provides high-performance, enterprise-grade storage. This setup solves key challenges like extracting insights from multimodal enterprise content, delivering low-latency responses, and maintaining enterprise-grade security and compliance.
It connects powerful large language models to your company’s proprietary data and gives you the scale, performance, and reliability needed for production workloads.
Here’s what this solution delivers:
- It scales with your data and users
- It’s fast, thanks to accelerated computing and high-performance storage
- It’s flexible enough to handle different types of content
- It helps control costs by using the right resources at the right time
- And it’s built with security and compliance in mind
As generative AI becomes more mainstream, RAG is emerging as a practical way to get real value from it, especially when you need answers based on your own knowledge rather than only what the model was trained on.
So if you’re exploring how to bring AI into your business without giving up control over your data or your costs, this architecture gives you a solid, proven path forward.