Finding the Right Page Number in PDFs with AI Search


Published by azurefeeds on August 11, 2025


Why Page Numbers Matter in AI Search

When users search for content within large PDFs—such as contracts, manuals, or reports—they often need to know not just what was found, but where it was found. Associating search results with page numbers enables:

Contextual navigation within documents.

Precise citations in knowledge bases or chatbots.

Improved user trust in AI-generated responses.

Prerequisites: Azure Blob Storage and Azure AI Search Setup

1. Azure Blob Storage

A container configured to store the PDF files.

2. Appropriate permissions

The Azure AI Search service must have Storage Blob Data Reader access to the container. If using RBAC, ensure the role is assigned to the search service's managed identity.

Ref:

Azure AI Search: blob indexer role-based access

How to Index Azure Blobs
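
As a rough illustration of the permissions prerequisite, the sketch below builds a data source definition that relies on the search service's managed identity rather than an account key. The "ResourceId=...;" connection string form is the documented keyless format for blob data sources; the data source name, container name, and resource ID are hypothetical placeholders, and the helper build_blob_datasource is not part of any SDK.

```python
import json


def build_blob_datasource(name: str, storage_resource_id: str, container: str) -> dict:
    """Build a blob data source payload that authenticates via managed identity.

    A connection string of the form "ResourceId=...;" carries no account key,
    so the search service connects with its own identity and needs the
    Storage Blob Data Reader role on the storage account or container.
    """
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": f"ResourceId={storage_resource_id};"},
        "container": {"name": container},
    }


# Hypothetical placeholder values, for illustration only.
resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
)
datasource = build_blob_datasource("pdf-datasource", resource_id, "pdfs")
print(json.dumps(datasource, indent=2))
```

If the role assignment is missing, the indexer would be expected to fail when it first tries to list blobs, so it is worth confirming the assignment before creating the indexer.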

Technical Approaches to Extract Page Numbers Using AI Search

1. Adding a Skillset: Document Cracking and Index Projection for Parent-Child Indexing

The first step in skillset execution is document cracking, which separates text and image content. Skills then enrich that content: the OCR skill extracts text from images, and the Text Merger skill is commonly used to merge the textual representation of images (OCR output or image captions) back into the content field of a document. This is especially useful for PDFs or Word documents that combine text with embedded images, because the final enriched document then includes all relevant textual data regardless of its original format, which improves the accuracy of downstream search and analysis. The skillset below runs only the OCR skill, once per page image.

An index projection specifies how parent-child content is mapped to fields in a search index for one-to-many indexing. Here, each page image becomes one child document that carries its OCR text and its page number.

{
  "@odata.etag": "\"0x8DDD58F12B5D0B9\"",
  "name": "pagenumskillset",
  "description": "Skillset to feed document to OCR skill and use Index Projection to split the content page wise",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "name": "#1",
      "context": "/document/normalized_images/*",
      "lineEnding": "Space",
      "defaultLanguageCode": "en",
      "detectOrientation": true,
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "text"
        },
        {
          "name": "layoutText",
          "targetName": "layoutText"
        }
      ]
    }
  ],
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.DefaultCognitiveServices"
  },
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "pagenumidx",
        "parentKeyFieldName": "ParentKey",
        "sourceContext": "/document/normalized_images/*",
        "mappings": [
          {
            "name": "DocText",
            "source": "/document/normalized_images/*/text",
            "inputs": []
          },
          {
            "name": "DocName",
            "source": "/document/metadata_storage_name",
            "inputs": []
          },
          {
            "name": "DocURL",
            "source": "/document/metadata_storage_path",
            "inputs": []
          },
          {
            "name": "PageNum",
            "source": "/document/normalized_images/*/pageNumber",
            "inputs": []
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
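
To make the one-to-many mapping easier to picture, here is a small local simulation of what the selector above does to a single enriched document. It is only a sketch: project_pages and the sample data are hypothetical, the real projection runs inside the search service, and the service-generated ID and ParentKey values are deliberately left out.

```python
def project_pages(enriched_doc: dict) -> list[dict]:
    """Return one child document per page image of an enriched parent document."""
    return [
        {
            # /document/normalized_images/*/text
            "DocText": image["text"],
            # /document/metadata_storage_name (repeated on every child)
            "DocName": enriched_doc["metadata_storage_name"],
            # /document/metadata_storage_path (repeated on every child)
            "DocURL": enriched_doc["metadata_storage_path"],
            # /document/normalized_images/*/pageNumber
            "PageNum": image["pageNumber"],
        }
        for image in enriched_doc["normalized_images"]
    ]


# Hypothetical enriched document for a two-page PDF.
enriched = {
    "metadata_storage_name": "docname.pdf",
    "metadata_storage_path": "https://example.invalid/container/docname.pdf",
    "normalized_images": [
        {"pageNumber": 1, "text": "Introduction"},
        {"pageNumber": 2, "text": "Terms and conditions"},
    ],
}

for child in project_pages(enriched):
    print(child["PageNum"], child["DocText"])
```

Because sourceContext is /document/normalized_images/*, the number of child documents equals the number of page images, and parent-level fields such as the file name are copied onto each child.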

2. Defining the Index

An index is defined by a schema and stored within the search service. Because queries run against this stored index rather than against the external data source, responses typically return in milliseconds. Except in indexer-driven scenarios, the search service never reads the original data directly, which makes it well suited to high-performance search applications.

{
  "@odata.etag": "\"0x8DDD58E7F7CF595\"",
  "name": "pagenumidx",
  "fields": [
    {
      "name": "ID",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": true,
      "analyzer": "keyword",
      "synonymMaps": []
    },
    {
      "name": "DocText",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "synonymMaps": []
    },
    {
      "name": "DocName",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": true,
      "facetable": true,
      "key": false,
      "synonymMaps": []
    },
    {
      "name": "PageNum",
      "type": "Edm.Int32",
      "searchable": false,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": true,
      "facetable": false,
      "key": false,
      "synonymMaps": []
    },
    {
      "name": "DocURL",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "analyzer": "standard.lucene",
      "synonymMaps": []
    },
    {
      "name": "ParentKey",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "analyzer": "keyword",
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "suggesters": [],
  "analyzers": [],
  "normalizers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity"
  },
  "semantic": {
    "defaultConfiguration": "my_semantic_cfg",
    "configurations": [
      {
        "name": "my_semantic_cfg",
        "flightingOptIn": false,
        "rankingOrder": "BoostedRerankerScore",
        "prioritizedFields": {
          "titleField": {
            "fieldName": "DocName"
          },
          "prioritizedContentFields": [
            {
              "fieldName": "DocText"
            }
          ],
          "prioritizedKeywordsFields": [
            {
              "fieldName": "DocName"
            },
            {
              "fieldName": "ID"
            }
          ]
        }
      }
    ]
  }
}
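
Before running the indexer, it can help to sanity-check that the index lines up with the projection. The sketch below is a hypothetical local checker, not a service API. The rules it encodes (a single Edm.String key field that is searchable with the keyword analyzer, and a separate Edm.String parent key field) reflect the index projection requirements as understood here, so treat them as assumptions and confirm against the current documentation.

```python
def check_projection_index(index: dict, parent_key: str, mapped_fields: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    fields = {field["name"]: field for field in index["fields"]}

    # Assumed rule: exactly one key field, a searchable string with the keyword analyzer.
    keys = [field for field in index["fields"] if field.get("key")]
    if len(keys) != 1:
        problems.append("index needs exactly one key field")
    else:
        key = keys[0]
        if key.get("type") != "Edm.String":
            problems.append(f"key field {key['name']!r} must be Edm.String")
        if not key.get("searchable") or key.get("analyzer") != "keyword":
            problems.append(f"key field {key['name']!r} must be searchable with the keyword analyzer")

    # Assumed rule: the parent key is a separate Edm.String field.
    parent = fields.get(parent_key)
    if parent is None:
        problems.append(f"parent key field {parent_key!r} is missing")
    elif parent.get("type") != "Edm.String" or parent.get("key"):
        problems.append(f"parent key field {parent_key!r} must be a non-key Edm.String")

    # Every projection mapping needs a target field of the same name.
    for name in mapped_fields:
        if name not in fields:
            problems.append(f"mapped field {name!r} is missing from the index")
    return problems


# Trimmed-down copy of the index above, keeping only the attributes being checked.
index = {
    "name": "pagenumidx",
    "fields": [
        {"name": "ID", "type": "Edm.String", "key": True, "searchable": True, "analyzer": "keyword"},
        {"name": "DocText", "type": "Edm.String"},
        {"name": "DocName", "type": "Edm.String"},
        {"name": "PageNum", "type": "Edm.Int32"},
        {"name": "DocURL", "type": "Edm.String"},
        {"name": "ParentKey", "type": "Edm.String"},
    ],
}
print(check_projection_index(index, "ParentKey", ["DocText", "DocName", "DocURL", "PageNum"]))
```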

3. Using an Azure AI Search Indexer with OCR and imageAction

Azure AI Search allows you to extract page-level data by configuring the indexer with "imageAction": "generateNormalizedImagePerPage".

This setting renders each PDF page as a separate image, which can then be processed using the OcrSkill. The OCR output can be mapped to a collection field, where each item corresponds to a page’s text, so page numbers can be inferred from the position of matched content in the collection. The configuration in this article instead projects each page into its own search document and reads the page number from the pageNumber property of the normalized image.

{
  "@odata.context": "https://searchinstancename.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DDD58F3760067D\"",
  "name": "indexer-pagenum",
  "description": null,
  "dataSourceName": "azureblob-1754401027271-datasource",
  "skillsetName": "pagenumskillset",
  "targetIndexName": "pagenumidx",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": null,
    "maxFailedItemsPerBatch": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "imageAction": "generateNormalizedImagePerPage",
      "pdfTextRotationAlgorithm": "none"
    }
  },
  "fieldMappings": [],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}
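
The definitions above can be created in the portal or pushed through the REST API. As a minimal sketch of the REST route, the helper below prepares (but does not send) a PUT request using only the Python standard library. The helper name, the api-version default, and the admin key placeholder are assumptions; some properties in the index definition above may require a preview api-version. Dropping the top-level @odata.etag and @odata.context values copied from the portal is a cautious choice made here, not a documented requirement.

```python
import json
import urllib.request


def build_put_request(service: str, kind: str, definition: dict, api_key: str,
                      api_version: str = "2024-07-01") -> urllib.request.Request:
    """Prepare a create-or-update request for an index, skillset, or indexer.

    kind is the REST collection name: "indexes", "skillsets", "indexers", or "datasources".
    """
    # Drop portal-exported bookkeeping values from the top level only.
    payload = {k: v for k, v in definition.items()
               if k not in ("@odata.etag", "@odata.context")}
    url = (f"https://{service}.search.windows.net/{kind}/"
           f"{payload['name']}?api-version={api_version}")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


# Abbreviated indexer definition; "<admin-key>" is a placeholder.
indexer = {
    "@odata.etag": "\"0x8DDD58F3760067D\"",
    "name": "indexer-pagenum",
    "dataSourceName": "azureblob-1754401027271-datasource",
    "skillsetName": "pagenumskillset",
    "targetIndexName": "pagenumidx",
    "parameters": {"configuration": {"imageAction": "generateNormalizedImagePerPage"}},
}
request = build_put_request("searchinstancename", "indexers", indexer, "<admin-key>")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```

The indexer refers to the data source, index, and skillset by name, so those three need to exist before the indexer request is sent.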

Validation and Conclusion

You can use Search Explorer to view the output, which will look like the following:

{
  "@search.score": 1,
  "ID": "",
  "DocText": "the PDF content ",
  "DocName": "docname.pdf",
  "PageNum": 33,
  "DocURL": "https://storageaccountname.blob.core.windows.net/containername/docname.pdf",
  "ParentKey": "sample key"
}
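
Once results carry a page number, turning them into citations is straightforward. The sketch below shows one way a chatbot or knowledge base might do it: build a search request body restricted to one file, then format a hit as a citation. Both helper names are hypothetical. The "#page=N" URL fragment is a PDF open parameter honored by many browser PDF viewers, but whether it works depends on how the blob is served and viewed.

```python
from typing import Optional


def build_search_body(query: str, doc_name: Optional[str] = None, top: int = 5) -> dict:
    """Build a search request body that returns the fields needed for a citation."""
    body = {"search": query, "select": "DocName,PageNum,DocURL,DocText", "top": top}
    if doc_name is not None:
        # OData string literals escape a single quote by doubling it.
        escaped = doc_name.replace("'", "''")
        body["filter"] = f"DocName eq '{escaped}'"
    return body


def format_citation(hit: dict) -> str:
    """Turn one search hit into a 'file, page N (link)' citation string."""
    page = hit["PageNum"]
    return f"{hit['DocName']}, page {page} ({hit['DocURL']}#page={page})"


# Hit shaped like the Search Explorer output above.
hit = {
    "DocName": "docname.pdf",
    "PageNum": 33,
    "DocURL": "https://storageaccountname.blob.core.windows.net/containername/docname.pdf",
}
print(build_search_body("termination clause", doc_name="docname.pdf"))
print(format_citation(hit))
```

The filter works because DocName is marked filterable in the index; PageNum is filterable and sortable as well, so results can also be narrowed to a page range or ordered by page.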

I hope this helps you retrieve page numbers from PDFs using Azure AI Search.
