August 11, 2025
Why Page Numbers Matter in AI Search
When users search for content within large PDFs—such as contracts, manuals, or reports—they often need to know not just what was found, but where it was found. Associating search results with page numbers enables:
- Contextual navigation within documents.
- Precise citations in knowledge bases or chatbots.
- Improved user trust in AI-generated responses.
Prerequisites: Azure Blob Storage and Azure AI Search Setup
1. Azure Blob Storage
A container is configured to store the PDF files.
2. Appropriate permissions
The Azure AI Search service must have Storage Blob Data Reader access to the container. If you use RBAC, make sure that role is assigned to the search service's managed identity.
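As a rough illustration of the managed-identity prerequisite, the sketch below builds a blob data source definition whose connection string contains only a storage account resource ID and no account key. The function name and every argument value are placeholders, not values from this post; with this style of connection string, access depends entirely on the Storage Blob Data Reader role assignment described above.

```python
import json


def blob_data_source(name, subscription_id, resource_group, storage_account, container):
    """Build a data source body that relies on the search service's managed identity.

    All argument values are placeholders supplied by the caller.
    """
    resource_id = (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{storage_account}"
    )
    return {
        "name": name,
        "type": "azureblob",
        # No account key here: the role assignment on the storage account grants access.
        "credentials": {"connectionString": f"ResourceId={resource_id};"},
        "container": {"name": container},
    }


# Example with placeholder values; the printed JSON would be the request body.
print(json.dumps(
    blob_data_source("pdf-datasource", "<subscription-id>", "<resource-group>",
                     "<storage-account>", "<container>"),
    indent=2,
))
```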
Technical Approaches to Extract Page Numbers Using Azure AI Search
1. Adding a Skillset: Document Cracking and Index Projection for Parent-Child Indexing
The first step in skillset execution is document cracking, which separates text and image content. A common use case for the Text Merge skill is merging the textual representation of images, such as OCR output or image captions, into the content field of a document. This is especially useful for PDFs or Word documents that combine text with embedded images: the final enriched document then includes all relevant text regardless of its original format, which improves the accuracy of downstream search and analysis. The skillset in this post uses only the OCR skill, applied to one normalized image per page.
An index projection specifies how parent-child content is mapped to fields in a search index for one-to-many indexing. Here, each normalized page image becomes one child document in the pagenumidx index, and the skipIndexingParentDocuments projection mode keeps the parent PDF itself out of the index.
{
  "@odata.etag": "\"0x8DDD58F12B5D0B9\"",
  "name": "pagenumskillset",
  "description": "Skillset to feed document to OCR skill and use Index Projection to split the content page wise",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "name": "#1",
      "context": "/document/normalized_images/*",
      "lineEnding": "Space",
      "defaultLanguageCode": "en",
      "detectOrientation": true,
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "text"
        },
        {
          "name": "layoutText",
          "targetName": "layoutText"
        }
      ]
    }
  ],
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.DefaultCognitiveServices"
  },
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "pagenumidx",
        "parentKeyFieldName": "ParentKey",
        "sourceContext": "/document/normalized_images/*",
        "mappings": [
          {
            "name": "DocText",
            "source": "/document/normalized_images/*/text",
            "inputs": []
          },
          {
            "name": "DocName",
            "source": "/document/metadata_storage_name",
            "inputs": []
          },
          {
            "name": "DocURL",
            "source": "/document/metadata_storage_path",
            "inputs": []
          },
          {
            "name": "PageNum",
            "source": "/document/normalized_images/*/pageNumber",
            "inputs": []
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
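To make the one-to-many mapping more concrete, here is a small local simulation of what the index projection does conceptually: it walks every entry under /document/normalized_images/* and emits one child document per page. This is only an illustration, not service code. The dictionary shape of `enriched_doc` and the `parent_key` value are assumptions made for the example, and the real service also generates the child key (the ID field), which is omitted here.

```python
def project_pages(enriched_doc, parent_key):
    """Emit one child search document per normalized page image.

    Mirrors the selector mappings in the skillset: DocText and PageNum come
    from each image, DocName and DocURL from the parent document.
    """
    children = []
    for image in enriched_doc["normalized_images"]:
        children.append({
            "ParentKey": parent_key,
            "DocText": image["text"],
            "DocName": enriched_doc["metadata_storage_name"],
            "DocURL": enriched_doc["metadata_storage_path"],
            "PageNum": image["pageNumber"],
        })
    return children


# Hypothetical enriched document after OCR; pageNumber starts at 1 for PDFs.
sample_doc = {
    "metadata_storage_name": "docname.pdf",
    "metadata_storage_path": "https://storageaccountname.blob.core.windows.net/containername/docname.pdf",
    "normalized_images": [
        {"pageNumber": 1, "text": "first page text"},
        {"pageNumber": 2, "text": "second page text"},
    ],
}

for child in project_pages(sample_doc, parent_key="sample-parent-key"):
    print(child["DocName"], child["PageNum"], child["DocText"])
```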
2. Defining the Index
An index is defined by a schema and stored within the search service, which keeps response times in the millisecond range by decoupling queries from external data sources. Except in indexer-driven scenarios, the search service never queries the original data directly, making it well suited to high-performance search applications.
{
  "@odata.etag": "\"0x8DDD58E7F7CF595\"",
  "name": "pagenumidx",
  "fields": [
    {
      "name": "ID",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": true,
      "analyzer": "keyword",
      "synonymMaps": []
    },
    {
      "name": "DocText",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "synonymMaps": []
    },
    {
      "name": "DocName",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": true,
      "facetable": true,
      "key": false,
      "synonymMaps": []
    },
    {
      "name": "PageNum",
      "type": "Edm.Int32",
      "searchable": false,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": true,
      "facetable": false,
      "key": false,
      "synonymMaps": []
    },
    {
      "name": "DocURL",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "analyzer": "standard.lucene",
      "synonymMaps": []
    },
    {
      "name": "ParentKey",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "analyzer": "keyword",
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "suggesters": [],
  "analyzers": [],
  "normalizers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity"
  },
  "semantic": {
    "defaultConfiguration": "my_semantic_cfg",
    "configurations": [
      {
        "name": "my_semantic_cfg",
        "flightingOptIn": false,
        "rankingOrder": "BoostedRerankerScore",
        "prioritizedFields": {
          "titleField": {
            "fieldName": "DocName"
          },
          "prioritizedContentFields": [
            {
              "fieldName": "DocText"
            }
          ],
          "prioritizedKeywordsFields": [
            {
              "fieldName": "DocName"
            },
            {
              "fieldName": "ID"
            }
          ]
        }
      }
    ]
  }
}
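Because DocName is filterable and PageNum is both filterable and sortable in this schema, a query can be restricted to a single PDF and its hits returned in page order. The sketch below only builds the JSON body for a search POST request and sends nothing. The function name and default values are illustrative assumptions; the `search`, `filter`, `orderby`, `select`, and `top` keys are standard query parameters.

```python
def build_page_query(search_text, doc_name=None, top=5):
    """Build a search request body that returns page-level hits.

    If doc_name is given, results are limited to that PDF and ordered by page,
    which replaces the default relevance ordering.
    """
    body = {
        "search": search_text,
        "select": "DocName,PageNum,DocURL,DocText",
        "top": top,
    }
    if doc_name is not None:
        # OData string literals escape a single quote by doubling it.
        escaped = doc_name.replace("'", "''")
        body["filter"] = f"DocName eq '{escaped}'"
        body["orderby"] = "PageNum asc"
    return body


print(build_page_query("termination clause", doc_name="docname.pdf"))
```

Leaving `doc_name` unset keeps the default relevance ranking across all documents, which is usually the better choice for open-ended questions.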
3. Using an Azure AI Search Indexer with OCR and imageAction
Azure AI Search allows you to extract page-level data by configuring the indexer with "imageAction": "generateNormalizedImagePerPage".
This setting renders each PDF page as a separate image, which can then be processed by the OcrSkill. The OCR output can be mapped to a collection field, where each item corresponds to one page's text. This lets you infer page numbers from the position of the matched content in the collection.
{
  "@odata.context": "https://searchinstancename.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DDD58F3760067D\"",
  "name": "indexer-pagenum",
  "description": null,
  "dataSourceName": "azureblob-1754401027271-datasource",
  "skillsetName": "pagenumskillset",
  "targetIndexName": "pagenumidx",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": null,
    "maxFailedItemsPerBatch": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "imageAction": "generateNormalizedImagePerPage",
      "pdfTextRotationAlgorithm": "none"
    }
  },
  "fieldMappings": [],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}
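The idea of inferring a page number from an item's position in a collection can be sketched in a few lines. This assumes the per-page OCR text has been kept as an ordered list of strings, one per page; that list and the function name are hypothetical. In the setup described in this post the step is not needed, because PageNum is filled directly from each image's pageNumber.

```python
def pages_containing(page_texts, term):
    """Return the 1-based page numbers whose OCR text contains the term.

    page_texts is assumed to be ordered by page, so position + 1 is the page.
    """
    needle = term.lower()
    return [
        page_number
        for page_number, text in enumerate(page_texts, start=1)
        if needle in text.lower()
    ]


ocr_pages = ["Introduction", "Termination clause", "Appendix on termination"]
print(pages_containing(ocr_pages, "termination"))
```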
Validation and Conclusion
You can use Search Explorer to view the output, which looks like this:
{
  "@search.score": 1,
  "ID": "",
  "DocText": "the PDF content ",
  "DocName": "docname.pdf",
  "PageNum": 33,
  "DocURL": "https://storageaccountname.blob.core.windows.net/containername/docname.pdf",
  "ParentKey": "sample key"
}
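Once each hit carries DocName, PageNum, and DocURL, turning it into a citation for a chatbot or knowledge base is straightforward. The helper names below are illustrative. The `#page=` fragment is a common PDF open parameter that many browser viewers honor, but whether it works for a given blob URL and viewer is an assumption to verify.

```python
def format_citation(hit):
    """Render a search hit as a short, human-readable citation."""
    return f'{hit["DocName"]}, page {hit["PageNum"]}'


def page_link(hit):
    """Build a link that asks a PDF viewer to open at the cited page."""
    return f'{hit["DocURL"]}#page={hit["PageNum"]}'


sample_hit = {
    "DocName": "docname.pdf",
    "PageNum": 33,
    "DocURL": "https://storageaccountname.blob.core.windows.net/containername/docname.pdf",
}
print(format_citation(sample_hit))
print(page_link(sample_hit))
```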
Hope this helps with your requirement of getting page numbers from PDFs using Azure AI Search.