In April 2025, the White House Office of Science and Technology Policy released over 10,000 public comments in response to its AI Action Plan. These comments—ranging from a few words to over 40,000—offer a rare and powerful snapshot of how Americans feel about the future of artificial intelligence.
But how do you make sense of 4.5 million words of diverse, opinionated, unstructured text? That's where Gen AI comes in.
This blog series is for data scientists, software developers, and government officials—anyone looking to use AI not just for insight, but for efficiency. Whether you’re analyzing public feedback, internal reports, or customer input, the ability to turn massive volumes of text into actionable insight—fast—is a game-changer.
In this first post, I’ll walk through my first morning with the data set: how I used multiple LLMs, a bit of Python, and Azure to explore and process the data. You’ll see how I went from PDFs to structured JSON summaries in just a few hours—laying the groundwork for deeper analysis in future posts.
Case study disclaimer: This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
Step 1: Getting Familiar with the Data
Before diving into large-scale processing, I needed to understand what I was working with. The dataset—over 10,000 public comments—was rich, diverse, and completely unstructured. Some responses were just a few words. Others were sprawling essays, clocking in at over 40,000 words. That’s roughly eight weeks of reading—if you went cover to cover.
As someone outside the policy domain, I turned to Copilot to help me make sense of it.
Asking the Right Questions with Copilot's Researcher Agent
To simulate how a policy analyst might approach this data, I asked Copilot’s Researcher Agent:
You are a policy analyst chartered with reviewing 10,000+ comments from the public about a newly proposed US Government regulatory policy. Please give me 20 example questions you would be looking to answer based on the comments you received.
The result was a thoughtful response that concluded with a table that looked like the following:
Dimension | Question
Overall Sentiment | What proportion of comments express overall support, opposition, or neutrality toward the proposal?
Top Praises | Which specific provisions or impacts are most frequently praised?
Top Criticisms | Which specific provisions or impacts are most frequently criticized?
Stakeholder Breakdown | How do perspectives vary by commenter type (e.g., industry, academic, individual citizen)?
Geographic Distribution | Are there geographic patterns in support or opposition (regions, states, urban vs. rural)?
Sectoral Impact | Which economic sectors (e.g., energy, manufacturing, agriculture) are mentioned most, and how do their reactions differ?
Equity & Environmental Justice | To what extent do comments raise concerns about disproportionate impacts on low-income or minority communities?
Clarity & Understanding | Do commenters indicate confusion about any aspects of the rule (definitions, compliance steps, deadlines)?
Data & Evidence Requests | Are there recurring calls for additional data, studies, or cost-benefit analyses? If so, on what topics?
Legal & Constitutional Issues | What legal or constitutional challenges are suggested (e.g., federal overreach, preemption of state law)?
Implementation Feasibility | What practical implementation issues are raised (technology availability, timelines, staffing)?
Cost Concerns | What types of cost impacts (compliance costs, downstream consumer prices) are most frequently cited?
Alternative Approaches | What alternative regulatory approaches or policy instruments (e.g., incentives, voluntary programs) do commenters propose?
Enforcement & Monitoring | How do commenters view the proposed enforcement mechanisms? Are they seen as too weak or too burdensome?
Interactions with Other Regulations | Do comments note conflicts or synergies with existing federal, state, or international rules?
Anticipated Benefits | Which benefits (public health, environmental, economic) are highlighted most often?
Timing & Phasing | What suggestions are made regarding effective dates, phase-in periods, or grandfathering?
Technology & Innovation | Do commenters discuss impacts on technological innovation or competitive dynamics?
Public Awareness & Outreach | Are there calls for additional public education, guidance documents, or stakeholder workshops?
International Competitiveness & Alignment | How do stakeholders view impacts on U.S. competitiveness and alignment with international standards?
Suggested Text Changes | What specific wording changes, clarifications, or new sections do commenters recommend inserting or removing?
With a clearer understanding of the kinds of questions a policy analyst might ask, I was ready to start extracting structured insights. That meant designing a summarization prompt that could capture these dimensions consistently across thousands of responses.
Step 2: Designing a Summarization Prompt
Next, I worked with Copilot to design a prompt that could summarize, score, and structure each response into a standardized JSON object for future analysis. I started with this prompt:
I want to process each response using an LLM, to summarize and evaluate the response across multiple dimensions. What would you suggest I use as those dimensions and for my prompt to the chat completion LLM?
After a brief conversation with Copilot, refining some of the fields and referring to the table above, I ended up with the following evaluation prompt:
You are an Assistant analyzing a single RFI response for the AI Action Plan.
Please read the text below and produce a standardized JSON object following these rules:
1. Output only valid JSON (no extra commentary).
2. Use the following schema for the JSON object:
3. If any field is not applicable, use an empty string ("") or empty list ([]).
4. Do not include additional fields.
5. Make sure the final output is valid JSON and does not include any extra explanation.
6. Do not include any markdown or code blocks in the output.
{
  "responseTitle": "Short descriptive title",
  "stakeholderType": "One of ['Academic', 'Industry (Tech)', 'Industry (Non-Tech)', 'Nonprofit', 'Individual', 'Government', 'Other']",
  "stakeholders": ["One or more standardized strings, e.g. 'Google', 'NIST', ..."],
  "sentiment": "One of ['Positive', 'Negative', 'Mixed', 'Neutral']",
  "topics": ["One or more standardized strings, e.g. 'Open Source', 'Healthcare', ..."],
  "policyCategories": ["One or more standardized strings, e.g. 'R&D', 'Education', ..."],
  "keyRecommendations": ["A list of free-form text entries, each capturing a major recommendation or request from the response"],
  "briefExcerpt": "1-2 sentences quoted directly from the text that represent a key idea",
  "summary": "A concise free-form summary of the submission (2-4 sentences)"
}
RESPONSE to Analyze:
Of course, before I could apply this prompt at scale, I needed to get the raw text out of the original PDF files. That meant building a fast, reliable pipeline for converting PDFs into a format I could work with.
Step 3: Converting PDFs to Markdown
The original dataset was published as PDFs, so I needed to quickly extract the text from them. I used Microsoft’s open-source markitdown package because it is extremely lightweight and fast for basic text extraction. There are many ways to do this, including Azure AI Document Intelligence, which I would turn to as an enterprise-grade solution to extract information from complex layouts, tables, or forms within the PDFs. Since I was just concerned with text, markitdown would do the trick for now. This is an excerpt from the script that I used.
from pathlib import Path
from markitdown import MarkItDown

def convert_pdf_to_md(pdf_path):
    md = MarkItDown(enable_plugins=False)
    pdf = Path(pdf_path)
    try:
        content = md.convert(str(pdf)).text_content
        md_file = pdf.with_suffix(".md")
        md_file.write_text(content, encoding="utf-8")
        print(f"Converted {pdf.name} → {md_file.name}")
    except Exception as e:
        print(f"Failed to convert {pdf.name}: {e}")
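For context, here is a minimal sketch of how a helper like this might be driven over a folder of downloaded PDFs; the responses_pdf directory name is an assumption for illustration, not the layout I actually used.

from pathlib import Path

# Hypothetical input folder holding the downloaded RFI responses as PDFs.
pdf_dir = Path("responses_pdf")

for pdf_path in sorted(pdf_dir.glob("*.pdf")):
    # Reuse the helper above; each PDF gets a sibling .md file.
    convert_pdf_to_md(pdf_path)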
With the raw text successfully extracted from the PDFs, I was finally ready to put my summarization prompt to work. This next phase was all about testing, refining, and learning—quickly.
Step 4: Conversing with the Data – Iterating Toward Insight
Now that I had a way to extract the text from the PDFs, I could put my summarization prompt and an LLM to work processing responses and generating structured JSON summaries. I started with 1% of the dataset (~100 random responses), using GPT-4o-mini with my initial prompt. GPT-4o-mini is an extremely efficient, cost-effective chat completion model that I use for almost all of my early development and prototyping work. This is an excerpt of the script I used:
# content = markdown content; prompt_template = summarization prompt from above
prompt = prompt_template.replace("", content)
response = await asyncio.to_thread(
    client.chat.completions.create,  # client = Azure OpenAI client
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
json_output = response.choices[0].message.content
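Because chat completion models occasionally return malformed JSON, a call like this is worth pairing with a light sanity check before storing the result. Here is a minimal sketch; the parse_summary helper and the set of fields it checks are my own illustration, not part of the original script.

import json

def parse_summary(json_output: str) -> dict | None:
    """Hypothetical helper: parse the model's output and spot-check a few fields."""
    try:
        summary = json.loads(json_output)
    except json.JSONDecodeError as e:
        print(f"Malformed JSON from the model: {e}")
        return None
    # Check a few required fields from the schema before keeping the result.
    required = {"responseTitle", "sentiment", "summary", "keyRecommendations"}
    missing = required - summary.keys()
    if missing:
        print(f"Summary missing expected fields: {missing}")
        return None
    return summary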
With my evolving Python script, I processed 1% of the dataset on my laptop in under 2 minutes. This speed allowed me to iterate quickly, spot-check the results, and develop my own mental model of the dataset itself. Reviewing the results uncovered other information I wanted to extract, such as Risks and a nuanced description of each response's Stance to go along with Sentiment.
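Most of that speed comes from issuing the calls concurrently rather than one at a time. A minimal sketch of that pattern, assuming a summarize coroutine that wraps the chat completion call above and a bounded semaphore to stay under service rate limits (both names and the concurrency cap are illustrative):

import asyncio

CONCURRENCY = 10  # illustrative cap to stay under rate limits

async def summarize_all(md_paths):
    sem = asyncio.Semaphore(CONCURRENCY)

    async def bounded(path):
        async with sem:
            # Assumed: summarize(path) wraps the chat completion call above
            # and returns the JSON string for one markdown file.
            return await summarize(path)

    # Fan out the calls and gather the structured summaries as they complete.
    return await asyncio.gather(*(bounded(p) for p in md_paths))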
While reviewing the summaries, I realized I should also try to score the responses. I started by asking the LLM (i.e., adding to my prompt) to rate each response on Factual Accuracy and Argument Quality. From this, I noticed that many of the Individual responses were scoring very low in both dimensions: they tended to be very short and often Negative in sentiment. I didn't want to accidentally screen out this information, so I added another scoring dimension: Passion. I also asked the LLM to flag responses that may need human review due to a variety of factors, including threats of violence, self-harm, or illegal activity, or when the recommendations and risks identified were particularly novel.
After about an hour of test cycles and adjustments, I landed on the following final summarization and scoring prompt:
You are an Assistant analyzing a single RFI response to the AI Action Plan. Your task is to produce a single valid JSON object that summarizes and evaluates response content. Follow all rules strictly.
Instructions:
– Output ONLY valid JSON (no markdown, no explanations, no extra text).
– Follow the schema exactly. Do not add, remove, or rename any fields.
- If a field is not applicable or cannot be determined, use a blank string (""), empty list ([]), false for booleans, or 0.0 for floats.
- Use only what is present in the response.
- Double-check for strict schema adherence and valid JSON format.
- Use double quotes for all strings, square brackets for all lists, and avoid any commentary.
- Set `humanReview.requiresHumanReview` to true if:
  - The response includes novel or concerning content (e.g. legal threats, privacy, self-harm),
  - There is any uncertainty in key fields.
Clarifications:
- 'topics': Specific subject matter or domains (e.g., "Privacy", "Labor").
- 'policyCategories': Policy action or approach types (e.g., "Regulation", "R&D").
- Do not duplicate entries between 'topics' and 'policyCategories'.
Schema:
{
  "responseTitle": "Short descriptive title",
  "sentiment": "One of ['Positive', 'Negative', 'Mixed', 'Neutral']",
  "summary": "2-4 sentence free-form summary of the response content.",
  "stance": "A single sentence summarizing the response's position on the AI Action Plan.",
  "topics": ["List of domain topics discussed"],
  "policyCategories": ["List of policy action categories"],
  "keyRecommendations": [ ["Short Title", "Free-form recommendation text"] ],
  "keyRisks": [ ["Short Title", "Free-form risk text"] ],
  "briefExcerpt": "One or two sentences quoted or paraphrased from the response.",
  "stakeholderInfo": {
    "type": "One of ['Academic', 'Industry', 'Nonprofit', 'Small-Business', 'Government', 'Individual', 'Other']",
    "domain": "One of ['Tech', 'Healthcare', 'Finance', 'Academic', 'Other']",
    "isAnonymous": true or false
  },
  "stakeholders": [
    "List of stakeholder names or anonymized descriptors (e.g. 'A small business owner'). Do not include literal 'Anonymous'."
  ],
  "argumentQuality": { "score": 0.0, "rationale": "Short explanation." },
  "factualAccuracy": { "score": 0.0, "rationale": "Short explanation." },
  "passionScore": { "score": 0.0, "rationale": "Short explanation." },
  "humanReview": { "requiresHumanReview": false, "rationale": "Short explanation" }
}
AI Action Plan Executive Order Text:
RESPONSE to Analyze:
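One detail worth noting: the two labeled sections at the end are where the grounding text goes. Here is a minimal sketch of how the final prompt might be assembled, assuming the template holds everything up to those labels; the build_scoring_prompt helper and its argument names are illustrative, not the exact mechanism in my script.

def build_scoring_prompt(prompt_template: str, eo_text: str, response_md: str) -> str:
    # Append the executive order text and the response under the two
    # labeled sections so the model sees both the policy and the comment.
    return (
        f"{prompt_template}\n"
        f"AI Action Plan Executive Order Text:\n{eo_text}\n\n"
        f"RESPONSE to Analyze:\n{response_md}\n"
    )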
With a solid prompt in hand, the next question was: which LLM would give me the best results? To find out, I ran a head-to-head evaluation of several models using a 5% sample of the dataset.
Step 5: Evaluating LLMs for Processing the Entire Dataset
Before I was ready to process the entire dataset, I wanted to decide which LLM would give me the best quality summaries and scores. To do this, I used LLM evaluation on a 5% random sample (~500 responses), testing summaries generated with my final prompt above by several LLMs: GPT-4o-mini, GPT-4o, GPT-4.1, GPT-4.5, o1, and o3. I used o1 itself as my evaluation LLM, along with a judging prompt, to evaluate summaries across four dimensions: Sentiment, Summarization Quality, Recommendation Extraction, and Argument Scoring.
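As a rough illustration of this LLM-as-judge pattern, the sketch below sends one candidate summary plus the source text to o1 with a judging prompt and asks for scores on the four dimensions. The judging prompt text and the judge_summary helper are simplified stand-ins for what I actually ran, and the model name assumes an o1 deployment with that name.

JUDGE_PROMPT = """You are grading a JSON summary of a public RFI response.
Rate the summary from 1-5 on each dimension: Sentiment, Summarization Quality,
Recommendation Extraction, and Argument Scoring. Output only JSON like
{"sentiment": 0, "summarization": 0, "recommendations": 0, "argument": 0}."""

def judge_summary(client, source_text: str, candidate_json: str) -> str:
    # client = Azure OpenAI client; o1 is used here as the evaluation model.
    response = client.chat.completions.create(
        model="o1",
        messages=[{
            "role": "user",
            "content": f"{JUDGE_PROMPT}\n\nSOURCE:\n{source_text}\n\nSUMMARY:\n{candidate_json}",
        }],
    )
    return response.choices[0].message.content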
Like my decision to use markitdown for PDF-to-Markdown conversion, using LLM evaluation to assess the quality of my summaries was driven primarily by speed of development and learning. In an enterprise scenario, I would turn to Azure AI Foundry's built-in evaluators, such as Content Safety, Groundedness, and IntentResolution, to do this at a larger, more reliable scale.
The results of this quick evaluation are shown in the following box-and-whisker charts. On the left is the distribution of the Summarization and Recommendation ratings; on the right, the distribution of the Sentiment and Argument Quality ratings. I organized them this way because the ratings on the left are more textual in nature, while the ratings on the right require a deeper understanding of the text itself.
Why GPT-4.1 Won: GPT-4.1, a chat-completion model, consistently produced the most accurate, nuanced, and well-structured summaries. It did a fantastic job of extracting key recommendations and risks, handled the multi-instruction prompts well, was cost-effective, and generated fewer malformed JSON outputs. However, GPT-4.1 did not perform as well as other models on sentiment analysis and argument quality scoring.
And Why o3 Won: o3, as a reasoning model, proved far more adept at analyzing the source text to assess sentiment, argument quality, and passion, and at flagging responses for human review. o1 performed similarly; however, it struggled to follow the rule to output only well-structured JSON. o3, in turn, did not do as well as the GPT-4.x models at text summarization and idea extraction.
With this new knowledge in hand, I quickly updated my summarization script to use two separate LLM calls: one to GPT-4.1 for text summarization tasks and one to o3 for scoring tasks. Now, it was time to scale. Processing 10,000+ responses on my laptop wasn’t practical—so I turned to Azure Functions to take the pipeline to the cloud.
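The split looks roughly like the sketch below, where SUMMARY_PROMPT and SCORING_PROMPT stand in for the two halves of the final prompt and client is an Azure OpenAI client; the function boundaries and names are my own simplification of the updated script.

import asyncio

async def summarize_and_score_response(client, response_md: str) -> tuple[str, str]:
    # GPT-4.1 handles the textual work: summary, stance, recommendations, risks.
    summary = await asyncio.to_thread(
        client.chat.completions.create,
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"{SUMMARY_PROMPT}\n{response_md}"}],
    )
    # o3 handles the reasoning-heavy work: sentiment, argument quality,
    # passion, and the human-review flag.
    scores = await asyncio.to_thread(
        client.chat.completions.create,
        model="o3",
        messages=[{"role": "user", "content": f"{SCORING_PROMPT}\n{response_md}"}],
    )
    return summary.choices[0].message.content, scores.choices[0].message.content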
Step 6: Scaling to the Cloud with Azure Functions
While I probably could have processed all the responses on my laptop, I decided to move everything to Azure for scale, resilience, and speed. I moved the core logic of my Python script into two separate Azure Functions that could scale independently. A high-level architecture is depicted in the figure below.
The first Function, pdf_to_markdown, triggers on the ‘arrival’ of a PDF into Azure Blob Storage, quickly converts the PDF into Markdown, and stores it back to Azure Blob Storage.
The second Function, summarize_and_score, triggers on the arrival of a new Markdown file, uses gpt-4.1 and o3 models served by Azure AI Foundry to summarize and score the response, and stores the generated JSON document back to Azure Blob Storage and to Azure Cosmos DB for future processing.
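For reference, here is a rough sketch of the first function using the Python v2 programming model for Azure Functions; the container and connection names are illustrative assumptions, not the real pipeline's layout.

import tempfile
from pathlib import Path

import azure.functions as func
from markitdown import MarkItDown

app = func.FunctionApp()

@app.blob_trigger(arg_name="pdf", path="responses-pdf/{name}",
                  connection="AzureWebJobsStorage")
@app.blob_output(arg_name="md_out", path="responses-md/{name}.md",
                 connection="AzureWebJobsStorage")
def pdf_to_markdown(pdf: func.InputStream, md_out: func.Out[str]):
    # Write the incoming blob to a temp file so markitdown can convert it,
    # then return the Markdown text through the output binding.
    converter = MarkItDown(enable_plugins=False)
    with tempfile.TemporaryDirectory() as tmp:
        tmp_pdf = Path(tmp) / "response.pdf"
        tmp_pdf.write_bytes(pdf.read())
        md_out.set(converter.convert(str(tmp_pdf)).text_content)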
Using this dataflow, converting a PDF to Markdown took less than a second on average, summarizing with gpt-4.1 took about 4 seconds on average, and scoring with o3 took about 17 seconds on average. I set my Azure Functions to scale out to 30 instances, dropped all of the responses into Blob Storage, and set off to have some lunch. Processing finished in about 90 minutes, though it could have been faster with more than 30 function instances.
Step 7: Initial Quantitative Observations
With the preprocessing pipeline deployed and run against all 10,000+ responses, I was able to review some of the quantitative metrics (word counts, scores, etc.) to get some early insights. A few patterns started to emerge, and some confirmed suspicions I had formed from even 1% of the dataset; a quick aggregation sketch follows the list below.
- Sentiment skewed negative among anonymous and individual respondents and small businesses, especially those expressing concern about AI's societal impacts.
- Argument quality varied widely, with institutional stakeholders generally delivering longer responses and scoring higher on structure and evidence, and skewing more positive but measured.
- Passion scores, as suspected, helped surface emotionally charged responses that might otherwise have been dismissed due to low factual or argumentative strength. 86% of responses were categorized as Negative, had an average word count of 227 words, and had a passion score of 3.6/5.
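For reference, the kind of aggregation behind these observations is straightforward once the JSON summaries are in one place. A minimal sketch, assuming the per-response JSON documents have been exported to a local "summaries" folder (the path is hypothetical; the field names follow the schema above):

import json
from pathlib import Path

import pandas as pd

# Assumed local export of the per-response JSON summaries.
records = [json.loads(p.read_text()) for p in Path("summaries").glob("*.json")]

df = pd.DataFrame({
    "sentiment": [r["sentiment"] for r in records],
    "stakeholder_type": [r["stakeholderInfo"]["type"] for r in records],
    "passion": [r["passionScore"]["score"] for r in records],
    "argument_quality": [r["argumentQuality"]["score"] for r in records],
})

# Early cuts of the data: sentiment mix and average scores by stakeholder type.
print(df.groupby("stakeholder_type")["sentiment"].value_counts(normalize=True))
print(df.groupby("stakeholder_type")[["passion", "argument_quality"]].mean())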
These early metrics gave me a sense of the dataset’s emotional and structural landscape—but numbers alone don’t tell the full story. To truly understand what people are saying, we need to go beyond sentiment and scores.
What’s Next: From Individual Voices to Collective Insight
So far, we’ve taken 10,000+ unstructured public comments and transformed them into structured, scored, and quantifiable data using Gen AI and Azure Functions.
These early metrics offer a valuable macro-level view—but they don’t tell the full story. What’s driving these numbers? What are people recommending? What are they passionate about? What risks do they fear most?
In Blog Post 2, we’ll move from analysis to understanding. Using a knowledge graph powered by Microsoft’s GraphRAG research, we’ll connect themes, surface patterns, and uncover the deeper insights hidden in the data.
If this post was about listening to individual voices, the next one is about hearing the chorus.
Stay tuned.
Key Takeaways
- Processed 4.5M words of public feedback—equivalent to 9-10 weeks of full-time reading—in under 4 hours of coding, iteration, and cloud-scale automation.
- Transformed 10,000+ unstructured public comments into structured and scored data using Gen AI.
- Designed and refined a custom summarization and scoring prompt to extract insights across 20+ policy dimensions.
- Evaluated multiple LLMs (GPT-4.1, o3, and others) to optimize for summarization quality, sentiment accuracy, and scoring reliability.
- Identified early trends in sentiment, argument quality, and passion across stakeholder types, laying the foundation for deeper thematic analysis.