In April 2025, the White House Office of Science and Technology Policy released over 10,000 public comments in response to its AI Action Plan. These comments, ranging from a few words to over 40,000, offer a rare and powerful snapshot of how Americans feel about the future of artificial intelligence.
But how do you make sense of 4.5 million words of diverse, opinionated, unstructured text? That’s where Gen AI comes in.
This blog series is for data scientists, software developers, and government officials—anyone looking to use AI not just for insight, but for efficiency. Whether you’re analyzing public feedback, internal reports, or customer input, the ability to turn massive volumes of text into actionable insight—fast—is a game-changer.
In the first post of this series, we explored how Gen AI can help us listen at scale—transforming over 10,000 public comments on the White House AI Action Plan into structured, scored summaries using LLMs and Azure Functions. I built a scalable pipeline that converted PDFs to markdown, summarized responses into JSON, and scored them across sentiment, argument quality, and passion.
Now, in Part 2, we move from listening to connecting. This post is about how I constructed and used a knowledge graph—powered by Microsoft’s GraphRAG research—to surface patterns, link ideas, and uncover deeper insights from the data.
Case study and prereleased product disclaimers: This document is for informational purposes only. Some information relates to pre-released product which may be substantially modified before it’s commercially released. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY AND WITH RESPECT TO THE INFORMATION PROVIDED HERE.
Microsoft GraphRAG
Standard RAG (Retrieval-Augmented Generation) excels at retrieving facts but often misses the big picture. It struggles to synthesize insights across documents or understand large text collections holistically.
Microsoft’s GraphRAG solves this by using LLMs and machine learning graph analysis to build a knowledge graph from your data. This graph acts as a map, helping the LLM connect concepts and answer complex questions more effectively than snippet-based search.
With GraphRAG, you can ask broad questions like “What themes emerge across these reports?” and get synthesized, evidence-backed answers. It’s especially powerful for narrative datasets, such as comments, logs, or 10,000 public responses to the proposed AI Action Plan, where understanding trends matters as much as retrieving facts.
A quick comparison of GraphRAG and LazyGraphRAG: one builds a full knowledge graph, the other delivers similar insights with lower cost and setup.
Our latest innovation is LazyGraphRAG, a leaner, more efficient fork of GraphRAG. While full GraphRAG builds and summarizes the entire knowledge graph upfront, LazyGraphRAG combines vector and graph search on the fly and defers heavy LLM analysis until query time. This hybrid strategy delivers GraphRAG-level insights at a fraction of the cost, making it practical even under tight budget or latency constraints.
Note: LazyGraphRAG is still experimental and remains an internal-only fork of our GraphRAG research project. You can achieve similar results today with the open-source GraphRAG library to explore hidden connections and turn massive text into actionable knowledge; I suggest using fast indexing and DRIFT search.
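Both of those pieces are exposed through the library’s CLI in recent releases, e.g. `graphrag index --method fast` for the lighter indexing pass and `graphrag query --method drift` for DRIFT search; flag names can shift between versions, so check the documentation for your installed release.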
Step 1: (More) Data Prep
To prepare the data for GraphRAG, I needed to perform one final preprocessing step. GraphRAG constructs the knowledge graph from unstructured text and thus expects a single text field per document it ingests. Since GraphRAG can embed metadata directly into the graph, I split each JSON document from Blog Post 1 into two parts: the core text used to build the graph, and the metadata I wanted embedded in the graph’s nodes and edges. The table below lists the structured fields I created in Blog Post 1 and how I mapped them for indexing by GraphRAG.
Field | Description | Mapped as |
---|---|---|
responseTitle | Short descriptive title < 8 words. | Metadata |
sentiment | Positive, Negative, Mixed, Neutral. | Metadata |
summary | 2–4 sentence overview of the response. | Core text |
stance | 1–2 sentences describing the stance expressed in the response. | Core text |
topics | List of specific subject matter or domains. | Metadata |
policyCategories | List of policy actions or approaches. | Metadata |
keyRecommendations | List of actionable suggestions with names and descriptions. | Core text |
keyRisks | List of concerns or warnings with names and descriptions. | Core text |
briefExcerpt | 1–2 key sentences quoted from the response. | Core text |
stakeholderInfo | Type, domain or market, and anonymity. | Metadata |
stakeholders | List of response authors including companies and individuals. | Both |
argumentQuality | Rating from 1 to 5 on quality of argument. | Metadata |
factualAccuracy | Rating from 1 to 5 on accuracy of presented facts. | Metadata |
passionScore | Rating from 1 to 5 on emotion of response. | Metadata |
humanReview | Yes/No for interesting or concerning responses. | Metadata |
To make the dataset GraphRAG-ready, I wrote a script that transformed each JSON document into an array of JSON documents, one for each text-based field, and replicated the relevant metadata across each entry so it could be embedded in the graph. This approach allowed each key insight, whether a summary, recommendation, or risk, to become its own cluster of nodes in the graph, improving granularity and enabling more precise traversal and retrieval. Here’s a brief before-and-after example.
Before
{
  "fileName": "AI-RFI-2025-ABCD.pdf",
  "responseTitle": "Concern Over Lack of AI Copyright Protections",
  "sentiment": "Negative",
  "summary": "The submission expresses strong concern that …",
  "stance": "The response strongly opposes any AI Action Plan because …",
  "topics": [ "Copyright", "Labor", "National Security" ],
  "policyCategories": [ "Regulation", "Security" ],
  "keyRecommendations": [
    [ "AI Regulation", "Require AI companies to obtain permission …" ],
    [ "Copyright Protection", "Ensure AI systems and their developers respect …" ]
  ],
  "keyRisks": [
    [ "Job Loss", "AI deployment without regulation could put …" ],
    [ "Cultural Erosion", "Allowing companies to use creative …" ]
  ],
  "briefExcerpt": "\"By giving companies the ability to …\"",
  "stakeholderInfo": { "type": "Individual", "domain": "Tech", "isAnonymous": false },
  "stakeholders": [ "John Doe" ],
  "argumentQuality": { "score": 3, "rationale": "The argument is clear …" },
  "factualAccuracy": { "score": 2, "rationale": "The claims about copyright theft …" },
  "passionScore": { "score": 4, "rationale": "The language is strongly …" },
  "humanReview": { "requiresHumanReview": true, "rationale": "The response …" }
}
After
[
  { # example summary
    "text": "AI-RFI-2025-ABCD.pdf expresses strong concern that …",
    "title": "Concern Over Lack of AI Copyright Protections",
    "metadata": {
      "filename": "AI-RFI-2025-ABCD.pdf",
      "sentiment": "Negative",
      "topics": [ "Copyright", "Labor", "National Security" ],
      "policyCategories": [ "Regulation", "Security" ],
      "stakeholderInfo": { "type": "Individual", "domain": "Tech", "isAnonymous": false },
      "stakeholders": [ "John Doe" ],
      "argumentQuality": { "score": 3 },
      "factualAccuracy": { "score": 2 },
      "passionScore": { "score": 4 },
      "humanReview": { "requiresHumanReview": true }
    }
  },
  { # example recommendation
    "text": "AI-RFI-2025-ABCD.pdf recommends AI Regulation. Require AI companies …",
    "title": "Concern Over Lack of AI Copyright Protections",
    "metadata": {
      "filename": "AI-RFI-2025-ABCD.pdf",
      "sentiment": "Negative",
      "topics": [ "Copyright", "Labor", "National Security" ],
      "policyCategories": [ "Regulation", "Security" ],
      "stakeholderInfo": { "type": "Individual", "domain": "Tech", "isAnonymous": false },
      "stakeholders": [ "John Doe" ],
      "argumentQuality": { "score": 3 },
      "factualAccuracy": { "score": 2 },
      "passionScore": { "score": 4 },
      "humanReview": { "requiresHumanReview": true }
    }
  },
  { # example risk
    "text": "AI-RFI-2025-ABCD.pdf sees a risk of Job Loss. AI deployment without …",
    "title": "Concern Over Lack of AI Copyright Protections",
    "metadata": {
      "filename": "AI-RFI-2025-ABCD.pdf",
      "sentiment": "Negative",
      "topics": [ "Copyright", "Labor", "National Security" ],
      "policyCategories": [ "Regulation", "Security" ],
      "stakeholderInfo": { "type": "Individual", "domain": "Tech", "isAnonymous": false },
      "stakeholders": [ "John Doe" ],
      "argumentQuality": { "score": 3 },
      "factualAccuracy": { "score": 2 },
      "passionScore": { "score": 4 },
      "humanReview": { "requiresHumanReview": true }
    }
  }, …
]
This structure allowed GraphRAG to treat the text as the primary source for entity and relationship extraction, while the metadata remained available for filtering, faceting, and graph enrichment.
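For concreteness, here’s a minimal sketch of the kind of splitting script I used. The field names come from the table above, but the paths are illustrative and the generated phrasing is simplified (the real script rewrote each text a little more carefully):

import json
from pathlib import Path

# Fields replicated across every entry (see the mapping table above).
METADATA_FIELDS = ["sentiment", "topics", "policyCategories", "stakeholderInfo", "stakeholders"]
SCORE_FIELDS = ["argumentQuality", "factualAccuracy", "passionScore", "humanReview"]

def split_document(doc: dict) -> list[dict]:
    """Split one scored JSON response into an array of GraphRAG-ready entries."""
    metadata = {"filename": doc["fileName"]}
    metadata.update({f: doc[f] for f in METADATA_FIELDS if f in doc})
    # Keep only the scores and flags; the rationales stay out of the graph.
    metadata.update({f: {k: v for k, v in doc[f].items() if k != "rationale"}
                     for f in SCORE_FIELDS if f in doc})

    def entry(text: str) -> dict:
        return {"text": text, "title": doc["responseTitle"], "metadata": metadata}

    # One entry per core-text field: summary, stance, excerpt, recommendations, risks.
    entries = [
        entry(f'{doc["fileName"]} {doc["summary"]}'),
        entry(f'{doc["fileName"]} {doc["stance"]}'),
        entry(f'{doc["fileName"]} quotes: {doc["briefExcerpt"]}'),
    ]
    for name, description in doc.get("keyRecommendations", []):
        entries.append(entry(f'{doc["fileName"]} recommends {name}. {description}'))
    for name, description in doc.get("keyRisks", []):
        entries.append(entry(f'{doc["fileName"]} sees a risk of {name}. {description}'))
    return entries

# Transform every scored response from Blog Post 1 into GraphRAG input files.
out_dir = Path("graphrag_input")
out_dir.mkdir(exist_ok=True)
for path in Path("scored_responses").glob("*.json"):
    doc = json.loads(path.read_text())
    (out_dir / path.name).write_text(json.dumps(split_document(doc), indent=2))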
Note: GraphRAG is fully capable of constructing a knowledge graph from the raw response text without any of the preprocessing from Blog Post 1. However, that graph would not be enriched with the additional metadata that helps during knowledge graph traversal, retrieval, and, ultimately, answer generation.
Step 2: Building a metadata-aware knowledge graph
With the data prepped and split into GraphRAG-ready JSON documents, the next step was configuring the indexer to ingest them properly. GraphRAG uses a YAML configuration file to define how it reads and processes input data. For this project, I needed to tell GraphRAG how to: (1) locate the JSON files, (2) identify the title and text fields, (3) embed metadata, and (4) handle chunking behavior. Here’s an excerpt from my updated configuration file:
input:
  file_type: json
  file_pattern: ".*\\.json$$"
  title_column: title
  text_column: text
  metadata: [metadata]
  # additional input configurations… type, base_dir, etc.

chunks:
  prepend_metadata: true
  chunk_size_includes_metadata: false
  # additional chunk configurations… size, overlap, etc.
Within the input block, I told GraphRAG to look for JSON documents and how to map their fields. Within the chunks block, the settings ensure the metadata is prepended to each chunk without inflating the chunk-size calculation.
With this configuration in place, GraphRAG was able to ingest each JSON entry as a standalone document, extract entities and relationships from the text field, and embed the associated metadata directly into the graph. This setup ensured that every node and edge carried not just semantic meaning, but also contextual signals—like sentiment, stakeholder type, and policy relevance—that would later power more nuanced queries and visualizations.
Now that the indexer was configured, I could run it to construct the graph and explore what it had built.
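With the open-source library, that run is a single CLI invocation from the project root, assuming the configuration above lives in settings.yaml there: `graphrag index --root .`. Indexing a corpus this size consumes real time and LLM budget, so it’s worth a dry run on a small sample of documents first.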
Step 3: Generating Global Insights
Once the knowledge graph was constructed, I turned my attention to the real goal: surfacing insights that would be impossible to extract through traditional search or summarization alone.
The final graph contained over 123,000 nodes, 3.1 million edges, and 16,720 clusters—a dense web of ideas, arguments, and concerns. To make sense of this complexity, I used GraphRAG’s built-in query capabilities.
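As a quick sanity check on those numbers, you can inspect the indexer’s parquet outputs directly. Here is a minimal sketch with pandas, assuming a recent GraphRAG release (output file names have varied across versions; older releases prefix them with create_final_):

import pandas as pd

# GraphRAG writes its graph tables to the output folder as parquet files.
entities = pd.read_parquet("output/entities.parquet")             # graph nodes
relationships = pd.read_parquet("output/relationships.parquet")   # graph edges
communities = pd.read_parquet("output/communities.parquet")       # Leiden clusters

print(f"{len(entities):,} nodes, {len(relationships):,} edges, {len(communities):,} clusters")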
How it works
GraphRAG applies the Leiden clustering algorithm to group related ideas based on node connectivity and edge weights. These clusters represent emergent themes—like privacy concerns, regulatory suggestions, or ethical dilemmas—surfaced not by keyword frequency, but by conceptual proximity.
To explore the knowledge graph for response augmentation, GraphRAG uses a query-expansion approach, generally described in this blog post. Starting with a broader prompt like:
“What are the key trends across individual respondents?”
GraphRAG generated five subqueries to probe the graph from different angles, captured in the table below.
Subquery | Prompt |
---|---|
Subquery 1 | What are the recurring themes, concerns, and priorities expressed by individual respondents regarding AI, including ethical, economic, and societal impacts? |
Subquery 2 | What regulatory challenges, opportunities, and specific topics such as privacy and intellectual property were most frequently discussed by individual respondents? |
Subquery 3 | What trends can be observed in the responses based on the demographic, socioeconomic, and professional backgrounds of individual respondents? |
Subquery 4 | What actionable suggestions and recommendations were commonly proposed by individual respondents, including those related to education, workforce development, and environmental sustainability? |
Subquery 5 | What trends in tone, sentiment, and passion can be observed across individual responses to the AI Action Plan? |
Each subquery acted as a lens, identifying entry points into the graph—nodes and clusters most relevant to that line of inquiry. From there, GraphRAG traversed the graph, pulling in related nodes and their associated text chunks. This process was guided by iterative LLM-based relevance checks, ensuring that only the most contextually meaningful content was included in response generation.
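To ask this kind of global question of your own index, the open-source CLI exposes the query methods directly, e.g. `graphrag query --root . --method global --query "What are the key trends across individual respondents?"`; swap in `--method drift` for DRIFT search. As above, the exact flags depend on your installed release.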
What I found
The insights generated from the graph are not fixed—they evolve based on the questions you ask. GraphRAG can surface different patterns depending on the context of your inquiry, so these results are just an example from one line of questioning.
Beginning with “What are the key trends across individual respondents?”, GraphRAG synthesized a nuanced response from over 1,200 text chunks across 800+ documents and 90 ‘idea’ clusters. Representative findings for individual respondents included:
- A strong emphasis on privacy and data protection, particularly among individual respondents with technical or legal backgrounds.
- Recurring concern about AI-driven job displacement, with varying tones depending on respondent domain.
- A surprising number of grassroots policy suggestions related to education reform, environmental sustainability, and AI transparency.
- Distinct patterns in sentiment (generally negative) and passion (high), with emotionally charged responses clustering around topics like surveillance, misinformation, and creative rights.
These examples reflect just one line of inquiry. The real strength of GraphRAG lies in its adaptability: by adjusting your query, you can explore entirely different dimensions of the dataset, whether you’re interested in regulatory gaps, stakeholder-specific concerns, or sectoral and economic impact.
A snapshot of the graph in action
The accompanying image shows a subset of the knowledge graph used to answer the “key trends” query. Each circle represents a community cluster, with sub-communities nested down to the fourth level. Dots are individual ideas (nodes), and the gray lines are the relationships that connect them, used both for clustering and for traversing the graph during response generation.
This visual illustrates how GraphRAG doesn’t just retrieve relevant snippets—it maps the conceptual terrain of the dataset, allowing us to explore it with nuance and depth.
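If you’d like to produce a similar visualization yourself, the open-source library can snapshot the graph for external tools: setting `graphml: true` under the `snapshots` section of settings.yaml writes a GraphML file you can load into a graph viewer such as Gephi.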
With a knowledge graph in place that can surface these insights, the next step is making this capability available to a wider audience through tools in the applications they use every day.
What’s Next: Creating an AI Action Plan Copilot
Across Blog Posts 1 and 2, we’ve taken 10,000+ unstructured public comments, extracted meaningful metadata from them, and created a rich knowledge graph capable of surfacing deep insights across the entire data set.
And while I find it fascinating to get lost in the data and spelunk through the graph, this capability needs to reach a broader audience to be truly useful.
In my final Blog Post, I’ll show how to bring these insights to life for end users—embedding them into Microsoft 365 Copilot as a custom agent that empowers policy analysts to explore, query, and act on public feedback in real time.
Key Takeaways
- Transformed structured JSON into a metadata-rich knowledge graph using GraphRAG.
- Surfaced emergent themes and sentiment patterns across 10,000+ public comments.
- Demonstrated how query expansion and graph traversal yield deeper policy insights.