May 22, 2026
Author: Arturo Quiroga, Azure AI services Engineer – Senior Partner Solutions Architect, Microsoft
A few days ago I published From Prompt to Production: Building Azure Architecture Diagrams with AI, introducing the open-source Azure Architecture Diagram Builder. One feature got more follow-up questions than any other: the Well-Architected Framework (WAF) validation. Architects from partners and customers — many of whom already use Azure Advisor and the Well-Architected Review — wanted to know exactly what scoring algorithm we use, how it compares to Microsoft’s official tools, and whether they should be using all three.
This post is that answer. It’s a deep dive into how design-time WAF validation works, how Microsoft’s two official WAF assessment algorithms work, and where each fits in the architecture lifecycle.
TL;DR. Microsoft ships two WAF assessment vehicles: the Well-Architected Review (a questionnaire, scored from human answers) and the Azure Advisor score (the ratio of healthy to applicable resources, weighted per subcategory, with Defender Secure Score for Security and cost-weighted math for Cost). Both require either a human filling in a form or live Azure telemetry. Our app runs at design time on a diagram, before anything is deployed, using a hybrid pipeline: a deterministic rule pre-scan followed by an LLM refinement pass. Same five WAF pillars, different lifecycle stage. Complementary, not competitive.
Why design-time validation matters
Every cost overrun, reliability gap, and security incident I’ve ever debugged was cheaper to fix on a whiteboard than in production. Yet most WAF tooling assumes the architecture already exists, either because there are deployed resources to scan (Advisor) or because someone has built enough of it to answer roughly 60 specific questions about it (WAR).
That leaves a gap. Between “rough sketch” and “deployed resource group” there is no algorithmic WAF feedback loop. That’s the gap the Diagram Builder fills.
Microsoft’s two official WAF assessment algorithms
Before describing our approach, it’s worth being precise about what Microsoft already ships, because the term “WAF assessment algorithm” can mean either of two very different things.
1. Azure Well-Architected Review (WAR) — questionnaire-based
The Well-Architected Review is a free self-assessment hosted on Microsoft Learn.
| Aspect | Detail |
|---|---|
| Input | Human answers to ~60 questions mapped to the WAF pillar checklists |
| Workload variants | Core WAR, plus AI/ML, IoT, SAP on Azure, Azure Stack Hub, SaaS, Mission Critical |
| Scoring | Derived from the answers — each “no” or unanswered question subtracts from the pillar score |
| Output | Per-pillar maturity score + prioritized recommendations + optional Advisor integration |
| Improvement tracking | “Milestones” (point-in-time snapshots) |
| When to use | Periodic deep reviews; greenfield design baselining; brownfield audits |
WAR is human-driven. The algorithm is essentially “how many of the recommended practices have you confirmed you do?” — which is exactly the right algorithm when the assessor is the workload team itself.
2. Azure Advisor Score — telemetry-based
The Advisor score is the closest thing Microsoft ships to a real, deterministic WAF algorithm. It runs continuously over your deployed Azure resources.
The math:
Pillar-specific overrides:
- Security uses Microsoft Defender for Cloud’s Secure Score model.
- Cost weights by retail $ cost of healthy resources, plus age-of-recommendation weighting; postponed/dismissed items are removed from the denominator.
- Reliability / Performance / Operational Excellence use the healthy-resources ratio above.
Key terms:
- Healthy resource — a deployed resource with no open Advisor recommendation against it for that pillar.
- Total applicable — resources Advisor was able to evaluate (excludes dismissed/snoozed).
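The healthy-resources ratio described above can be sketched in a few lines. This is only an illustration of the idea, not Advisor's implementation: the `SubcategoryHealth` shape, the subcategory names, and the weights are all assumptions made up for the example, and the real subcategory weights are defined by Advisor.

```typescript
// Hypothetical shape: one entry per subcategory within a pillar.
interface SubcategoryHealth {
  name: string;
  healthy: number;    // resources with no open recommendation for this pillar
  applicable: number; // resources that could be evaluated (dismissed/snoozed excluded)
  weight: number;     // assumed relative weight, invented for this sketch
}

// Weighted average of healthy/applicable ratios, scaled to 0-100.
function pillarScore(subcategories: SubcategoryHealth[]): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const s of subcategories) {
    if (s.applicable === 0) continue; // nothing to evaluate, so it does not count
    weighted += s.weight * (s.healthy / s.applicable);
    totalWeight += s.weight;
  }
  // With nothing applicable there is no meaningful score; 0 is a placeholder.
  return totalWeight === 0 ? 0 : (100 * weighted) / totalWeight;
}

const reliability = pillarScore([
  { name: "example-subcategory-a", healthy: 8, applicable: 10, weight: 2 },
  { name: "example-subcategory-b", healthy: 3, applicable: 6, weight: 1 },
]);
console.log(reliability.toFixed(1)); // 70.0
```

The empty case makes the pre-deployment problem concrete: with zero applicable resources there is nothing to divide by.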
Advisor is the right tool once you’re in production. It cannot help you before deployment, because there is nothing to count as “healthy” or “applicable.”
The missing stage: design time
Here’s the lifecycle and where each tool runs:
- Design / Diagram: Diagram Builder validation runs here.
- Operate / Observe: Azure Advisor runs here continuously.
- Periodic Review: WAR runs here, typically quarterly or at major milestones.
These three stages are sequential and complementary. Our app does not replace Advisor or WAR — it adds a feedback loop earlier in the lifecycle, where corrections are cheapest.
How design-time validation works in the Azure Architecture Diagram Builder
The validator is a two-phase hybrid pipeline: deterministic local rules first, then LLM refinement. The full source lives in three files:
- src/services/architectureValidator.ts: orchestrator and prompt
- src/services/wafPatternDetector.ts: topology + service rule engine
- src/data/wafRules.ts: the rule knowledge base
Phase 1 — Deterministic rule pre-scan (~1 ms, no LLM)
When you click Validate Architecture, the validator runs a fully client-side rule engine against the diagram’s services, connections, and groups. There are two kinds of rules:
Architecture-pattern rules
These fire when a topology anti-pattern is detected:
| Pattern | Detection trigger |
|---|---|
| single-region | No global LB (Traffic Manager / Front Door) with ≥3 services |
| single-database | Exactly one database service, no replication signal |
| no-cache | Compute + database present, no Redis/CDN |
| no-monitoring | No Azure Monitor / App Insights / Log Analytics |
| no-identity | No Microsoft Entra ID |
| no-waf | Public web tier without WAF / Front Door / App Gateway |
| direct-db-access | An edge from a frontend service directly into a database |
| no-key-vault | 4+ services and no Key Vault |
| no-backup | Database present, no Azure Backup / Recovery Services |
| no-api-gateway | 2+ compute services and no APIM / App Gateway / Front Door |
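To make the table concrete, here is a minimal sketch of how four of these triggers could be checked against a diagram. The `Diagram`, `Service`, and `Category` types and the `detectPatterns` function are hypothetical names invented for this example; the actual types and logic in wafPatternDetector.ts may look quite different.

```typescript
// Hypothetical diagram model, assumed for illustration only.
type Category = "frontend" | "compute" | "database" | "monitoring" | "identity" | "secrets" | "other";

interface Service { id: string; type: string; category: Category; }
interface Connection { from: string; to: string; }
interface Diagram { services: Service[]; connections: Connection[]; }

function detectPatterns(d: Diagram): string[] {
  const has = (c: Category): boolean => d.services.some((s) => s.category === c);
  const categoryOf = new Map<string, Category>();
  for (const s of d.services) categoryOf.set(s.id, s.category);

  const found: string[] = [];
  if (!has("monitoring")) found.push("no-monitoring");
  if (!has("identity")) found.push("no-identity");
  // Threshold from the table: 4+ services and no Key Vault.
  if (d.services.length >= 4 && !has("secrets")) found.push("no-key-vault");
  // An edge from a frontend service straight into a database.
  const direct = d.connections.some(
    (c) => categoryOf.get(c.from) === "frontend" && categoryOf.get(c.to) === "database",
  );
  if (direct) found.push("direct-db-access");
  return found;
}

const tiny: Diagram = {
  services: [
    { id: "web", type: "Static Web App", category: "frontend" },
    { id: "sql", type: "SQL Database", category: "database" },
    { id: "ai", type: "Application Insights", category: "monitoring" },
  ],
  connections: [{ from: "web", to: "sql" }],
};
console.log(detectPatterns(tiny)); // [ 'no-identity', 'direct-db-access' ]
```

Because every check is a plain predicate over the services and edges, a scan like this needs no network call, which is consistent with the roughly 1 ms budget quoted for Phase 1.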
Service-specific rules
Every service in the generated Azure Architecture diagram is matched against SERVICE_SPECIFIC_RULES by normalized type: App Service, Functions, AKS, Cosmos DB, SQL Database, Storage, Key Vault, and 22 more.
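Matching by normalized type might look roughly like the sketch below. Everything here is assumed for illustration: the `ServiceRule` shape, the placeholder rule entries, the `RULES_BY_TYPE` map, and the normalization steps are not taken from wafRules.ts.

```typescript
// Hypothetical rule shape and placeholder entries, not the real knowledge base.
interface ServiceRule {
  id: string;
  pillar: string;
  severity: "critical" | "high" | "medium" | "low";
}

const RULES_BY_TYPE: Record<string, ServiceRule[]> = {
  "app service": [{ id: "example-app-service-rule", pillar: "Reliability", severity: "medium" }],
  "key vault": [{ id: "example-key-vault-rule", pillar: "Security", severity: "high" }],
};

// Assumed normalization: trim, lowercase, collapse separators, drop a leading "azure".
function normalizeType(raw: string): string {
  return raw
    .trim()
    .toLowerCase()
    .replace(/[\s_-]+/g, " ")
    .replace(/^azure /, "");
}

// Services with no entry simply produce no service-specific findings.
function rulesFor(rawType: string): ServiceRule[] {
  return RULES_BY_TYPE[normalizeType(rawType)] ?? [];
}

console.log(normalizeType("Azure App-Service")); // app service
console.log(rulesFor("  Key_Vault ").length);    // 1
console.log(rulesFor("Some Unknown Service").length); // 0
```

Normalizing before lookup means label variations in the diagram ("App Service", "Azure App-Service") resolve to the same rule set.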
The knowledge base at a glance
| Metric | Count |
|---|---|
| Total rules | 73 |
| Architecture-pattern rules | 10 |
| Service-specific rules | 63 |
| Distinct Azure services covered | 29 |
| Rules tagged Reliability | 18 |
| Rules tagged Security | 34 |
| Rules tagged Cost Optimization | 5 |
| Rules tagged Operational Excellence | 7 |
| Rules tagged Performance Efficiency | 9 |
The preliminary score
Each finding has a severity, and severity drives a fixed point deduction from a starting score of 100:
| Severity | Deduction |
|---|---|
| critical | −12 |
| high | −7 |
| medium | −3 |
| low | −1 |
The result is floored at 10 (so even a deliberately bad architecture scores at least 10) and capped at 95 (no findings ≠ perfect; there’s always something the model might still catch). This is the deterministic baseline before the LLM ever sees the architecture, and it’s the reproducible part of the pipeline.
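The deduction table and the 10/95 bounds translate directly into a few lines of code. The function and constant names below are made up for this sketch; only the numbers come from the table above.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

// Point deductions per finding, taken from the severity table.
const DEDUCTION: Record<Severity, number> = { critical: 12, high: 7, medium: 3, low: 1 };

// Start at 100, subtract per finding, then clamp to the range 10..95.
function preliminaryScore(severities: Severity[]): number {
  const raw = severities.reduce((score, s) => score - DEDUCTION[s], 100);
  return Math.min(95, Math.max(10, raw));
}

// 1 critical, 1 high, 3 medium, 3 low: 100 - 12 - 7 - 9 - 3
const mixed: Severity[] = ["critical", "high", "medium", "medium", "medium", "low", "low", "low"];
console.log(preliminaryScore(mixed)); // 69
console.log(preliminaryScore([]));    // 95 (no findings still hits the cap)
```

Since the function depends only on the list of severities, the same diagram always yields the same preliminary score.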
Phase 2 — LLM contextual refinement
The pre-scan output, the topology, and the optional natural-language description are folded into a focused prompt sent to one of seven Azure OpenAI models (GPT-5.1 through 5.4, GPT-5.x Codex variants, DeepSeek V3.2 Speciale, Grok 4.1 Fast). The system prompt gives the model explicit scoring guardrails:
- Score based on what IS present, not what COULD be added.
- A well-connected architecture with appropriate services should score 60–80.
- Score below 50 only for critical gaps (no auth, no monitoring, single points of failure).
- Findings are improvement suggestions, not reasons to penalize the score severely.
The model returns strict JSON:
    {
      "overallScore": 0-100,
      "summary": "2–3 sentence assessment",
      "pillars": [
        {
          "pillar": "Reliability | Security | Cost Optimization | Operational Excellence | Performance Efficiency",
          "score": 0-100,
          "findings": [
            {
              "severity": "critical | high | medium | low",
              "category": "…",
              "issue": "…",
              "recommendation": "…",
              "resources": ["service-name-1", "service-name-2"],
              "source": "rule-based | ai-analysis"
            }
          ]
        }
      ],
      "quickWins": [ /* same shape as findings */ ]
    }

Two things to call out:
- Every finding is tagged rule-based or ai-analysis. That tag is the credibility lever. You can always see what the deterministic engine produced versus what the model contributed on top. If you don’t trust the AI layer, you can ignore it entirely; the rule layer still stands.
- The LLM is given pattern hints, not the entire rule catalog. The prompt stays small and focused, which is roughly 3–5× faster and cheaper than asking the LLM to do everything from scratch.
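For readers who want to consume this payload, the JSON shape above maps naturally onto TypeScript types. The sketch below is an assumption-laden illustration: the type names, `parseValidation`, and `countBySource` are invented here, and the sample payload is fabricated purely to exercise the code.

```typescript
type Severity = "critical" | "high" | "medium" | "low";
type Source = "rule-based" | "ai-analysis";

interface Finding {
  severity: Severity;
  category: string;
  issue: string;
  recommendation: string;
  resources: string[];
  source: Source;
}
interface PillarResult { pillar: string; score: number; findings: Finding[]; }
interface ValidationResult {
  overallScore: number;
  summary: string;
  pillars: PillarResult[];
  quickWins: Finding[];
}

// Minimal structural check; a real consumer would likely validate every field.
function parseValidation(json: string): ValidationResult {
  const data = JSON.parse(json) as ValidationResult;
  if (typeof data.overallScore !== "number" || !Array.isArray(data.pillars)) {
    throw new Error("Unexpected validation payload");
  }
  return data;
}

// Separate what the rule engine produced from what the model added.
function countBySource(result: ValidationResult): Record<Source, number> {
  const counts: Record<Source, number> = { "rule-based": 0, "ai-analysis": 0 };
  for (const p of result.pillars) {
    for (const f of p.findings) counts[f.source] += 1;
  }
  return counts;
}

// Illustrative payload only, not real model output.
const sample = parseValidation(JSON.stringify({
  overallScore: 70,
  summary: "Example payload for illustration only.",
  pillars: [{
    pillar: "Security",
    score: 50,
    findings: [
      { severity: "critical", category: "Identity", issue: "example issue A", recommendation: "example fix A", resources: [], source: "rule-based" },
      { severity: "low", category: "Network", issue: "example issue B", recommendation: "example fix B", resources: [], source: "ai-analysis" },
    ],
  }],
  quickWins: [],
}));
console.log(countBySource(sample)); // { 'rule-based': 1, 'ai-analysis': 1 }
```

Filtering on the source field is all it takes to ignore the AI layer and keep only the deterministic findings.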
What the user sees
On every run the modal reports:
- Overall WAF score (0–100)
- Per-pillar score × 5 (0–100 each)
- Severity breakdown: counts of critical / high / medium / low across all findings
- Quick wins: high-impact, low-effort items the model surfaces separately
- Hybrid metadata: local findings count, patterns detected, KB rules used, preliminary score, local elapsed ms
- AI metrics: model used, reasoning effort, prompt/completion/total tokens, elapsed time
- App Insights telemetry: an Architecture_Validated event with model, overall score, finding count, elapsed time
Worked example
Take this prompt, which I’ve used in demos with partners:
“A multi-region web application: Azure Front Door in front of two App Service instances in West US 2 and East US 2, both reading from an Azure SQL Database with geo-replication, with Application Insights for telemetry. No Entra ID, no Key Vault.”
After generation, Validate Architecture runs:
Phase 1 — pre-scan (deterministic), ~1 ms
- Patterns detected: no-identity, no-key-vault
- Findings produced: 8 (1 critical, 1 high, 3 medium, 3 low)
- Preliminary score: 100 − 12 − 7 − (3×3) − (1×3) = 69
Phase 2 — LLM refinement, ~6–9 s depending on model
The model accepts the two pattern hints, validates them in context, and adds three more findings of its own:
| Finding | Source | Pillar | Severity |
|---|---|---|---|
| No Microsoft Entra ID for authentication | rule-based | Security | critical |
| No Key Vault for secret management | rule-based | Security | high |
| App Service slots not used for safe deploys | ai-analysis | Operational Excellence | medium |
| SQL DB geo-replication present but RTO/RPO not documented | ai-analysis | Reliability | medium |
| No CDN for static assets behind Front Door | ai-analysis | Performance Efficiency | low |
Final scores returned by the model:
| Pillar | Score |
|---|---|
| Reliability | 78 |
| Security | 52 |
| Cost Optimization | 80 |
| Operational Excellence | 70 |
| Performance Efficiency | 75 |
| Overall | 71 |
The Security score is the lowest because two of the highest-severity findings landed there — exactly what a human reviewer would flag first.
Multi-model comparison
Because the deterministic floor is identical across runs, the Validation Comparison view becomes a fair shootout of what each LLM adds on top of the same baseline. The same diagram is scored by all seven models, and the UI surfaces:
- Overall score per model
- Per-pillar score per model
- Severity-count deltas
- Number of ai-analysis findings each model contributed
- Quick wins each model identified
This is genuinely useful for two reasons. First, it shows that LLM scores vary — typically by ±5–10 points on the same architecture — which is exactly why we publish the rule-based vs ai-analysis tag. Second, it lets architects pick the model whose review style matches their own.
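One simple way to quantify that variance is to look at the spread of overall scores across models for the same diagram. The helper below is a hypothetical sketch, and the model names and scores are placeholders rather than measured results.

```typescript
interface Spread { min: number; max: number; spread: number; mean: number; }

// Summarize how far apart the models' overall scores are.
function scoreSpread(scoresByModel: Record<string, number>): Spread {
  const values = Object.values(scoresByModel);
  if (values.length === 0) throw new Error("no scores to compare");
  const min = Math.min(...values);
  const max = Math.max(...values);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return { min, max, spread: max - min, mean };
}

// Placeholder numbers for illustration only.
const demo = scoreSpread({ "model-a": 66, "model-b": 71, "model-c": 76 });
console.log(demo); // { min: 66, max: 76, spread: 10, mean: 71 }
```

A wide spread is a cue to lean on the findings, especially the rule-based ones, rather than on any single model's number.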
How we align with Microsoft’s algorithms
| Alignment point | What it means |
|---|---|
| Same five pillars | Identical names and scope to the official WAF |
| Same source material | Rules derived from WAF docs and Azure Architecture Center service guides |
| Severity-graded findings | Map conceptually to Advisor’s high/medium/low impact recommendations |
| Per-pillar + overall scoring | Mirrors WAR/Advisor output shape, so the results feel familiar |
Where we deliberately differ — and why
| Concern | Microsoft | Diagram Builder | Why we differ |
|---|---|---|---|
| Needs deployed resources | Advisor: yes | No — works on a diagram | We’re a design-time tool; the architecture doesn’t exist yet |
| Needs human Q&A | WAR: yes | No — derived from the diagram | One-click validation inside the design flow |
| Healthy/Applicable ratio | Advisor: yes | No | No resource-health signal exists pre-deployment |
| Subcategory fixed weights | Advisor: yes | No explicit weights | Severity is the de-facto weight (12/7/3/1) |
| Defender Secure Score for Security | Advisor: yes | No | Defender requires deployed resources |
| Cost-weighted scoring | Advisor: yes | No (separate Cost Estimation feature) | Cost is a separate pipeline in our app |
| AI/LLM refinement | Neither | Yes | Catches context-specific issues a static catalog misses, and explains findings in natural language |
| Multi-model comparison | Neither | Yes | Lets architects see scoring variance across models |
Honest limitations
I’d rather you hear these from me than discover them in production:
- LLM scores drift. ±5–10 points across models on the same diagram is normal. Treat the score as directional, the findings as actionable. The rule-based tag is your anchor.
- No live telemetry. We can’t know if your App Service is actually using availability zones, only that you have App Service in the diagram. Advisor will tell you the truth post-deployment.
- Generic ruleset. No specialized workload branches yet (AI/ML, IoT, SAP, SaaS). WAR has those.
- No milestone tracking. Each validation run is independent. Compare runs manually using the Validation Comparison view.
- Rule coverage is finite. 29 services and 73 rules is a strong start but not exhaustive — the LLM layer exists in part to compensate for that gap.
How to use all three together
A lifecycle that actually works:
- Design — Use the Diagram Builder to sketch the architecture and validate at design time. Iterate until the per-pillar scores look reasonable and the critical/high findings are addressed.
- Deploy — Generate Bicep from the diagram, deploy, and let Azure Advisor start scoring real resources.
- Operate — Use Azure Advisor continuously. Use Defender Secure Score for security posture.
- Periodic review — Run a Core WAR every quarter or at major milestones to capture the things only humans know (business context, tradeoffs, planned debt).
None of these three tools replaces the others. They cover different stages of the same loop.
What’s next
A few things on the roadmap I’d love feedback on:
- Milestone tracking so design-time scores can be compared over time the way WAR milestones work.
- Workload-specific rulesets mirroring WAR’s branches — starting with AI/ML.
- Direct Advisor handoff — once a diagram is deployed, surface the corresponding Advisor recommendations in the same UI to close the loop.
Try it, fork it, tell me where it’s wrong
- Live app: https://aka.ms/diagram-builder
- Source: github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder
Useful references:
- Azure Well-Architected Framework pillars
- Azure Well-Architected Review tool
- Azure Advisor score — calculation
- Use Azure WAF assessments (Advisor)
- Complete an Azure Well-Architected Review assessment
If you’re a partner or customer architect who’s already living in Advisor and WAR, I’d genuinely value your reaction — does the design-time stage feel like a real gap to you, or are you already covering it some other way? Open an issue on the repo or reply on LinkedIn.
Posted on the Azure Architecture Blog · Comments and issues welcome on the repo.