WAR, Azure Advisor, and Us (Azure Arch Diagram Builder): Three Ways to Score an Azure Architecture

Published by azurefeeds on May 22, 2026

Why design-time validation matters

Every cost overrun, reliability gap, and security incident I’ve ever debugged was cheaper to fix on a whiteboard than in production. Yet most WAF tooling assumes the architecture already exists — either because there are deployed resources to scan (Advisor) or because someone has built enough of it to answer 60 specific questions about it (WAR).

That leaves a gap. Between “rough sketch” and “deployed resource group” there is no algorithmic WAF feedback loop. That’s the gap the Diagram Builder fills.

Microsoft’s two official WAF assessment algorithms

Before describing our approach, it’s worth being precise about what Microsoft already ships, because the term “WAF assessment algorithm” can mean either of two very different things.

1. Azure Well-Architected Review (WAR) — questionnaire-based

The Well-Architected Review is a free self-assessment hosted on Microsoft Learn.

Aspect	Detail
Input	Human answers to ~60 questions mapped to the WAF pillar checklists
Workload variants	Core WAR, plus AI/ML, IoT, SAP on Azure, Azure Stack Hub, SaaS, Mission Critical
Scoring	Derived from the answers — each “no” or unanswered question subtracts from the pillar score
Output	Per-pillar maturity score + prioritized recommendations + optional Advisor integration
Improvement tracking	“Milestones” (point-in-time snapshots)
When to use	Periodic deep reviews; greenfield design baselining; brownfield audits

WAR is human-driven. The algorithm is essentially “how many of the recommended practices have you confirmed you do?” — which is exactly the right algorithm when the assessor is the workload team itself.

2. Azure Advisor Score — telemetry-based

The Advisor score is the closest thing Microsoft ships to a real, deterministic WAF algorithm. It runs continuously over your deployed Azure resources.

The math, per pillar:

Pillar score = (healthy resources ÷ total applicable resources) × 100

Pillar-specific overrides:

Security uses Microsoft Defender for Cloud’s Secure Score model.

Cost weights by retail $ cost of healthy resources, plus age-of-recommendation weighting; postponed/dismissed items are removed from the denominator.

Reliability / Performance / Operational Excellence use the healthy-resources ratio above.

Key terms:

Healthy resource — a deployed resource with no open Advisor recommendation against it for that pillar.

Total applicable — resources Advisor was able to evaluate (excludes dismissed/snoozed).
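To make the healthy-resources ratio concrete, here is a minimal sketch. It is not Advisor's implementation: the `EvaluatedResource` shape and the `healthyRatioScore` name are made up for illustration, and the pillar-specific overrides for Security and Cost are left out.

```typescript
// Hypothetical shape for one resource as seen by a single pillar's evaluation.
interface EvaluatedResource {
  id: string;
  openRecommendations: number; // open recommendations against it for this pillar
  dismissed: boolean;          // dismissed/snoozed: excluded from "total applicable"
}

// Healthy-resources ratio as a percentage, or null when nothing is applicable.
function healthyRatioScore(resources: EvaluatedResource[]): number | null {
  const applicable = resources.filter((r) => !r.dismissed);
  if (applicable.length === 0) {
    return null; // e.g. before deployment: there is nothing to count
  }
  const healthy = applicable.filter((r) => r.openRecommendations === 0);
  return (healthy.length / applicable.length) * 100;
}

const sampleResources: EvaluatedResource[] = [
  { id: "app-1", openRecommendations: 0, dismissed: false },
  { id: "app-2", openRecommendations: 2, dismissed: false },
  { id: "sql-1", openRecommendations: 0, dismissed: false },
  { id: "vm-old", openRecommendations: 1, dismissed: true },
];

console.log(healthyRatioScore(sampleResources)); // 2 healthy of 3 applicable
console.log(healthyRatioScore([]));              // null
```

The `null` branch is the design-time problem in miniature: with no deployed resources, the denominator is empty and the ratio has nothing to say.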

Advisor is the right tool once you’re in production. It cannot help you before deployment, because there is nothing to count as “healthy” or “applicable.”

The missing stage: design time

Here’s the lifecycle and the stage each tool covers:



Design / Diagram — Diagram Builder validation runs here.

Operate / Observe — Azure Advisor runs here continuously.

Periodic Review — WAR runs here, typically quarterly or at major milestones.

These three stages are sequential and complementary. Our app does not replace Advisor or WAR — it adds a feedback loop earlier in the lifecycle, where corrections are cheapest.

How design-time validation works in the Azure Architecture Diagram Builder

The validator is a two-phase hybrid pipeline: deterministic local rules first, then LLM refinement. The full source lives in three files:

src/services/architectureValidator.ts — orchestrator and prompt

src/services/wafPatternDetector.ts — topology + service rule engine

src/data/wafRules.ts — the rule knowledge base

Phase 1 — Deterministic rule pre-scan (~1 ms, no LLM)

When you click Validate Architecture, the validator runs a fully client-side rule engine against the diagram’s services, connections, and groups. There are two kinds of rules:

Architecture-pattern rules

These fire when a topology anti-pattern is detected:

Pattern	Detection trigger
`single-region`	≥3 services and no global load balancer (Traffic Manager / Front Door)
`single-database`	Exactly one database service, no replication signal
`no-cache`	Compute + database present, no Redis/CDN
`no-monitoring`	No Azure Monitor / App Insights / Log Analytics
`no-identity`	No Microsoft Entra ID
`no-waf`	Public web tier without WAF / Front Door / App Gateway
`direct-db-access`	An edge from a frontend service directly into a database
`no-key-vault`	4+ services and no Key Vault
`no-backup`	Database present, no Azure Backup / Recovery Services
`no-api-gateway`	2+ compute services and no APIM / App Gateway / Front Door
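A rough sketch of how three of these triggers could be checked against a diagram. The diagram model below (`ServiceKind`, `Diagram`, `detectPatterns`) is a simplified assumption of mine for illustration; the real types and logic in `wafPatternDetector.ts` may look quite different.

```typescript
// Simplified, hypothetical diagram model; the repo's real types may differ.
type ServiceKind = "frontend" | "compute" | "database" | "monitoring" | "key-vault" | "other";

interface DiagramService { id: string; kind: ServiceKind; }
interface DiagramEdge { from: string; to: string; }
interface Diagram { services: DiagramService[]; edges: DiagramEdge[]; }

function detectPatterns(diagram: Diagram): string[] {
  const has = (kind: ServiceKind): boolean =>
    diagram.services.some((s) => s.kind === kind);
  const kindOf = (id: string): ServiceKind | undefined => {
    const match = diagram.services.filter((s) => s.id === id)[0];
    return match ? match.kind : undefined;
  };

  const patterns: string[] = [];
  // no-monitoring: no Azure Monitor / App Insights / Log Analytics
  if (!has("monitoring")) patterns.push("no-monitoring");
  // no-key-vault: 4+ services and no Key Vault
  if (diagram.services.length >= 4 && !has("key-vault")) patterns.push("no-key-vault");
  // direct-db-access: an edge from a frontend service directly into a database
  const direct = diagram.edges.some(
    (e) => kindOf(e.from) === "frontend" && kindOf(e.to) === "database"
  );
  if (direct) patterns.push("direct-db-access");
  return patterns;
}

const sketchDiagram: Diagram = {
  services: [
    { id: "web", kind: "frontend" },
    { id: "api", kind: "compute" },
    { id: "jobs", kind: "compute" },
    { id: "db", kind: "database" },
  ],
  edges: [
    { from: "web", to: "api" },
    { from: "web", to: "db" },
  ],
};

console.log(detectPatterns(sketchDiagram));
// ["no-monitoring", "no-key-vault", "direct-db-access"]
```

Each check only reads the services and edges already on the canvas, which is why this phase can run client-side with no model call.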

Service-specific rules

Every service in the generated Azure architecture diagram is matched against SERVICE_SPECIFIC_RULES by normalized type: App Service, Functions, AKS, Cosmos DB, SQL Database, Storage, Key Vault, and 22 more.
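As a sketch of what matching "by normalized type" could look like: the normalization steps (trim, lowercase, collapse whitespace, drop a leading "azure") and the two rule entries below are assumptions for illustration, not copied from `wafRules.ts`. Only the `SERVICE_SPECIFIC_RULES` name comes from the source.

```typescript
interface ServiceRule {
  id: string;
  pillar: string;
  severity: "critical" | "high" | "medium" | "low";
  issue: string;
}

// Illustrative entries only; ids and texts are invented placeholders.
const SERVICE_SPECIFIC_RULES: Record<string, ServiceRule[]> = {
  "app service": [
    { id: "example-appsvc-1", pillar: "Reliability", severity: "medium", issue: "placeholder rule text" },
  ],
  "key vault": [
    { id: "example-kv-1", pillar: "Security", severity: "high", issue: "placeholder rule text" },
  ],
};

// Assumed normalization so "Azure App Service" and "app  service" share a key.
function normalizeType(raw: string): string {
  return raw.trim().toLowerCase().replace(/\s+/g, " ").replace(/^azure /, "");
}

// Look up the rules for one service; unknown types simply match nothing.
function rulesFor(serviceType: string): ServiceRule[] {
  const key = normalizeType(serviceType);
  const rules = Object.prototype.hasOwnProperty.call(SERVICE_SPECIFIC_RULES, key)
    ? SERVICE_SPECIFIC_RULES[key]
    : undefined;
  return rules ?? [];
}

console.log(normalizeType("  Azure   App Service ")); // "app service"
console.log(rulesFor("Azure Key Vault").length);      // 1
console.log(rulesFor("Some Unknown Service").length); // 0
```

Returning an empty list for unknown types keeps the pre-scan total: a service outside the catalog produces no rule-based findings rather than an error.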

The knowledge base at a glance

Metric	Count
Total rules	73
Architecture-pattern rules	10
Service-specific rules	63
Distinct Azure services covered	29
Rules tagged Reliability	18
Rules tagged Security	34
Rules tagged Cost Optimization	5
Rules tagged Operational Excellence	7
Rules tagged Performance Efficiency	9

The preliminary score

Each finding has a severity, and severity drives a fixed point deduction from a starting score of 100:

Severity	Deduction
critical	−12
high	−7
medium	−3
low	−1

The result is floored at 10 (so even a deliberately bad architecture scores at least 10) and capped at 95 (no findings ≠ perfect; there’s always something the model might still catch). This is the deterministic baseline before the LLM ever sees the architecture, and it’s what keeps the first phase of the pipeline reproducible.
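The scoring rule above is small enough to write out. This is a minimal sketch built from the deduction table and the 10/95 bounds as stated; the function and constant names are mine, not necessarily the ones in the repo.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

// Fixed deductions from the table above.
const DEDUCTION: Record<Severity, number> = { critical: 12, high: 7, medium: 3, low: 1 };

// Start at 100, subtract per finding, then clamp to the 10..95 range.
function preliminaryScore(severities: Severity[]): number {
  const raw = severities.reduce((score, s) => score - DEDUCTION[s], 100);
  return Math.min(95, Math.max(10, raw));
}

console.log(preliminaryScore([]));                                      // 95
console.log(preliminaryScore(["critical", "high", "medium"]));          // 78
console.log(preliminaryScore(new Array<Severity>(10).fill("critical"))); // 10
```

With no findings the raw value is 100 and the cap brings it down to 95. Ten critical findings give a raw value of −20, which the floor raises to 10.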

Phase 2 — LLM contextual refinement

The pre-scan output, the topology, and the optional natural-language description are folded into a focused prompt sent to one of seven Azure-hosted models (GPT-5.1 through 5.4, GPT-5.x Codex variants, DeepSeek V3.2 Speciale, Grok 4.1 Fast). The system prompt gives the model explicit scoring guardrails:

Score based on what IS present, not what COULD be added.

A well-connected architecture with appropriate services should score 60–80.

Score below 50 only for critical gaps (no auth, no monitoring, single points of failure).

Findings are improvement suggestions, not reasons to penalize the score severely.

The model returns strict JSON:

{
  "overallScore": 0-100,
  "summary": "2–3 sentence assessment",
  "pillars": [
    {
      "pillar": "Reliability | Security | Cost Optimization | Operational Excellence | Performance Efficiency",
      "score": 0-100,
      "findings": [
        {
          "severity": "critical | high | medium | low",
          "category": "…",
          "issue": "…",
          "recommendation": "…",
          "resources": ["service-name-1", "service-name-2"],
          "source": "rule-based | ai-analysis"
        }
      ]
    }
  ],
  "quickWins": [ /* same shape as findings */ ]
}
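Since the response shape is fixed, it can be typed and checked before anything reaches the UI. The sketch below is one way to do that by hand; the interface and function names are assumptions, and the repo may validate the response differently (or not at all).

```typescript
type Severity = "critical" | "high" | "medium" | "low";
type FindingSource = "rule-based" | "ai-analysis";

interface Finding {
  severity: Severity;
  category: string;
  issue: string;
  recommendation: string;
  resources: string[];
  source: FindingSource;
}
interface PillarResult { pillar: string; score: number; findings: Finding[]; }
interface ValidationResult {
  overallScore: number;
  summary: string;
  pillars: PillarResult[];
  quickWins: Finding[];
}

const SEVERITIES: string[] = ["critical", "high", "medium", "low"];
const SOURCES: string[] = ["rule-based", "ai-analysis"];

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}
function isScore(v: unknown): v is number {
  return typeof v === "number" && v >= 0 && v <= 100;
}
function isOneOf(v: unknown, allowed: string[]): boolean {
  return typeof v === "string" && allowed.indexOf(v) !== -1;
}
function isFinding(v: unknown): v is Finding {
  if (!isRecord(v)) return false;
  const resources = v["resources"];
  return (
    isOneOf(v["severity"], SEVERITIES) &&
    isOneOf(v["source"], SOURCES) &&
    typeof v["category"] === "string" &&
    typeof v["issue"] === "string" &&
    typeof v["recommendation"] === "string" &&
    Array.isArray(resources) &&
    resources.every((r) => typeof r === "string")
  );
}
function isPillar(v: unknown): v is PillarResult {
  if (!isRecord(v)) return false;
  const findings = v["findings"];
  return (
    typeof v["pillar"] === "string" &&
    isScore(v["score"]) &&
    Array.isArray(findings) &&
    findings.every(isFinding)
  );
}

// Returns the typed result, or null if the text is not JSON or has the wrong shape.
function parseValidationResult(raw: string): ValidationResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (!isRecord(data)) return null;
  const pillars = data["pillars"];
  const quickWins = data["quickWins"];
  const ok =
    isScore(data["overallScore"]) &&
    typeof data["summary"] === "string" &&
    Array.isArray(pillars) && pillars.every(isPillar) &&
    Array.isArray(quickWins) && quickWins.every(isFinding);
  return ok ? (data as unknown as ValidationResult) : null;
}
```

Rejecting a response with an unknown `severity` or `source` matters more than it looks: the severity breakdown and the rule-based versus ai-analysis split both depend on those two fields being one of the listed values.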

Two things to call out:

Every finding is tagged rule-based or ai-analysis. That tag is the credibility lever. You can always see what the deterministic engine produced versus what the model contributed on top. If you don’t trust the AI layer, you can ignore it entirely — the rule layer still stands.

The LLM is given pattern hints, not the entire rule catalog. The prompt stays small and focused, which is roughly 3–5× faster and cheaper than asking the LLM to do everything from scratch.
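Because every finding carries its source, separating the two layers takes one filter each. A minimal sketch, with a made-up `splitBySource` helper and a trimmed-down finding type:

```typescript
type FindingSource = "rule-based" | "ai-analysis";

interface TaggedFinding {
  issue: string;
  source: FindingSource;
}

// Split findings by provenance so the deterministic layer can be read on its own.
function splitBySource(findings: TaggedFinding[]): {
  ruleBased: TaggedFinding[];
  aiAnalysis: TaggedFinding[];
} {
  return {
    ruleBased: findings.filter((f) => f.source === "rule-based"),
    aiAnalysis: findings.filter((f) => f.source === "ai-analysis"),
  };
}

const taggedSample: TaggedFinding[] = [
  { issue: "No Microsoft Entra ID for authentication", source: "rule-based" },
  { issue: "No Key Vault for secret management", source: "rule-based" },
  { issue: "App Service slots not used for safe deploys", source: "ai-analysis" },
];

const layers = splitBySource(taggedSample);
console.log(layers.ruleBased.length);  // 2
console.log(layers.aiAnalysis.length); // 1
```

Ignoring the AI layer then means reading only `ruleBased`; nothing else in the result has to change.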

What the user sees

On every run the modal reports:

Overall WAF score (0–100)

Five per-pillar scores (0–100 each)

Severity breakdown — counts of critical / high / medium / low across all findings

Quick wins — high-impact, low-effort items the model surfaces separately

Hybrid metadata — local findings count, patterns detected, KB rules used, preliminary score, local elapsed ms

AI metrics — model used, reasoning effort, prompt/completion/total tokens, elapsed time

App Insights telemetry — an Architecture_Validated event with model, overall score, finding count, elapsed time
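The severity breakdown in that list is a plain tally over all pillars' findings. A sketch of how it could be computed, with hypothetical type and function names and illustrative sample data:

```typescript
type Severity = "critical" | "high" | "medium" | "low";

interface PillarFindings {
  pillar: string;
  findings: { severity: Severity }[];
}

// Count findings per severity across every pillar.
function severityBreakdown(pillars: PillarFindings[]): Record<Severity, number> {
  const counts: Record<Severity, number> = { critical: 0, high: 0, medium: 0, low: 0 };
  pillars.forEach((p) => {
    p.findings.forEach((f) => {
      counts[f.severity] += 1;
    });
  });
  return counts;
}

const breakdownSample: PillarFindings[] = [
  { pillar: "Security", findings: [{ severity: "critical" }, { severity: "high" }] },
  { pillar: "Operational Excellence", findings: [{ severity: "medium" }] },
  { pillar: "Reliability", findings: [{ severity: "medium" }] },
  { pillar: "Performance Efficiency", findings: [{ severity: "low" }] },
  { pillar: "Cost Optimization", findings: [] },
];

console.log(severityBreakdown(breakdownSample));
// { critical: 1, high: 1, medium: 2, low: 1 }
```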

Worked example

Take this prompt, which I’ve used in demos with partners:

“A multi-region web application: Azure Front Door in front of two App Service instances in West US 2 and East US 2, both reading from an Azure SQL Database with geo-replication, with Application Insights for telemetry. No Entra ID, no Key Vault.”

After generation, Validate Architecture runs:

Phase 1 — pre-scan (deterministic), ~1 ms

Patterns detected: no-identity, no-key-vault

Findings produced: 8 (1 critical, 1 high, 3 medium, 3 low)

Preliminary score: 100 − (1×12) − (1×7) − (3×3) − (3×1) = 69

Phase 2 — LLM refinement, ~6–9 s depending on model

The model accepts the two pattern hints, validates them in context, and adds three more findings of its own:

Finding	Source	Pillar	Severity
No Microsoft Entra ID for authentication	rule-based	Security	critical
No Key Vault for secret management	rule-based	Security	high
App Service slots not used for safe deploys	ai-analysis	Operational Excellence	medium
SQL DB geo-replication present but RTO/RPO not documented	ai-analysis	Reliability	medium
No CDN for static assets behind Front Door	ai-analysis	Performance Efficiency	low

Final scores returned by the model:

Pillar	Score
Reliability	78
Security	52
Cost Optimization	80
Operational Excellence	70
Performance Efficiency	75
Overall	71

The Security score is the lowest because the two highest-severity findings, one critical and one high, both landed there: exactly what a human reviewer would flag first.
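One arithmetic observation on that table: the overall 71 equals the unweighted mean of the five pillar scores. The overall score is returned by the model, so treat the check below as a sanity test on this one example, not as a statement of how the model derives it.

```typescript
// Pillar scores copied from the table above.
const examplePillarScores = [
  { pillar: "Reliability", score: 78 },
  { pillar: "Security", score: 52 },
  { pillar: "Cost Optimization", score: 80 },
  { pillar: "Operational Excellence", score: 70 },
  { pillar: "Performance Efficiency", score: 75 },
];

// Unweighted arithmetic mean.
function meanScore(scores: number[]): number {
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

const exampleMean = meanScore(examplePillarScores.map((p) => p.score));
console.log(exampleMean); // 71 (355 / 5)
```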

Multi-model comparison

Because the deterministic floor is identical across runs, the Validation Comparison view becomes a fair shootout of what each LLM adds on top of the same baseline. The same diagram is scored by all seven models, and the UI surfaces:

Overall score per model

Per-pillar score per model

Severity-count deltas

Number of ai-analysis findings each model contributed

Quick wins each model identified

This is genuinely useful for two reasons. First, it shows that LLM scores vary — typically by ±5–10 points on the same architecture — which is exactly why we publish the rule-based vs ai-analysis tag. Second, it lets architects pick the model whose review style matches their own.
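A sketch of the kind of summary the comparison boils down to: the spread of overall scores across models for one diagram. The `ModelRun` shape, the model labels, and the numbers are all invented for illustration; they are not measurements.

```typescript
interface ModelRun {
  model: string;          // placeholder label, not a real deployment name
  overallScore: number;   // 0-100, as returned by that model
  aiFindingCount: number; // findings tagged ai-analysis in that run
}

// Lowest, highest, and range of overall scores; null when there are no runs.
function scoreSpread(runs: ModelRun[]): { min: number; max: number; spread: number } | null {
  if (runs.length === 0) return null;
  const scores = runs.map((r) => r.overallScore);
  const min = scores.reduce((a, b) => Math.min(a, b));
  const max = scores.reduce((a, b) => Math.max(a, b));
  return { min, max, spread: max - min };
}

// Invented numbers for the same diagram scored by three hypothetical models.
const comparisonRuns: ModelRun[] = [
  { model: "model-a", overallScore: 68, aiFindingCount: 2 },
  { model: "model-b", overallScore: 71, aiFindingCount: 3 },
  { model: "model-c", overallScore: 76, aiFindingCount: 5 },
];

console.log(scoreSpread(comparisonRuns)); // { min: 68, max: 76, spread: 8 }
```

A spread in this range is a reminder to read the score as directional and to anchor on the rule-based findings, which do not move between models.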

How we align with Microsoft’s algorithms

Alignment point	What it means
Same five pillars	Identical names and scope to the official WAF
Same source material	Rules derived from WAF docs and Azure Architecture Center service guides
Severity-graded findings	Map conceptually to Advisor’s high/medium/low impact recommendations
Per-pillar + overall scoring	Mirrors WAR/Advisor output shape, so the results feel familiar

Where we deliberately differ — and why

Concern	Microsoft	Diagram Builder	Why we differ
Needs deployed resources	Advisor: yes	No — works on a diagram	We’re a design-time tool; the architecture doesn’t exist yet
Needs human Q&A	WAR: yes	No — derived from the diagram	One-click validation inside the design flow
Healthy/Applicable ratio	Advisor: yes	No	No resource-health signal exists pre-deployment
Subcategory fixed weights	Advisor: yes	No explicit weights	Severity is the de-facto weight (12/7/3/1)
Defender Secure Score for Security	Advisor: yes	No	Defender requires deployed resources
Cost-weighted scoring	Advisor: yes	No (separate Cost Estimation feature)	Cost is a separate pipeline in our app
AI/LLM refinement	Neither	Yes	Catches context-specific issues a static catalog misses, and explains findings in natural language
Multi-model comparison	Neither	Yes	Lets architects see scoring variance across models

Honest limitations

I’d rather you hear these from me than discover them in production:

LLM scores drift. ±5–10 points across models on the same diagram is normal. Treat the score as directional, the findings as actionable. The rule-based tag is your anchor.

No live telemetry. We can’t know if your App Service is actually using availability zones — only that you have App Service in the diagram. Advisor will tell you the truth post-deployment.

Generic ruleset. No specialized workload branches yet (AI/ML, IoT, SAP, SaaS). WAR has those.

No milestone tracking. Each validation run is independent. Compare runs manually using the Validation Comparison view.

Rule coverage is finite. 29 services and 73 rules is a strong start but not exhaustive — the LLM layer exists in part to compensate for that gap.

How to use all three together

A lifecycle that actually works:

Design — Use the Diagram Builder to sketch the architecture and validate at design time. Iterate until the per-pillar scores look reasonable and the critical/high findings are addressed.

Deploy — Generate Bicep from the diagram, deploy, and let Azure Advisor start scoring real resources.

Operate — Use Azure Advisor continuously. Use Defender Secure Score for security posture.

Periodic review — Run a Core WAR every quarter or at major milestones to capture the things only humans know (business context, tradeoffs, planned debt).

None of these three replaces the others. They cover different stages of the same loop.

What’s next

A few things on the roadmap I’d love feedback on:

Milestone tracking so design-time scores can be compared over time the way WAR milestones work.

Workload-specific rulesets mirroring WAR’s branches — starting with AI/ML.

Direct Advisor handoff — once a diagram is deployed, surface the corresponding Advisor recommendations in the same UI to close the loop.

Try it, fork it, tell me where it’s wrong

Live app: https://aka.ms/diagram-builder

Source: github.com/Arturo-Quiroga-MSFT/azure-architecture-diagram-builder

Useful references:

Azure Well-Architected Framework pillars

Azure Well-Architected Review tool

Azure Advisor score — calculation

Use Azure WAF assessments (Advisor)

Complete an Azure Well-Architected Review assessment

If you’re a partner or customer architect who’s already living in Advisor and WAR, I’d genuinely value your reaction — does the design-time stage feel like a real gap to you, or are you already covering it some other way? Open an issue on the repo or reply on LinkedIn.

Posted on the Azure Architecture Blog · Comments and issues welcome on the repo.
