May 8, 2025

Ever tried to grade an AI's homework in healthcare? It's trickier than you might think.
Imagine asking an AI to summarize a patient’s condition from their clinical notes: “What’s the main health concern for this hyperlipidaemia patient?”
The reference answer might be: “Patient’s cholesterol levels are abnormally high, increasing risk for stroke and heart attack.”
Two different AI models respond:
- “The patient’s cholesterol readings are excessively high, which poses an increased risk for stroke and heart disease.”
- “The patient has been found to have high cholesterol levels, which may put them at greater risk for stroke and heart attack.”
Both seem fine, right? But what if a third AI says something completely wrong:
- “Patient’s cholesterol levels are significantly low, minimizing the risk for stroke and heart attack.”
This last response completely flips the clinical meaning—yet surprisingly, when we run standard text comparison metrics like F1, BLEU, ROUGE, and METEOR, this wrong answer scores higher than the clinically accurate ones!
When Traditional Metrics Fail Us
Here’s the problem in black and white (or rather red and green, just to illustrate). Look at these scores comparing our three responses to the reference answer:
| Response | F1 | BLEU | ROUGE | METEOR |
| --- | --- | --- | --- | --- |
| Response 1 (correct) | 0.75 | 0.349 | 0.533 | 0.760 |
| Response 2 (correct) | 0.80 | 0.286 | 0.400 | 0.659 |
| Response 3 (WRONG) | 0.833 | 0.605 | 0.741 | 0.788 |
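If you'd like to reproduce this failure mode yourself, here is a minimal sketch of how these surface scores can be computed with common Python libraries (nltk and rouge-score). It is illustrative only: tokenization choices and library versions shift the exact numbers, and we are not claiming this is the precise setup behind the table above.

```python
# pip install nltk rouge-score
from collections import Counter

import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)   # METEOR needs WordNet
nltk.download("omw-1.4", quiet=True)

reference = ("Patient's cholesterol levels are abnormally high, "
             "increasing risk for stroke and heart attack.")
candidate = ("Patient's cholesterol levels are significantly low, "
             "minimizing the risk for stroke and heart attack.")  # the WRONG response

ref_tokens, cand_tokens = reference.lower().split(), candidate.lower().split()

# Token-level F1 (SQuAD-style): harmonic mean of precision/recall over shared tokens.
overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
precision, recall = overlap / len(cand_tokens), overlap / len(ref_tokens)
f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0

# BLEU needs smoothing on short single sentences to avoid zero n-gram counts.
bleu = sentence_bleu([ref_tokens], cand_tokens,
                     smoothing_function=SmoothingFunction().method1)

rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure
meteor = meteor_score([ref_tokens], cand_tokens)

# The clinically opposite sentence still scores high on every surface metric.
print(f"F1={f1:.3f}  BLEU={bleu:.3f}  ROUGE-L={rouge_l:.3f}  METEOR={meteor:.3f}")
```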
The clinically opposite statement received the highest similarity scores! This isn’t just an academic concern—in healthcare, this kind of evaluation failure could lead to dangerous decisions.
Trying More Advanced Methods
We next tried some fancier approaches that claim to capture deeper meaning, like BERTScore, ClinicalBERT, and MoverScore:
| Response | BERTScore | ClinicalBERT | MoverScore |
| --- | --- | --- | --- |
| Response 1 (correct) | 0.676 | 0.911 | 0.982 |
| Response 2 (correct) | 0.594 | 0.926 | 0.963 |
| Response 3 (WRONG) | 0.768 | 0.965 | 0.991 |
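These semantic scores are just as easy to reproduce. Below is a minimal sketch using the bert-score package; the default English model and the Bio_ClinicalBERT checkpoint named in the comment are our assumptions for illustration, not confirmed details of the experiment.

```python
# pip install bert-score
from bert_score import score

reference = "Patient's cholesterol levels are abnormally high, increasing risk for stroke and heart attack."
candidates = [
    "The patient's cholesterol readings are excessively high, which poses an increased risk for stroke and heart disease.",  # correct
    "Patient's cholesterol levels are significantly low, minimizing the risk for stroke and heart attack.",                  # WRONG
]

# lang="en" selects bert-score's default English model. For a clinical variant
# you might swap in a domain checkpoint, e.g. (our assumption, not a confirmed
# detail of the post):
#   score(..., model_type="emilyalsentzer/Bio_ClinicalBERT", num_layers=12)
P, R, F1 = score(candidates, [reference] * len(candidates), lang="en")

for cand, f in zip(candidates, F1.tolist()):
    print(f"{f:.3f}  {cand[:60]}...")
```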
Same problem! The clinically opposite statement still wins.
Finally Getting Closer
Only when we used a larger, fine-tuned clinical embedding model did we start seeing better results:
| Response | Cosine Similarity | Normalized Euclidean Similarity |
| --- | --- | --- |
| Response 1 (correct) | 0.934 | 0.845 |
| Response 2 (correct) | 0.935 | 0.846 |
| Response 3 (WRONG) | 0.892 | 0.802 |
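The appendix lists MedEmbed-large-v0.1 as the embedding model, so a sketch along the following lines should get you close, assuming the checkpoint loads as a standard SentenceTransformer model. Note that "normalized Euclidean similarity" here uses the common 1 / (1 + distance) mapping, which may not match the exact normalization behind the table.

```python
# pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: the MedEmbed checkpoint from the appendix loads as a standard
# SentenceTransformer model.
model = SentenceTransformer("abhinand/MedEmbed-large-v0.1")

reference = "Patient's cholesterol levels are abnormally high, increasing risk for stroke and heart attack."
candidate = "Patient's cholesterol levels are significantly low, minimizing the risk for stroke and heart attack."

ref_vec, cand_vec = model.encode([reference, candidate])

cosine = float(np.dot(ref_vec, cand_vec) /
               (np.linalg.norm(ref_vec) * np.linalg.norm(cand_vec)))

# A common way to map a distance into (0, 1]; the post's exact normalization may differ.
norm_euclidean = 1.0 / (1.0 + np.linalg.norm(ref_vec - cand_vec))
norm_manhattan = 1.0 / (1.0 + np.abs(ref_vec - cand_vec).sum())

print(f"cosine={cosine:.3f}  norm_euclidean={norm_euclidean:.3f}  norm_manhattan={norm_manhattan:.3f}")
```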
Finally, the wrong answer scores lower—but the difference is still only about 5%. That’s cutting it way too close for healthcare decisions!
Why This Matters
Language is incredibly flexible—it’s what makes AI responses so powerful, but also what makes them hard to evaluate:
- Words mean different things in different contexts
- The same idea can be expressed in countless ways
- We can emphasize or de-emphasize details while keeping the core message
- Medical terminology varies by specialty, institution, and audience
For example, “high cholesterol levels” means something specific to a clinician (a numerical range) but might just register as “unhealthy” to a patient.
Here are a few pairs of terms that clinicians might use to describe the same or closely related conditions:
- Hyperlipidaemia and elevated cholesterol levels (from our example above)
- Atrial fibrillation and irregular heart rhythm
- Gastroesophageal reflux disease (GERD) and acid reflux
- Osteoarthritis and degenerative joint disease
- Pneumonia and lower respiratory tract infection
- Myocardial infarction and acute coronary syndrome
- Hypertension and elevated blood pressure
- Type 2 diabetes and insulin resistance
- Chronic obstructive pulmonary disease (COPD) and emphysema
- Renal insufficiency and decreased kidney function
Each pair represents related or overlapping clinical concepts that may be phrased differently in medical documentation while carrying the same core meaning. Handling this kind of variation without losing the clinical signal is exactly the challenge evaluation metrics face when assessing AI outputs in healthcare.
Why Not Just Use Multiple Choice?
Many benchmarks take the easy way out by using multiple-choice questions instead of evaluating open text. But real clinical scenarios rarely present as multiple choice! Clinicians need to interpret complex, nuanced information and generate appropriate responses—exactly what we’re asking AI to do.
Our Experiment
To dig deeper, we conducted a short experiment with:
- 20 short, clinical paragraphs across various specialties
- 10 clinically equivalent variations of each (different phrasing, same meaning)
- 3 clinically altered versions with opposite meanings (we call these mutations)
- Analysis using multiple metrics to see which could reliably distinguish correct from incorrect (a skeleton of this loop is sketched below)
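Here is roughly what that analysis loop looks like in code. The token-level F1 stands in for any of the metrics discussed above, and the "separates" criterion (every equivalent rewrite must outscore every mutation) is our paraphrase of "reliably distinguish", not an official statistic from the experiment.

```python
from collections import Counter

def token_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1, standing in for any metric under evaluation."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    same = sum((Counter(ref) & Counter(cand)).values())
    if same == 0:
        return 0.0
    p, r = same / len(cand), same / len(ref)
    return 2 * p * r / (p + r)

def separates(metric, item) -> bool:
    """A metric cleanly handles an item if every clinically equivalent
    rewrite outscores every clinically opposite mutation."""
    alts = [metric(item["reference"], a) for a in item["alternatives"]]
    muts = [metric(item["reference"], m) for m in item["mutations"]]
    return min(alts) > max(muts)

# One item from the dataset (full structure shown in the appendix).
item = {
    "reference": "Patient's cholesterol levels are abnormally high, increasing risk for stroke and heart attack",
    "alternatives": [
        "The patient's cholesterol levels are elevated, which raises the likelihood of experiencing a stroke or heart attack.",
    ],
    "mutations": [
        "Patient's cholesterol levels are significantly low, minimizing the risk for stroke and heart attack",
    ],
}
print(separates(token_f1, item))  # surface metrics frequently fail this check
```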
A Quick Note Before We Dive In
This blog explores the challenges of evaluating AI responses in healthcare using various metrics. While we’ve used real examples and data visualizations to illustrate our points, this isn’t a peer-reviewed scientific paper advocating for specific methodologies. Rather, we’re sharing practical insights from our exploration to highlight an important issue: traditional text evaluation metrics often fall short when clinical accuracy is at stake. Our goal is to spark thoughtful discussion about how we benchmark AI in healthcare—because when it comes to clinical information, the way we measure “right” and “wrong” truly matters.
What We Found
Our analysis revealed that:
- Surface metrics (ROUGE, F1, BLEU, etc.) performed poorly, often unable to distinguish between clinically accurate and inaccurate responses.
- Semantic metrics (BERTScore, ClinicalBERT) performed better but still showed concerning overlap between correct and incorrect responses.
- Vector-based metrics using domain-specific embeddings showed the most promise for practical use.
- The model-as-a-judge approach (using an AI to evaluate other AIs) aligned surprisingly well with human judgment when properly designed.
Quick Takeaway (if you don’t have time to read further)
Benchmarking AI responses in healthcare isn’t just an academic exercise—it’s a patient safety issue. Before adopting any evaluation framework, understand what it’s actually measuring. The most accessible and most widely used metrics (like BLEU or ROUGE) are often the least reliable for clinical content, while the most effective approaches may require more resources but deliver crucial accuracy.
As we continue integrating AI into healthcare, getting evaluation right isn’t optional—it’s essential. The metrics we choose will determine whether we can trust AI systems to deliver accurate information when it matters most.
If you are still curious, please read on for details…
Findings: Breaking Down Our Metrics Comparison
Figure 1: box plot comparing various similarity metrics between “Alternatives” and “Mutations.” The x-axis represents different metrics, while the y-axis shows the score. Blue boxes represent “Alternative” and red boxes represent “Mutation.” Each box plot displays the distribution of scores for each sentence and metric with median lines and outliers.
Let’s take a closer look at the data behind our findings. This chart shows how different metrics performed when comparing clinically similar alternatives (blue) versus clinically opposite mutations (red).
What This Chart Tells Us
The graphic powerfully illustrates our main point: not all metrics are created equal when evaluating clinical text.
Surface Metrics (Left Side)
Looking at the first five metrics (ROUGE, F1, BLEU, METEOR, Jaccard Index), you can see significant overlap between the blue and red boxes. This visual confirms what we found in our example—traditional metrics often can’t reliably distinguish between clinically accurate and inaccurate content:
- ROUGE and F1: Notice how the red boxes (mutations) frequently overlap with or even rise above the blue boxes (alternatives).
- BLEU: Performs particularly poorly with wide distribution and substantial overlap.
- METEOR: Slightly better separation but still too much overlap to be reliable.
- Jaccard Index: Shows similar problems with inconsistent differentiation.
This overlap is precisely why surface metrics often fail us in healthcare contexts—they’re simply not sensitive to the crucial clinical distinctions.
Semantic Metrics (Middle)
Moving right to BERTScore, ClinicalBERT, and MoverScore, we see improvement:
The blue boxes generally sit higher than the red ones, but there's still concerning overlap between the distributions. ClinicalBERT performs better than the general-purpose BERTScore, though not by the margin we'd hope for in clinical applications.
Vector Metrics and Model-as-Judge (Right Side)
The rightmost metrics show the most promise:
- Cosine Similarity: Clear separation between alternatives and mutations.
- Normalized Euclidean Similarity: Even better differentiation with minimal overlap.
- Normalized Manhattan Similarity: Consistent distinction between clinically valid and invalid content.
- Judge Score: The most dramatic separation, with no overlap between the distributions.
The Big Picture
This visualization brings home our key message: the metrics you choose matter enormously. The rightmost approaches (vector-based metrics and model-as-a-judge) show the clearest differentiation between clinically similar and opposite statements—precisely what’s needed for reliable healthcare AI evaluation.
Note how the Judge Score (far right) shows the most dramatic separation, with the blue box hovering near perfect scores while mutations cluster distinctly lower without any overlap. This supports our hypothesis that properly designed “model-as-a-judge” approaches align best with human clinical judgment.
The progression from left to right visually represents the evolution of evaluation methods—from simple word-matching techniques that frequently fail to sophisticated approaches that can detect subtle but critical clinical distinctions.
For healthcare applications, this data makes a compelling case for investing in more advanced evaluation methods, even when they require additional effort and computational resources. When patient safety is at stake, the ability to reliably distinguish between clinically accurate and inaccurate content isn’t just nice to have—it’s essential.
Other Findings and Considerations
Beyond Our Study: The Evolving Landscape of AI Evaluation
While our findings clearly favor more sophisticated metrics, the field of AI evaluation continues to evolve rapidly. Let’s consider both practical trade-offs and emerging approaches:
Resource Considerations
Each evaluation approach represents different trade-offs:
- Surface metrics offer computational efficiency and reproducible results but sacrifice clinical accuracy, a dangerous compromise in healthcare.
- Semantic and vector-based metrics strike a better balance between resource demands and performance while still being easily reproducible, making them practical for many healthcare applications.
- Model-as-a-judge approaches provide greater accuracy but require more computational resources. This method scales better than human evaluation, but it is influenced by the prompting and the biases of the judge model, and it requires clearly defined evaluation criteria.
Methods We Didn’t Cover
Our study represents just a slice of available evaluation techniques:
- Human evaluation panels: While expensive and time-consuming, expert clinician review remains the gold standard for critical applications.
- Multi-dimensional evaluation frameworks: Systems that combine multiple metrics to create composite scores addressing different aspects of response quality.
- Task-specific performance metrics: Evaluations based on downstream clinical decision quality rather than text similarity alone.
- Reference-free evaluation: Approaches that assess clinical text quality without requiring comparison to a specific reference answer.
- Evaluating longer texts: Real-world scenarios often involve longer texts with multiple details, which increases evaluation complexity but is essential for meaningful results.
The Road Ahead
AI evaluation is advancing quickly, with several promising developments on the horizon:
- Counterfactual testing: Systematically testing models with subtle variations to identify clinical reasoning failures.
- Self-consistency evaluation: Checking whether AI responses remain clinically accurate across different phrasings of the same question.
- Alignment verification techniques: Methods to ensure models consistently prioritize clinical accuracy over linguistic fluency.
- Hybrid human-AI evaluation pipelines: Systems that use AI for initial screening but escalate edge cases to human experts.
Conclusion: Choose Your Metrics Wisely
As AI continues its march into healthcare, the methods we use to evaluate these systems matter profoundly. Even our small experiment demonstrates that traditional metrics often fail to capture critical clinical distinctions—potentially allowing dangerous errors to slip through evaluation frameworks that look robust on paper.
Figure 2: Different evaluation options offered in Azure AI Foundry, including the Similarity Evaluator (model-based)
The most reliable approaches, like vector-based metrics and carefully designed model-as-judge frameworks, require greater investment but deliver the accuracy needed for high-stakes healthcare applications. As these methods continue to mature, we expect evaluation systems to become both more accurate and more resource-efficient.
For teams working with healthcare AI today, the message is clear: understand what your evaluation metrics are actually measuring, choose methods appropriate to the clinical sensitivity of your application, and don't rely solely on traditional NLP metrics that may miss critical clinical errors. Microsoft offers evaluation methods in Azure AI Foundry and Azure ML. We will keep providing tools tailored to healthcare partners and customers so you can benchmark, annotate, and evaluate any dataset, including private ones, within your own data estates and endpoints. Stay tuned for more information coming soon!
Figure 3: Azure AI Foundry also provides a simple interface for human / manual evaluation
The future of AI in healthcare depends not just on building powerful models, but on our ability to rigorously evaluate whether they’re delivering clinically sound information when it matters most. With thoughtful approaches to evaluation, we can harness AI’s potential while maintaining the high standards patients deserve.
Appendix
Summary of mutation scores compared to alternatives
Figures 4, 5, and 6: Judge Score (using the Phi-4 model as a judge) has 100% alignment with human evaluation, with vector metrics showing the next-best alignment.
Sentence Similarity Distribution Across Metrics
Figures 7 and 8: Distribution of scores for similar texts (blue) and mutated, clinically opposite texts (red) for each of the 20 reference paragraphs, grouped by metric class.
Referenced Tools and Resources
- Comprehensive evaluation framework for LLM-generated content
- Embeddings: kronos483/MedEmbed-large-v0.1:latest (https://huggingface.co/abhinand/MedEmbed-large-v0.1)
- Similar sentences generated with GPT-4o mini, with human review
- Model-as-a-judge: Phi-4
Prompt used:
system
####### Instructions
Your mission as a medical professional entails meticulous analysis of a clinical paragraph containing one or more sentences verified to be accurate, and comparing it to a series of alternatives that follow.
The reference paragraph is this:
{{reference}}
Please compare each of the following paragraphs to the reference and give it a grade between 0 and 10.
{{alternatives}}
{{mutations}}
Grade of 10 means the sentences are clinically of same meaning, such that a trained medical professional would use the information in the same way to treat a patient as the reference paragraph.
Grade of 0 means the sentences are clinically orthogonal, such that a trained medical professional would use the information to derive a completely different, even harmful treatment.
You should give gradual grades between 0 and 10 as you judge how close the clinical meanings are according to these evaluation criteria.
Your output should be in JSON format, with the reference sentence on top, then the index of each sentence in the order you see them, the sentence being graded, and the grade, like this:
{
  "reference": "this is first reference paragraph",
  "results": [
    {
      "index": "1",
      "sentence": "This is the first sentence to compare",
      "grade": "8"
    },
    {
      "index": "2",
      "sentence": "This is the second sentence to compare",
      "grade": "7"
    },
    ...
  ]
}
{
  "reference": "this is second reference paragraph",
  "results": [
    ...
  ]
}
assistant
Your response:
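For completeness, here is one way the judge call might be wired up. The appendix specifies only the model (Phi-4) and the prompt above; the OpenAI-compatible client, local endpoint URL, and JSON parsing below are our assumptions for illustration.

```python
# pip install openai
import json
from openai import OpenAI

# Assumption: Phi-4 is served behind an OpenAI-compatible endpoint (e.g., a
# local inference server); the URL and model name below are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def judge(system_prompt: str, reference: str,
          alternatives: list[str], mutations: list[str]) -> dict:
    # Fill the {{...}} placeholders from the prompt template above.
    filled = (system_prompt
              .replace("{{reference}}", reference)
              .replace("{{alternatives}}", "\n".join(alternatives))
              .replace("{{mutations}}", "\n".join(mutations)))
    resp = client.chat.completions.create(
        model="phi-4",
        messages=[
            {"role": "system", "content": filled},
            # The template primes the model with "Your response:"; we pass it
            # as a user turn here for simplicity.
            {"role": "user", "content": "Your response:"},
        ],
        temperature=0,  # deterministic grading
    )
    # Assumes the model returns clean JSON, as the prompt instructs.
    return json.loads(resp.choices[0].message.content)
```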
Example from similar clinical sentence dataset
{
  "reference": "Patient's cholesterol levels are abnormally high, increasing risk for stroke and heart attack",
  "alternatives": [
    "The patient's cholesterol levels are elevated, which raises the likelihood of experiencing a stroke or heart attack.",
    "Cholesterol levels in the patient are significantly high, heightening the risk for both stroke and heart attack.",
    "The patient exhibits high cholesterol levels, which may lead to an increased risk of stroke and myocardial infarction.",
    "Abnormal cholesterol levels have been noted in the patient, contributing to a higher risk of stroke and cardiac events.",
    "The patient's cholesterol readings are excessively high, which poses an increased risk for stroke and heart disease.",
    "Elevated cholesterol levels have been observed in the patient, which could potentially increase the chances of stroke and heart attack.",
    "The patient's lipid profile shows high cholesterol levels, thereby increasing the risk for cerebrovascular accidents and myocardial infarctions.",
    "Cholesterol levels in the patient are concerningly high, which could elevate the risk for both stroke and heart attacks.",
    "The patient has been found to have high cholesterol levels, which may put them at greater risk for stroke and heart attack.",
    "High cholesterol levels have been detected in the patient, raising concerns about the potential for stroke and heart attack."
  ],
  "mutations": [
    "Patient's cholesterol levels are significantly low, minimizing the risk for stroke and heart attack",
    "Patient's cholesterol levels are within optimal range, thereby decreasing the likelihood of stroke and heart attack",
    "Patient's cholesterol levels are moderately elevated, warranting routine monitoring but not significantly impacting stroke and heart attack risk"
  ]
}