May 8, 2025

Ever tried to grade an AI's homework in healthcare? It's trickier than you might think.
Imagine asking an AI to summarize a patient’s condition from their clinical notes: “What’s the main health concern for this hyperlipidaemia patient?”
The reference answer might be: “Patient’s cholesterol levels are abnormally high, increasing risk for stroke and heart attack.”
Two different AI models respond:
- “The patient’s cholesterol readings are excessively high, which poses an increased risk for stroke and heart disease.”
- “The patient has been found to have high cholesterol levels, which may put them at greater risk for stroke and heart attack.”
Both seem fine, right? But what if a third AI says something completely wrong:
- “Patient’s cholesterol levels are significantly low, minimizing the risk for stroke and heart attack.”
This last response completely flips the clinical meaning—yet surprisingly, when we run standard text comparison metrics like F1, BLEU, ROUGE, and METEOR, this wrong answer scores higher than the clinically accurate ones!
When Traditional Metrics Fail Us
Here’s the problem in black and white (or rather red and green, just to illustrate). Look at these scores comparing our three responses to the reference answer:
| Response | F1 | BLEU | ROUGE | METEOR |
| --- | --- | --- | --- | --- |
| Response 1 (correct) | 0.75 | 0.349 | 0.533 | 0.760 |
| Response 2 (correct) | 0.80 | 0.286 | 0.400 | 0.659 |
| Response 3 (WRONG) | 0.833 | 0.605 | 0.741 | 0.788 |
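If you'd like to reproduce this failure mode yourself, here is a minimal sketch of how these surface scores can be computed with common Python libraries (nltk and rouge-score). It is illustrative only: tokenization choices and library versions shift the exact numbers, and we are not claiming this is the precise setup behind the table above.

```python
# pip install nltk rouge-score
from collections import Counter

import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)   # METEOR needs WordNet
nltk.download("omw-1.4", quiet=True)

reference = ("Patient's cholesterol levels are abnormally high, "
             "increasing risk for stroke and heart attack.")
candidate = ("Patient's cholesterol levels are significantly low, "
             "minimizing the risk for stroke and heart attack.")  # the WRONG response

ref_tokens, cand_tokens = reference.lower().split(), candidate.lower().split()

# Token-level F1 (SQuAD-style): harmonic mean of precision/recall over shared tokens.
overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
precision, recall = overlap / len(cand_tokens), overlap / len(ref_tokens)
f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0

# BLEU needs smoothing on short single sentences to avoid zero n-gram counts.
bleu = sentence_bleu([ref_tokens], cand_tokens,
                     smoothing_function=SmoothingFunction().method1)

rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure
meteor = meteor_score([ref_tokens], cand_tokens)

# The clinically opposite sentence still scores high on every surface metric.
print(f"F1={f1:.3f}  BLEU={bleu:.3f}  ROUGE-L={rouge_l:.3f}  METEOR={meteor:.3f}")
```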
The clinically opposite statement received the highest similarity scores! This isn’t just an academic concern—in healthcare, this kind of evaluation failure could lead to dangerous decisions.
Trying More Advanced Methods
We next tried some fancier approaches that claim to capture deeper meaning, like BERTScore, ClinicalBERT, and MoverScore:
| Response | BERTScore | ClinicalBERT | MoverScore |
| --- | --- | --- | --- |
| Response 1 (correct) | 0.676 | 0.911 | 0.982 |
| Response 2 (correct) | 0.594 | 0.926 | 0.963 |
| Response 3 (WRONG) | 0.768 | 0.965 | 0.991 |
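These semantic scores are just as easy to reproduce. Below is a minimal sketch using the bert-score package; the default English model and the Bio_ClinicalBERT checkpoint named in the comment are our assumptions for illustration, not confirmed details of the experiment.

```python
# pip install bert-score
from bert_score import score

reference = "Patient's cholesterol levels are abnormally high, increasing risk for stroke and heart attack."
candidates = [
    "The patient's cholesterol readings are excessively high, which poses an increased risk for stroke and heart disease.",  # correct
    "Patient's cholesterol levels are significantly low, minimizing the risk for stroke and heart attack.",                  # WRONG
]

# lang="en" selects bert-score's default English model. For a clinical variant
# you might swap in a domain checkpoint, e.g. (our assumption, not a confirmed
# detail of the post):
#   score(..., model_type="emilyalsentzer/Bio_ClinicalBERT", num_layers=12)
P, R, F1 = score(candidates, [reference] * len(candidates), lang="en")

for cand, f in zip(candidates, F1.tolist()):
    print(f"{f:.3f}  {cand[:60]}...")
```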
Same problem! The clinically opposite statement still wins.
Finally Getting Closer
Only when we used a larger, fine-tuned clinical embedding model did we start seeing better results:
| Response | Cosine Similarity | Normalized Euclidean Similarity |
| --- | --- | --- |
| Response 1 (correct) | 0.934 | 0.845 |
| Response 2 (correct) | 0.935 | 0.846 |
| Response 3 (WRONG) | 0.892 | 0.802 |
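The appendix lists MedEmbed-large-v0.1 as the embedding model, so a sketch along the following lines should get you close, assuming the checkpoint loads as a standard SentenceTransformer model. Note that "normalized Euclidean similarity" here uses the common 1 / (1 + distance) mapping, which may not match the exact normalization behind the table.

```python
# pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: the MedEmbed checkpoint from the appendix loads as a standard
# SentenceTransformer model.
model = SentenceTransformer("abhinand/MedEmbed-large-v0.1")

reference = "Patient's cholesterol levels are abnormally high, increasing risk for stroke and heart attack."
candidate = "Patient's cholesterol levels are significantly low, minimizing the risk for stroke and heart attack."

ref_vec, cand_vec = model.encode([reference, candidate])

cosine = float(np.dot(ref_vec, cand_vec) /
               (np.linalg.norm(ref_vec) * np.linalg.norm(cand_vec)))

# A common way to map a distance into (0, 1]; the post's exact normalization may differ.
norm_euclidean = 1.0 / (1.0 + np.linalg.norm(ref_vec - cand_vec))
norm_manhattan = 1.0 / (1.0 + np.abs(ref_vec - cand_vec).sum())

print(f"cosine={cosine:.3f}  norm_euclidean={norm_euclidean:.3f}  norm_manhattan={norm_manhattan:.3f}")
```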
Finally, the wrong answer scores lower—but the difference is still only about 5%. That’s cutting it way too close for healthcare decisions!
Why This Matters
Language is incredibly flexible—it’s what makes AI responses so powerful, but also what makes them hard to evaluate:
- Words mean different things in different contexts
- The same idea can be expressed in countless ways
- We can emphasize or de-emphasize details while keeping the core message
- Medical terminology varies by specialty, institution, and audience
For example, “high cholesterol levels” means something specific to a clinician (a numerical range) but might just register as “unhealthy” to a patient.
Here are a few pairs of terms that clinicians might use to describe the same or closely related conditions:
- Hyperlipidaemia and elevated cholesterol levels (from our example above)
- Atrial fibrillation and irregular heart rhythm
- Gastroesophageal reflux disease (GERD) and acid reflux
- Osteoarthritis and degenerative joint disease
- Pneumonia and lower respiratory tract infection
- Myocardial infarction and acute coronary syndrome
- Hypertension and elevated blood pressure
- Type 2 diabetes and insulin resistance
- Chronic obstructive pulmonary disease (COPD) and emphysema
- Renal insufficiency and decreased kidney function
Each pair represents related or overlapping clinical concepts that may be phrased differently in medical documentation while carrying the same core meaning. Handling this kind of variation without losing the clinical signal is exactly the challenge evaluation metrics face when assessing AI outputs in healthcare.
Why Not Just Use Multiple Choice?
Many benchmarks take the easy way out by using multiple-choice questions instead of evaluating open text. But real clinical scenarios rarely present as multiple choice! Clinicians need to interpret complex, nuanced information and generate appropriate responses—exactly what we’re asking AI to do.
Our Experiment
To dig deeper, we conducted a short experiment with:
- 20 short, clinical paragraphs across various specialties
- 10 clinically equivalent variations of each (different phrasing, same meaning)
- 3 clinically altered versions with opposite meanings (we call these mutations)
- Analysis using multiple metrics to see which could reliably distinguish correct from incorrect (a skeleton of this loop is sketched below)
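Here is roughly what that analysis loop looks like in code. The token-level F1 stands in for any of the metrics discussed above, and the "separates" criterion (every equivalent rewrite must outscore every mutation) is our paraphrase of "reliably distinguish", not an official statistic from the experiment.

```python
from collections import Counter

def token_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1, standing in for any metric under evaluation."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    same = sum((Counter(ref) & Counter(cand)).values())
    if same == 0:
        return 0.0
    p, r = same / len(cand), same / len(ref)
    return 2 * p * r / (p + r)

def separates(metric, item) -> bool:
    """A metric cleanly handles an item if every clinically equivalent
    rewrite outscores every clinically opposite mutation."""
    alts = [metric(item["reference"], a) for a in item["alternatives"]]
    muts = [metric(item["reference"], m) for m in item["mutations"]]
    return min(alts) > max(muts)

# One item from the dataset (full structure shown in the appendix).
item = {
    "reference": "Patient's cholesterol levels are abnormally high, increasing risk for stroke and heart attack",
    "alternatives": [
        "The patient's cholesterol levels are elevated, which raises the likelihood of experiencing a stroke or heart attack.",
    ],
    "mutations": [
        "Patient's cholesterol levels are significantly low, minimizing the risk for stroke and heart attack",
    ],
}
print(separates(token_f1, item))  # surface metrics frequently fail this check
```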
A Quick Note Before We Dive In
This blog explores the challenges of evaluating AI responses in healthcare using various metrics. While we’ve used real examples and data visualizations to illustrate our points, this isn’t a peer-reviewed scientific paper advocating for specific methodologies. Rather, we’re sharing practical insights from our exploration to highlight an important issue: traditional text evaluation metrics often fall short when clinical accuracy is at stake. Our goal is to spark thoughtful discussion about how we benchmark AI in healthcare—because when it comes to clinical information, the way we measure “right” and “wrong” truly matters.
What We Found
Our analysis revealed that:
- Surface metrics (ROUGE, F1, BLEU, etc.) performed poorly, often unable to distinguish between clinically accurate and inaccurate responses.
- Semantic metrics (BERTScore, ClinicalBERT) performed better but still showed concerning overlap between correct and incorrect responses.
- Vector-based metrics using domain-specific embeddings showed the most promise for practical use.
- The model-as-a-judge approach (using an AI to evaluate other AIs) aligned surprisingly well with human judgment when properly designed.
Quick Takeaway (if you don’t have time to read further)
Benchmarking AI responses in healthcare isn’t just an academic exercise—it’s a patient safety issue. Before adopting any evaluation framework, understand what it’s actually measuring. The most accessible and most widely used metrics (like BLEU or ROUGE) are often the least reliable for clinical content, while the most effective approaches may require more resources but deliver crucial accuracy.
As we continue integrating AI into healthcare, getting evaluation right isn’t optional—it’s essential. The metrics we choose will determine whether we can trust AI systems to deliver accurate information when it matters most.
If you are still curious, please read on for details…
Findings: Breaking Down Our Metrics Comparison
Figure 1: box plot comparing various similarity metrics between “Alternatives” and “Mutations.” The x-axis represents different metrics, while the y-axis shows the score. Blue boxes represent “Alternative” and red boxes represent “Mutation.” Each box plot displays the distribution of scores for each sentence and metric with median lines and outliers.
Let’s take a closer look at the data behind our findings. This chart shows how different metrics performed when comparing clinically similar alternatives (blue) versus clinically opposite mutations (red).
What This Chart Tells Us
The graphic powerfully illustrates our main point: not all metrics are created equal when evaluating clinical text.
Surface Metrics (Left Side)
Looking at the first five metrics (ROUGE, F1, BLEU, METEOR, Jaccard Index), you can see significant overlap between the blue and red boxes. This visual confirms what we found in our example—traditional metrics often can’t reliably distinguish between clinically accurate and inaccurate content:
- ROUGE and F1: Notice how the red boxes (mutations) frequently overlap with or even rise above the blue boxes (alternatives).
- BLEU: Performs particularly poorly with wide distribution and substantial overlap.
- METEOR: Slightly better separation but still too much overlap to be reliable.
- Jaccard Index: Shows similar problems with inconsistent differentiation.
This overlap is precisely why surface metrics often fail us in healthcare contexts—they’re simply not sensitive to the crucial clinical distinctions.
Semantic Metrics (Middle)
Moving right to BERTScore, ClinicalBERT, and MoverScore, we see improvement:
The blue boxes generally sit higher than the red ones, but there's still concerning overlap between the distributions. ClinicalBERT performs better than the general-purpose BERTScore, though not by the margin we'd hope for in clinical applications.
Vector Metrics and Model-as-Judge (Right Side)
The rightmost metrics show the most promise:
- Cosine Similarity: Clear separation between alternatives and mutations.
- Normalized Euclidean Similarity: Even better differentiation with minimal overlap.
- Normalized Manhattan Similarity: Consistent distinction between clinically valid and invalid content.
- Judge Score: The most dramatic separation, with no overlap between the distributions.
The Big Picture
This visualization brings home our key message: the metrics you choose matter enormously. The rightmost approaches (vector-based metrics and model-as-a-judge) show the clearest differentiation between clinically similar and opposite statements—precisely what’s needed for reliable healthcare AI evaluation.
Note how the Judge Score (far right) shows the most dramatic separation, with the blue box hovering near perfect scores while mutations cluster distinctly lower without any overlap. This supports our hypothesis that properly designed “model-as-a-judge” approaches align best with human clinical judgment.
The progression from left to right visually represents the evolution of evaluation methods—from simple word-matching techniques that frequently fail to sophisticated approaches that can detect subtle but critical clinical distinctions.
For healthcare applications, this data makes a compelling case for investing in more advanced evaluation methods, even when they require additional effort and computational resources. When patient safety is at stake, the ability to reliably distinguish between clinically accurate and inaccurate content isn’t just nice to have—it’s essential.
Other Findings and Considerations
Beyond Our Study: The Evolving Landscape of AI Evaluation
While our findings clearly favor more sophisticated metrics, the field of AI evaluation continues to evolve rapidly. Let’s consider both practical trade-offs and emerging approaches:
Resource Considerations
Each evaluation approach represents different trade-offs:
- Surface metrics offer computational efficiency and reproducible results but sacrifice clinical accuracy, a dangerous compromise in healthcare.
- Semantic and vector-based metrics strike a better balance between resource demands and performance while still being easily reproducible, making them practical for many healthcare applications.
- Model-as-a-judge approaches provide greater accuracy but require more computational resources. This method scales better than human evaluation, but it is influenced by the prompting and the biases of the judge model, and it requires clearly defined evaluation criteria.
Methods We Didn’t Cover
Our study represents just a slice of available evaluation techniques:
- Human evaluation panels: While expensive and time-consuming, expert clinician review remains the gold standard for critical applications.
- Multi-dimensional evaluation frameworks: Systems that combine multiple metrics to create composite scores addressing different aspects of response quality.
- Task-specific performance metrics: Evaluations based on downstream clinical decision quality rather than text similarity alone.
- Reference-free evaluation: Approaches that assess clinical text quality without requiring comparison to a specific reference answer.
- Evaluating longer texts: Real-world scenarios often involve longer texts with multiple details, which increases evaluation complexity but is essential for meaningful results.
The Road Ahead
AI evaluation is advancing quickly, with several promising developments on the horizon:
- Counterfactual testing: Systematically testing models with subtle variations to identify clinical reasoning failures.
- Self-consistency evaluation: Checking whether AI responses remain clinically accurate across different phrasings of the same question.
- Alignment verification techniques: Methods to ensure models consistently prioritize clinical accuracy over linguistic fluency.
- Hybrid human-AI evaluation pipelines: Systems that use AI for initial screening but escalate edge cases to human experts.
Conclusion: Choose Your Metrics Wisely
As AI continues its march into healthcare, the methods we use to evaluate these systems matter profoundly. Even our small experiment demonstrates that traditional metrics often fail to capture critical clinical distinctions—potentially allowing dangerous errors to slip through evaluation frameworks that look robust on paper.
Figure 2: Different evaluation options offered in Azure AI Foundry, including the Similarity Evaluator (model-based)
The most reliable approaches, like vector-based metrics and carefully designed model-as-judge frameworks, require greater investment but deliver the accuracy needed for high-stakes healthcare applications. As these methods continue to mature, we expect evaluation systems to become both more accurate and more resource-efficient.
For teams working with healthcare AI today, the message is clear: understand what your evaluation metrics are actually measuring, choose methods appropriate to the clinical sensitivity of your application, and don't rely solely on traditional NLP metrics that may miss critical clinical errors. Microsoft offers evaluation methods in Azure AI Foundry and Azure ML. We will keep providing tools tailored to healthcare partners and customers so you can benchmark, annotate, and evaluate any dataset, including private ones, within your own data estates and endpoints. Stay tuned for more information coming soon!
Figure 3: Azure AI Foundry also provides a simple interface for human / manual evaluation
The future of AI in healthcare depends not just on building powerful models, but on our ability to rigorously evaluate whether they’re delivering clinically sound information when it matters most. With thoughtful approaches to evaluation, we can harness AI’s potential while maintaining the high standards patients deserve.
Appendix
Summary of mutation scores compared to alternatives
Figures 4, 5, and 6: Judge Score (using the Phi-4 model as a judge) has 100% alignment with human evaluation, with vector metrics showing the next-best alignment.
Sentence Similarity Distribution Across Metrics
Figures 7 and 8: Distribution of scores for similar texts (blue) and mutated, clinically opposite texts (red) for each of the 20 reference paragraphs, grouped by metric class.
Referenced Tools and Resources
- Comprehensive evaluation framework for LLM-generated content
- Embeddings: kronos483/MedEmbed-large-v0.1:latest (https://huggingface.co/abhinand/MedEmbed-large-v0.1)
- Similar sentences generated with GPT-4o mini, with human review
- Model-as-a-judge: Phi-4
Prompt used:
system
####### Instructions
Your mission as a medical professional entails meticulous analysis of a clinical paragraph containing one or more sentences verified to be accurate, and comparing it to a series of alternatives that follow.
The reference paragraph is this:
{{reference}}
Please compare each of the following paragraphs to the reference and give it a grade between 0 and 10.
{{alternatives}}
{{mutations}}
Grade of 10 means the sentences are clinically of same meaning, such that a trained medical professional would use the information in the same way to treat a patient as the reference paragraph.
Grade of 0 means the sentences are clinically orthogonal, such that a trained medical professional would use the information to derive a completely different, even harmful treatment.
You should give gradual grades between 0 and 10 as you judge how close the clinical meanings are according to these evaluation criteria.
Your output should be in JSON format, with the reference sentence on top, then the index of each sentence in the order you see them, the sentence being graded, and the grade, like this:
{
  "reference": "this is first reference paragraph",
  "results": [
    {
      "index": "1",
      "sentence": "This is the first sentence to compare",
      "grade": "8"
    },
    {
      "index": "2",
      "sentence": "This is the second sentence to compare",
      "grade": "7"
    },
    ...
  ]
}
{
  "reference": "this is second reference paragraph",
  "results": [
    ...
  ]
}
assistant
Your response:
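For completeness, here is one way the judge call might be wired up. The appendix specifies only the model (Phi-4) and the prompt above; the OpenAI-compatible client, local endpoint URL, and JSON parsing below are our assumptions for illustration.

```python
# pip install openai
import json
from openai import OpenAI

# Assumption: Phi-4 is served behind an OpenAI-compatible endpoint (e.g., a
# local inference server); the URL and model name below are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def judge(system_prompt: str, reference: str,
          alternatives: list[str], mutations: list[str]) -> dict:
    # Fill the {{...}} placeholders from the prompt template above.
    filled = (system_prompt
              .replace("{{reference}}", reference)
              .replace("{{alternatives}}", "\n".join(alternatives))
              .replace("{{mutations}}", "\n".join(mutations)))
    resp = client.chat.completions.create(
        model="phi-4",
        messages=[
            {"role": "system", "content": filled},
            # The template primes the model with "Your response:"; we pass it
            # as a user turn here for simplicity.
            {"role": "user", "content": "Your response:"},
        ],
        temperature=0,  # deterministic grading
    )
    # Assumes the model returns clean JSON, as the prompt instructs.
    return json.loads(resp.choices[0].message.content)
```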
Example from similar clinical sentence dataset
{
  "reference": "Patient's cholesterol levels are abnormally high, increasing risk for stroke and heart attack",
  "alternatives": [
    "The patient's cholesterol levels are elevated, which raises the likelihood of experiencing a stroke or heart attack.",
    "Cholesterol levels in the patient are significantly high, heightening the risk for both stroke and heart attack.",
    "The patient exhibits high cholesterol levels, which may lead to an increased risk of stroke and myocardial infarction.",
    "Abnormal cholesterol levels have been noted in the patient, contributing to a higher risk of stroke and cardiac events.",
    "The patient's cholesterol readings are excessively high, which poses an increased risk for stroke and heart disease.",
    "Elevated cholesterol levels have been observed in the patient, which could potentially increase the chances of stroke and heart attack.",
    "The patient's lipid profile shows high cholesterol levels, thereby increasing the risk for cerebrovascular accidents and myocardial infarctions.",
    "Cholesterol levels in the patient are concerningly high, which could elevate the risk for both stroke and heart attacks.",
    "The patient has been found to have high cholesterol levels, which may put them at greater risk for stroke and heart attack.",
    "High cholesterol levels have been detected in the patient, raising concerns about the potential for stroke and heart attack."
  ],
  "mutations": [
    "Patient's cholesterol levels are significantly low, minimizing the risk for stroke and heart attack",
    "Patient's cholesterol levels are within optimal range, thereby decreasing the likelihood of stroke and heart attack",
    "Patient's cholesterol levels are moderately elevated, warranting routine monitoring but not significantly impacting stroke and heart attack risk"
  ]
}