
June 21, 2025
Selecting the right model for your AI application is more than a technical decision—it’s a foundational step in ensuring trust, compliance, and governance in AI. Today, we are excited to announce the public preview of safety leaderboards within Foundry model leaderboards, helping customers incorporate model safety as a first-class criterion alongside quality, cost, and throughput. This feature introduces three key components to support responsible AI development:
- A dedicated safety leaderboard highlighting the safest models;
- A quality–safety trade-off chart to balance performance and risk;
- Five new scenario-specific leaderboards supporting diverse responsible AI scenarios.
Prioritize safety with the new leaderboard
The safety leaderboard ranks the top models based on their robustness against generating harmful content. This is especially valuable in regulated or high-risk domains—such as healthcare, education, or financial services—where model outputs must meet high safety standards.
To ensure benchmark rigor and relevance, we apply a structured filtering and validation process to select benchmarks. A benchmark qualifies for onboarding if it addresses high-priority risks. For the safety and responsible AI leaderboards, we consider benchmarks that are reliable enough to provide meaningful signals on the targeted safety-related areas of interest. Our current safety leaderboard uses the HarmBench benchmark, which includes prompts designed to elicit harmful behaviors from models. The benchmark covers 7 semantic categories of behaviors:
- Cybercrime & Unauthorized Intrusion
- Chemical & Biological Weapons/Drugs
- Copyright Violations
- Misinformation & Disinformation
- Harassment & Bullying
- Illegal Activities
- General Harm
These 7 categories are organized into three broader functional groupings:
- Standard Harmful Behaviors
- Contextual Harmful Behaviors
- Copyright Violations
Each grouping is featured in a separate responsible AI scenario leaderboard. We use the prompts and evaluators from HarmBench to calculate the Attack Success Rate (ASR) and aggregate it across the functional groupings as a proxy for model safety. Lower ASR values mean that a model is more robust against attacks that attempt to elicit harmful content.
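As a minimal illustration of how such a metric can be aggregated, the sketch below computes a per-grouping Attack Success Rate from hypothetical judge verdicts. The record structure and field names are assumptions made for the example; they are not the HarmBench or Foundry implementation.

```python
from collections import defaultdict

# Hypothetical attack records: each entry notes the functional grouping of the
# prompt and whether an automated judge flagged the model response as harmful.
results = [
    {"grouping": "standard", "attack_succeeded": False},
    {"grouping": "standard", "attack_succeeded": True},
    {"grouping": "contextual", "attack_succeeded": False},
    {"grouping": "copyright", "attack_succeeded": False},
]

def attack_success_rate(records):
    """ASR = successful attacks / total attack attempts (lower is safer)."""
    if not records:
        return 0.0
    return sum(r["attack_succeeded"] for r in records) / len(records)

# Aggregate ASR per functional grouping, mirroring the three scenario leaderboards.
by_grouping = defaultdict(list)
for r in results:
    by_grouping[r["grouping"]].append(r)

for grouping, records in by_grouping.items():
    print(f"{grouping}: ASR = {attack_success_rate(records):.2%}")
```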
We understand and acknowledge that model safety is a complex topic with several dimensions. No single current open-source benchmark can test or represent the full spectrum of model safety in different scenarios. Additionally, most of these benchmarks suffer from saturation or from misalignment between benchmark design and the risk definition, and they can lack clear documentation on how the target risks are conceptualized and operationalized, making it difficult to assess whether the benchmark accurately captures the nuances of those risks. This can lead to either overestimating or underestimating model performance in real-world safety scenarios. While the HarmBench dataset covers a limited set of harmful topics, it can still provide a high-level understanding of safety trends.
Navigate trade-offs with the quality-safety chart
Model selection often involves compromise across multiple criteria. Our new quality–safety trade-off chart helps you make informed decisions by comparing models based on their performance in safety and quality. You can:
- Identify the safest model measured by Attack Success Rate (lower is better) at a given level of quality performance;
- Or choose the highest-performing model in quality (higher is better) that still meets a defined safety threshold.
Together with the quality-cost trade-off chart, you can find the best trade-off between quality, safety, and cost when selecting a model.
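To make the selection logic concrete, here is a small sketch of the two decision rules described above applied to hypothetical leaderboard rows. The field names, scores, and thresholds are illustrative assumptions, not the Foundry leaderboard schema.

```python
# Hypothetical leaderboard rows: quality index (higher is better) and
# Attack Success Rate (lower is better). Values are made up for illustration.
models = [
    {"name": "model-a", "quality": 0.82, "asr": 0.12},
    {"name": "model-b", "quality": 0.78, "asr": 0.04},
    {"name": "model-c", "quality": 0.90, "asr": 0.25},
]

def safest_at_quality(models, min_quality):
    """Rule 1: among models meeting a quality bar, pick the lowest ASR."""
    eligible = [m for m in models if m["quality"] >= min_quality]
    return min(eligible, key=lambda m: m["asr"]) if eligible else None

def best_quality_under_asr(models, max_asr):
    """Rule 2: among models meeting a safety threshold, pick the highest quality."""
    eligible = [m for m in models if m["asr"] <= max_asr]
    return max(eligible, key=lambda m: m["quality"]) if eligible else None

print(safest_at_quality(models, min_quality=0.80))   # -> model-a
print(best_quality_under_asr(models, max_asr=0.15))  # -> model-a
```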
Scenario-based responsible AI leaderboards
To support customers’ diverse responsible AI scenarios, we have added 5 new leaderboards to rank the top models in safety and broader responsible AI scenarios. Each leaderboard is powered by industry-standard public benchmarks covering:
- Model robustness against harmful behaviors using HarmBench in 3 scenarios, targeting standard harmful behaviors, contextually harmful behaviors, and copyright violations:
Consistent with the safety leaderboard, lower ASR scores for a model mean better robustness against generating harmful content.
- Model ability to detect toxic content using the Toxigen benchmark:
This benchmark targets adversarial and implicit hate speech detection. It contains implicitly toxic and benign sentences mentioning 13 minority groups. A higher F1-based accuracy score means a model is better at detecting toxic content (see the sketch after this list).
- Model knowledge of sensitive domains including cybersecurity, biosecurity, and chemical security, using the Weapons of Mass Destruction Proxy benchmark (WMDP):
A higher accuracy score for a model denotes more knowledge of dangerous capabilities.
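As a rough illustration of the F1-based scoring mentioned for the toxic-content detection scenario, the sketch below computes an F1 score from hypothetical toxic/benign labels and model predictions. The labels and predictions are invented for the example and do not reflect any real benchmark data.

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the 'toxic' (1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground-truth labels (1 = toxic, 0 = benign) and model predictions.
labels      = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

# Higher F1 indicates the model is better at flagging implicitly toxic content.
print(f"F1 = {f1_score(labels, predictions):.2f}")
```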
These scenario leaderboards allow developers, compliance teams, and AI governance stakeholders to align model selection with organizational risk tolerance and regulatory expectations.
Building Trustworthy AI Starts with the Right Tools
With safety leaderboards now available in public preview, Foundry model leaderboards offer a unified, transparent, and data-driven foundation for selecting models that align with your safety requirements. This addition empowers teams to move from ad hoc evaluation to principled model selection—anchored in industry-standard benchmarks and responsible AI practices.
To learn more, explore the methodology documentation and start building AI solutions you—and your stakeholders—can trust.