RFT Observability (“Auto-Evals”)
August 1, 2025
Reinforcement Finetuning (RFT) observability provides real-time, in-depth visibility into your RFT job by automatically kicking off evaluations (“auto-evals”) that show detailed finetuning progress at each checkpoint.
What is RFT?
RFT adjusts model weights using a reward model (grader) to score outputs against reference data. The grader’s results are then used to reward or penalize the model response, steering the model’s reasoning direction and quality towards desired outcomes. The benefit of using RFT over supervised fine-tuning (SFT) is that RFT incorporates these reward signals in real time during training, though it is slower and more expensive than SFT.
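To make the grading idea concrete, here is a minimal, illustrative sketch of how a grader can score a model response against a reference and yield a reward signal. This is not Azure's RFT implementation or any Foundry API; the function name and the token-overlap scoring rule are assumptions chosen only for illustration.

```python
# Illustrative only: a toy grader and reward signal, not Azure's RFT training loop.

def toy_grader(response: str, reference: str) -> float:
    """Score a response against a reference answer in [0, 1].

    This toy rule uses simple word overlap; a real grader (string-check,
    model-based, or Python grader) would encode the task's actual success
    criteria.
    """
    response_tokens = set(response.lower().split())
    reference_tokens = set(reference.lower().split())
    if not reference_tokens:
        return 0.0
    return len(response_tokens & reference_tokens) / len(reference_tokens)


# During RFT, each scored response contributes a reward signal that steers the
# model toward higher-scoring behavior (the weight update itself is handled by
# the training service, not by user code).
reward = toy_grader(
    response="Paris is the capital of France.",
    reference="The capital of France is Paris.",
)
print(f"reward signal: {reward:.2f}")
```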
Previously, customers could only evaluate RFT results after a training run had completed. This meant that choosing the wrong grader type or hitting unexpected model behavior (e.g. reward hacking) could waste significant time and budget.
How does RFT Observability help?
Today, we’re making it easier to see what’s happening under the hood for RFT. When an RFT job starts, we automatically create a linked evaluation job. The evaluation exposes intermediate results (prompts and responses) and grades from RFT at each checkpoint step, giving you more visibility into and control over training while the RFT job is running. This means you can now monitor, debug, and steer your RFT model while it is being trained, significantly reducing wasted time and cost.
Where can I find RFT Observability?
On Azure AI Foundry’s ‘Fine-tuning’ page, a new ‘Evaluation’ section appears under ‘Details’ when the customization method is ‘Reinforcement’. This is the linked evaluation job that is automatically created for every RFT finetuning job.
When you click ‘View Report’, you are taken to the evaluation job page, where you will see the grader scores for each checkpoint step. You can also toggle back to the RFT job by clicking ‘Linked finetuning job’ at the top of the page.
When you click ‘Logs’ on the finetuning page for the RFT job, you can track the progress of RFT Observability from its initial creation, including a link to the evaluation job.
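If you prefer to follow the same progress programmatically, a minimal sketch along the following lines can poll the finetuning job and list its events and checkpoints with the openai Python SDK. The calls shown are the SDK's standard fine-tuning methods rather than an RFT-observability-specific API, and the endpoint, API version, and job ID are placeholders you would replace with your own values.

```python
# Sketch: polling an RFT finetuning job's status, events, and checkpoints
# with the openai Python SDK. Endpoint, api_version, and job ID are
# placeholders; grader scores per checkpoint are shown in the linked
# evaluation job's report in the Foundry UI, as described above.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",  # assumption: use the API version your resource supports
)

job_id = "ftjob-..."  # your RFT finetuning job ID

# Overall job status.
job = client.fine_tuning.jobs.retrieve(job_id)
print(job.status)

# Recent training events (the 'Logs' view surfaces the same stream in the UI).
for event in client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=10):
    print(event.created_at, event.message)

# Checkpoints produced so far.
for checkpoint in client.fine_tuning.jobs.checkpoints.list(job_id):
    print(checkpoint.step_number, checkpoint.fine_tuned_model_checkpoint)
```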
Quick Evaluation (Quick Evals) for Stored Completions
Quick Evaluation (Quick Evals) is a feature on Stored Completions that lets users rapidly assess model outputs with minimal set-up. With a single click, users can assess model responses without launching a full evaluation job; there is no need to go through the traditional Evaluation flow of customizing test criteria or preparing prompts/responses, reducing manual work.
Users can compare and evaluate model responses across multiple models without adding separate runs for each model. Quick Evals provides immediate feedback by showing scores and highlights for each output, helping users spot issues promptly. It is ideal for fast iteration, giving instant insight into completion quality without navigating away from the Stored Completions interface.
Python Grader for Evaluation
We are introducing a new type of evaluation test criteria: the Python grader. A Python grader is a custom evaluation function that scores model outputs automatically. The customizable Python code inspects specific parts of the model’s response, such as tool usage, structure, or content, and returns a numeric score (typically between 0 and 1) based on the user’s pre-defined logic.
A Python grader is especially helpful for providing structured, programmable assessments of model outputs, and can support holistic model assessments when combined with other model-based or traditional graders.
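As an illustration, here is a minimal sketch of what such a grader could look like. The grade(sample, item) signature and the field names accessed below are assumptions made for this example only; check the Azure AI Foundry evaluation documentation for the exact contract your Python grader must follow.

```python
# Illustrative sketch of a Python grader. The function name, its (sample, item)
# arguments, and the field names below are assumptions for this example,
# not a documented contract.

def grade(sample: dict, item: dict) -> float:
    """Return a score in [0, 1] for one model response.

    sample: the model output being graded (assumed to expose the text
            under 'output_text').
    item:   the dataset row it was generated from (assumed to carry the
            expected answer under 'reference_answer').
    """
    response = (sample.get("output_text") or "").strip().lower()
    reference = (item.get("reference_answer") or "").strip().lower()

    if not response:
        return 0.0          # empty output fails outright
    if reference and reference in response:
        return 1.0          # reference answer appears verbatim
    # Partial credit for word overlap with the reference.
    overlap = set(response.split()) & set(reference.split())
    return min(len(overlap) / max(len(reference.split()), 1), 1.0)
```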
Developer Tier Now Generally Available
We’re excited to announce the promotion of Developer Tier to General Availability! Since our launch in Public Preview this past //Build (2025), we’ve worked to bring affordable fine-tuned model candidate hosting to all our Azure OpenAI regions as part of AI Foundry.
Today, in conjunction with Global Training (still in Public Preview), AI Foundry customers can complete the full train/test life cycle from over 25 different regions and with all the GPT-4.1-series models.
We’ve seen customers ship new models more confidently, knowing they have a deployment type supporting pre-production workloads, and are excited to continue building new features into the Fine Tuning “Developer Tier” experience in the coming months. Stay tuned because General Availability doesn’t mean we’re done!
In the meantime, learn how to take advantage of Developer Tier via AI Foundry, the Cognitive Services SDK, or the REST API.
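For orientation, here is a rough sketch of what deploying a fine-tuned candidate to Developer Tier could look like through the Azure Cognitive Services management REST API. The subscription, resource group, account, model name, api-version, and especially the sku name "DeveloperTier" are assumptions for illustration; confirm the exact values in the Azure AI Foundry fine-tuning deployment documentation before using this.

```python
# Sketch: deploying a fine-tuned candidate model via the Cognitive Services
# management REST API. All identifiers below are placeholders, and the sku
# name "DeveloperTier" and api-version are assumptions to verify in the docs.
import requests
from azure.identity import DefaultAzureCredential

subscription = "<subscription-id>"
resource_group = "<resource-group>"
account = "<azure-openai-resource>"
deployment_name = "my-ft-candidate"
fine_tuned_model = "gpt-4.1-mini-2025-04-14.ft-<job-suffix>"  # from your finetuning job

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    "https://management.azure.com"
    f"/subscriptions/{subscription}/resourceGroups/{resource_group}"
    f"/providers/Microsoft.CognitiveServices/accounts/{account}"
    f"/deployments/{deployment_name}?api-version=2024-10-01"
)

body = {
    "sku": {"name": "DeveloperTier", "capacity": 1},  # assumed sku name for Developer Tier
    "properties": {
        "model": {"format": "OpenAI", "name": fine_tuned_model, "version": "1"}
    },
}

response = requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"})
print(response.status_code, response.json())
```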