Mastering Agent Governance in Microsoft 365
May 21, 2025What’s new in Microsoft 365 Copilot | May 2025
May 21, 2025Announcement
Along with Kubernetes AI Toolchain Operator (KAITO) on AKS GA release, we are thrilled to announce Public Preview refresh for KAITO on AKS on Azure Local. Customers can now enable KAITO as a cluster extension on AKS enabled by Azure Arc as part of cluster creation or day 2 using Az CLI. The seamless enablement experience makes it easy to get started with LLM deployment and fully consistent with AKS in the cloud. We also invest heavily to reduce frictions in LLM deployment such as recommending the right GPU SKU, validating preset models with GPUs and avoiding Out of Memory errors, etc.
KAITO Use Cases
Many of our lighthouse customers are exploring exciting opportunities to build, deploy and run AI Apps at the edge. We’ve seen many interesting scenarios like Pipeline Leak detection, Shrinkage detection, Factory line optimization or GenAI Assistant across many industry verticals. All these scenarios need a local AI model with edge data to satisfy low latency or regulatory requirements. With one simple command, customers can quickly get started with LLM in the edge-located Kubernetes cluster, and ready to deploy OSS models with OpenAI-compatible endpoints.
Deploy & fine-tune LLM declaratively
With KAITO extension, customers can author a simple YAML for inference workspace in Visual Studio Code or any text editor and deploy a variety of preset models ranging from Phi-4, Mistral, to Qwen with kubectl on any supported GPUs. In addition, customers can deploy any vLLM compatible text generation model from Hugging Face or even private weights models by following custom integration instructions.
You can also customize base LLMs in the edge Kubernetes with Parameter Efficient Fine Tuning (PEFT) using qLoRA or LoRA method, just like the inference workspace deployment with YAML file. For more details, please visit the product documentation and KAITO Jumpstart Drops for more details.
Compare and evaluate LLMs in AI Toolkit
Customers can now use AI Toolkit, a popular extension in Visual Studio Code, to compare and evaluate LLMs whether it’s local or remote endpoint. With AI Toolkit playground and Bulk Run features, you can test and compare LLMs side by side and find out which model fits the best for your edge scenario. In addition, there are many built-in LLM Evaluators such as Coherence, Fluency, or Relevance that can be used to analyze model performance and generate numeric scores. For more details, please visit AI Toolkit Overview document.
Monitor inference metrics in Managed Grafana
The KAITO extension defaults to vLLM inference runtime. With vLLM runtime, customers can now monitor and visualize inference metrics with Azure Managed Prometheus and Azure Managed Grafana. Within a few configuration steps, e.g., enabling the extensions, labeling inference workspace, creating Service Monitor, the vLLM metrics will show up in Azure Monitor Workspace. To visualize them, customers can link the Grafana dashboard to Azure Monitor Workspace and view the metrics using the community dashboard. Please view product document and vLLM metric reference for more details.
Get started today
The landscape of LLM deployment and application is evolving at lightning speed – especially in the world of Kubernetes. With the KAITO extension, we’re aiming to supercharge innovation around LLMs and streamline the journey from ideation to model endpoints to real-world impact. Dive into this blog as well as KAITO Jumpstart Drops to explore how KAITO can help you get up and running quickly on your own edge Kubernetes cluster. We’d love to hear your thoughts – drop your feedback or suggestions in the KAITO OSS Repo!