
Showcasing Phi-4-Reasoning: A Game-Changer for AI Developers
May 1, 2025

In situations with limited computing, Phi-4-mini-reasoning is an excellent model choice. We can use Microsoft Olive or the Apple MLX Framework to quantize Phi-4-mini-reasoning and deploy it on edge devices such as IoT hardware, laptops, and mobile phones.
Quantization
A model is often difficult to deploy directly to specific hardware, so we reduce its complexity through model quantization. The quantization process inevitably causes some precision loss.
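To make that trade-off concrete, here is a minimal NumPy sketch of symmetric int4 quantization. It illustrates the idea only and is not the algorithm Olive applies; all names in it are made up for this example.

import numpy as np

# Pretend float32 weights from one layer of a model.
weights = np.random.randn(1024).astype(np.float32)

# Symmetric int4: map values onto the 16 integer levels in [-8, 7].
scale = np.abs(weights).max() / 7.0
quantized = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)

# Dequantize to recover the values the model will actually compute with.
dequantized = quantized.astype(np.float32) * scale

# The gap between original and dequantized weights is the precision loss.
print("mean abs error:", np.abs(weights - dequantized).mean())

Int4 storage is roughly 8x smaller than float32, which is what makes edge deployment practical; the printed error is the price paid for it.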
Quantize Phi-4-mini-reasoning using Microsoft Olive
Microsoft Olive is an AI model optimization toolkit for ONNX Runtime. Given a model and target hardware, Olive (short for ONNX LIVE) combines the most appropriate optimization techniques to output the most efficient ONNX model for inference in the cloud or on the edge. We can use Microsoft Olive with Phi-4-mini-reasoning from Azure AI Foundry's Model Catalog to quantize it into an ONNX format model.
- Create your Notebook on Azure ML
- Install Microsoft Olive
pip install git+https://github.com/Microsoft/Olive.git
- Quantize using Microsoft Olive
olive auto-opt --model_name_or_path {Azure Model Catalog path, such as azureml://registries/azureml/models/Phi-4-mini-reasoning/versions/1} --device cpu --provider CPUExecutionProvider --use_model_builder --precision int4 --output_path ./phi-4-14b-reasoning-onnx --log_level 1
- Register your quantized model
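A minimal sketch of registering the quantized model as an Azure ML asset with the azure-ai-ml SDK. It assumes ml_client is an already-authenticated MLClient, that the path matches the --output_path used above, and it registers the model under the name the download step below expects.

from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

# Register the Olive output folder as a custom model asset.
# `ml_client` is assumed to be an authenticated MLClient.
quantized_model = Model(
    path="./phi-4-14b-reasoning-onnx",  # the olive auto-opt --output_path above
    type=AssetTypes.CUSTOM_MODEL,
    name="phi-4-mini-onnx-int4-cpu",    # name referenced by the download step below
    description="Phi-4-mini-reasoning quantized to INT4 ONNX with Microsoft Olive",
)
ml_client.models.create_or_update(quantized_model)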
Download locally and run
Download the ONNX model to your local device
ml_client.models.download("phi-4-mini-onnx-int4-cpu", 1)

Running the ONNX model with onnxruntime-genai

Install onnxruntime-genai (this is the CPU version)

pip install onnxruntime-genai

Run it

import onnxruntime_genai as og

model_folder = "Your ONNX Model Path"

model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

search_options = {}
search_options["max_length"] = 32768

# Phi-4-mini chat format
chat_template = "<|user|>{input}<|end|><|assistant|>"

text = "A school arranges dormitories for students. If each dormitory accommodates 5 people, 4 people cannot live there; if each dormitory accommodates 6 people, one dormitory only has 4 people, and two dormitories are empty. Find the number of students in this grade and the number of dormitories."

prompt = chat_template.format(input=text)

input_tokens = tokenizer.encode(prompt)
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
Get the notebook from the Phi Cookbook: https://aka.ms/phicookbook
Quantize the Phi-4-mini-reasoning model using Apple MLX
Install Apple MLX Framework
pip install -U mlx-lm
Convert the Phi-4-mini-reasoning model through Apple MLX quantization
python -m mlx_lm.convert --hf-path {Phi-4-mini-reasoning Hugging Face id} -q
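For example, assuming the public Hugging Face repo id for the model and network access to download the weights:

python -m mlx_lm.convert --hf-path microsoft/Phi-4-mini-reasoning -q

By default the quantized model is written to ./mlx_model, which the run step below loads.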
Run Phi-4-mini-reasoning with Apple MLX in the terminal
python -m mlx_lm.generate --model ./mlx_model --max-tokens 2048 --prompt "A school arranges dormitories for students. If each dormitory accommodates 5 people, 4 people cannot live there; if each dormitory accommodates 6 people, one dormitory only has 4 people, and two dormitories are empty. Find the number of students in this grade and the number of dormitories." --extra-eos-token "<|end|>" --temp 0.0
Fine-tuning
We can fine-tune Phi-4-mini-reasoning on CoT data from different scenarios to give it reasoning capabilities for those scenarios. Here we use medical CoT data from a public Hugging Face dataset as our example (this is just an example; if you need rigorous medical reasoning, please seek more professional data support).
We can fine-tune on our CoT data in Azure ML.
Fine-tune Phi-4-mini-reasoning using Microsoft Olive in Azure ML
Note: Please use Standard_NC24ads_A100_v4 to run this sample.
- Get data from Hugging Face datasets
pip install datasets

Run this script to get the training data

from datasets import load_dataset

prompt_template = """{}{}{}"""

def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = prompt_template.format(input, cot, output) + ""
        texts.append(text)
    return {
        "text": texts,
    }

# Create the English dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train", trust_remote_code=True)
dataset = dataset.map(formatting_prompts_func, batched=True, remove_columns=["Question", "Complex_CoT", "Response"])
dataset.to_json("en_dataset.jsonl")
Fine-tuning with Microsoft Olive
olive finetune --method lora --model_name_or_path {Azure Model Catalog path, such as azureml://registries/azureml/models/Phi-4-mini-reasoning/versions/1} --trust_remote_code --data_name json --data_files ./en_dataset.jsonl --train_split "train[:16000]" --eval_split "train[16000:19700]" --text_field "text" --max_steps 100 --logging_steps 10 --output_path {Your fine-tuning save path} --log_level 1
Convert the model to ONNX with Microsoft Olive
olive capture-onnx-graph --model_name_or_path {Azure Model Catalog path, such as azureml://registries/azureml/models/Phi-4-mini-reasoning/versions/1} --adapter_path {Your fine-tuning adapter path} --use_model_builder --output_path {Your save onnx path} --log_level 1

olive generate-adapter --model_name_or_path {Your save onnx path} --output_path {Your save onnx adapter path} --log_level 1
Run the model with onnxruntime-genai-cuda
Install the onnxruntime-genai-cuda SDK

pip install onnxruntime-genai-cuda

Run the fine-tuned model

import onnxruntime_genai as og
import numpy as np
import os

model_folder = "./models/phi-4-mini-reasoning/adapter-onnx/model/"

model = og.Model(model_folder)
adapters = og.Adapters(model)
adapters.load("./models/phi-4-mini-reasoning/adapter-onnx/model/adapter_weights.onnx_adapter", "en_medical_reasoning")

tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

search_options = {}
search_options["max_length"] = 200
search_options["past_present_share_buffer"] = False
search_options["temperature"] = 1
search_options["top_k"] = 1

prompt_template = """{}"""

question = """
A 33-year-old woman is brought to the emergency department 15 minutes after being stabbed in the chest with a screwdriver. Given her vital signs of pulse 110/min, respirations 22/min, and blood pressure 90/65 mm Hg, along with the presence of a 5-cm deep stab wound at the upper border of the 8th rib in the left midaxillary line, which anatomical structure in her chest is most likely to be injured?
"""

prompt = prompt_template.format(question, "")

input_tokens = tokenizer.encode(prompt)
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)
generator.set_active_adapter(adapters, "en_medical_reasoning")
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
Fine-tune Phi-4-mini-reasoning using Apple MLX locally on macOS
Note: We recommend using an Apple Silicon device with a minimum of 64GB of memory.
- Get the dataset from Hugging Face datasets
pip install datasets

Run this script to get the training and validation data

from datasets import load_dataset

prompt_template = """{}{}{}"""

def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = prompt_template.format(input, cot, output) + ""
        texts.append(text)
    return {
        "text": texts,
    }

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", trust_remote_code=True)

split_dataset = dataset["train"].train_test_split(test_size=0.2, seed=200)
train_dataset = split_dataset["train"]
validation_dataset = split_dataset["test"]

train_dataset = train_dataset.map(formatting_prompts_func, batched=True, remove_columns=["Question", "Complex_CoT", "Response"])
train_dataset.to_json("./data/train.jsonl")

validation_dataset = validation_dataset.map(formatting_prompts_func, batched=True, remove_columns=["Question", "Complex_CoT", "Response"])
validation_dataset.to_json("./data/valid.jsonl")
Fine-tuning with Apple MLX
python -m mlx_lm.lora --model ./phi-4-mini-reasoning --train --data ./data --iters 100
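If you prefer a standalone fine-tuned model rather than loading adapters at generation time, mlx-lm can fuse the trained adapters back into the base weights. A minimal sketch, assuming the standard mlx_lm.fuse entry point and an illustrative save path:

python -m mlx_lm.fuse --model ./phi-4-mini-reasoning --adapter-path ./adapters --save-path ./phi-4-mini-reasoning-medical

The generation step below instead keeps the adapters separate and passes them via --adapter-path.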
Running the model
! python -m mlx_lm.generate --model ./phi-4-mini-reasoning --adapter-path ./adapters --max-tokens 4096 --prompt "A 54-year-old construction worker with a long history of smoking presents with swelling in his upper extremity and face, along with dilated veins in this region. After conducting a CT scan and venogram of the neck, what is the most likely diagnosis for the cause of these symptoms?" --extra-eos-token "<|end|>"
Get the notebook from the Phi Cookbook: https://aka.ms/phicookbook
We hope this sample has inspired you to use Phi-4-mini-reasoning and Phi-4-reasoning to build industry reasoning solutions for your own scenarios.
Related resources
Phi-4-mini-reasoning Tech Report: https://aka.ms/phi4-mini-reasoning/techreport
Phi-4-Mini-Reasoning Technical Report · microsoft/Phi-4-mini-reasoning
Phi-4-mini-reasoning on Azure AI Foundry: https://aka.ms/phi4-mini-reasoning/azure
Phi-4 Reasoning Blog: https://aka.ms/phi4-mini-reasoning/blog
Phi Cookbook: https://aka.ms/phicookbook
Models
Phi-4 Reasoning: https://huggingface.co/microsoft/Phi-4-reasoning
Phi-4 Reasoning Plus: https://huggingface.co/microsoft/Phi-4-reasoning-plus
Phi-4-mini-reasoning on Hugging Face: https://aka.ms/phi4-mini-reasoning/hf
Phi-4-mini-reasoning on Azure AI Foundry: https://aka.ms/phi4-mini-reasoning/azure
Microsoft Models on Hugging Face
Phi-4 Reasoning Models on Azure AI Foundry