June 6, 2025

Thanks to their massive scale and impressive technical evolution, large language models (LLMs) have become the public face of Generative AI innovation. However, bigger isn’t always better. While LLMs like the ones behind Microsoft Copilot are incredibly capable at a wide range of tasks, less-discussed small language models (SLMs) expand the utility of Gen AI for real-time and edge applications. SLMs can run efficiently on a local device with low power consumption and fast performance, enabling new scenarios and cost models.
SLMs can run on universally available chips like CPUs and GPUs, but their potential really comes alive on Neural Processing Units (NPUs), such as the ones found in Microsoft Surface Copilot+ PCs. NPUs are designed specifically for machine learning workloads, delivering higher performance per watt and better thermal efficiency than CPUs or GPUs [1]. Together, SLMs and NPUs make it possible to run powerful Gen AI workloads efficiently on a laptop, even on battery power or while multitasking.
In this blog, we focus on running SLMs on Snapdragon® X Plus processors on the recently launched Surface Laptop 13-inch, using the Qualcomm® AI Hub, leading to efficient local inference, increased hardware utilization and minimal setup complexity. This is only one of many methods available – before diving into this specific use case, let’s first provide an overview of the possibilities for deploying small language models on Copilot+ PC NPUs.
- Qualcomm AI Engine Direct (QNN) SDK: This approach requires converting SLMs into QNN binaries that can be executed on the NPU. The Qualcomm AI Hub provides a convenient way to compile PyTorch, TensorFlow, or ONNX models into QNN binaries executable by the Qualcomm AI Engine Direct SDK. Many precompiled models are also directly available in the Qualcomm AI Hub, a collection of over 175 pre-optimized models ready for download and integration into your application.
- ONNX Runtime: ONNX Runtime is an open-source inference engine from Microsoft designed to run models in the ONNX format. The QNN Execution Provider (EP) by Qualcomm Technologies optimizes inference on Snapdragon processors using AI acceleration hardware, mainly for mobile and embedded use.
ONNX Runtime Gen AI is a specialized version optimized for generative AI tasks, including transformer-based models, aiming for high-performance inference in applications like large language models. Although ONNX Runtime with the QNN EP can run models on Copilot+ PCs, some operator support is still missing for Gen AI workloads. ONNX Runtime Gen AI is not yet publicly available for the NPU; a private beta is underway, with no public release date announced at the time of writing. For more information on upcoming releases, see the GitHub repo: microsoft/onnxruntime-genai: Generative AI extensions for onnxruntime
- Windows AI Foundry: Windows AI Foundry provides AI-supported features and APIs for Copilot+ PCs. It includes pre-built models such as Phi Silica that can be invoked through the Windows AI APIs. It also offers the ability to download models from the cloud for local inference on the device using Foundry Local. This feature is still in preview. You can learn more about Windows AI Foundry here: Windows AI Foundry | Microsoft Developer
- AI Toolkit for VS Code: The AI Toolkit for Visual Studio Code (VS Code) is a VS Code extension that simplifies generative AI app development by bringing together cutting-edge AI development tools and models from the Azure AI Foundry catalog and other catalogs such as Hugging Face. The platform allows users to download multiple models, either from the cloud or locally. It currently houses several models optimized to run on CPU, with support for NPU-based models forthcoming, starting with DeepSeek R1.
Comparison between different approaches
| Feature | Qualcomm AI Hub | ONNX Runtime (ORT) | Windows AI Foundry | AI Toolkit for VS Code |
|---|---|---|---|---|
| Availability of models | Wide set of AI models (vision, Gen AI, object detection, and audio). | Any model can be integrated. NPU support for Gen AI tasks and ONNX Runtime Gen AI is not yet generally available. | The Phi Silica model is available through the Windows AI APIs; additional AI models can be downloaded from the cloud for local inference using Foundry Local. | Access to models from sources such as Azure AI Foundry and Hugging Face. Currently supports only DeepSeek R1 and Phi 4 Mini models for NPU inference. |
| Ease of development | The API is user-friendly once the initial setup and end-to-end replication are complete. | Simple setup and developer-friendly; however, limited support for custom operators means not all models deploy through ORT. | Easiest framework to adopt; developers familiar with the Windows App SDK face no learning curve. | Intuitive interface for testing models via prompt-response, enabling quick experimentation and performance validation. |
| Processor/SoC independent? | No. Supports Qualcomm Technologies processors only. Models must be compiled and optimized for the specific SoC on the device. A list of supported chipsets is provided, and the resulting .bin files are SoC-specific. | Limitations exist with the QNN EP's HTP backend: only quantized models and those with static shapes are currently supported. | Yes. The tool can operate independently of the SoC. It is part of the broader Windows Copilot Runtime framework, now rebranded as Windows AI Foundry. | Model-dependent. Easily deployable on-device; model download and inference are straightforward. |
As of this writing, and based on our team's research, we found Qualcomm AI Hub to be the most user-friendly and well-supported option currently available; most of the other frameworks are still under development and not yet generally available.
Before we dive into how to use Qualcomm AI Hub to run Small Language Models (SLMs), let’s first understand what Qualcomm AI Hub is.
What is Qualcomm AI Hub?
Qualcomm AI Hub is a platform designed to simplify the deployment of AI models for vision, audio, speech, and text applications on edge devices. It allows users to upload, optimize, and validate their models for specific target hardware—such as CPU, GPU, or NPU—within minutes. Models developed in PyTorch or ONNX are automatically converted for efficient on-device execution using frameworks like TensorFlow Lite, ONNX Runtime, or Qualcomm AI Engine Direct. The Qualcomm AI Hub offers access to a collection of over 100 pre-optimized models, with open-source deployment recipes available on GitHub and Hugging Face. Users can also test and profile these models on real devices with Snapdragon and Qualcomm platforms hosted in the cloud.
In this blog we show how to use Qualcomm AI Hub to obtain QNN context binaries for a model and then use the Qualcomm AI Engine Direct runtime to run those binaries. The context binary is an SoC-specific deployment mechanism: a model compiled for a given device is expected to be deployed to that same device. The format is operating-system agnostic, so the same model can be deployed on Android, Linux, or Windows, but the context binary targets only the NPU. For details on compiling models to other formats, please see the documentation: Overview of Qualcomm AI Hub — qai-hub documentation.
The following case study details the efficient execution of the Phi-3.5 model using optimized, hardware-specific binaries on a Surface Laptop 13-inch powered by the Qualcomm Snapdragon X Plus processor, Hexagon™ NPU, and Qualcomm AI Hub.
Microsoft Surface Engineering Case Study: Running Phi-3.5 Model Locally on Snapdragon X Plus on Surface Laptop 13-inch
This case study details how the Phi-3.5 model was deployed on a Surface Laptop 13-inch powered by the Snapdragon X Plus processor. The study was developed and documented by the Surface DASH team, which specializes in delivering AI/ML solutions to Surface devices and generating data-driven insights through advanced telemetry. Using Qualcomm AI Hub, we obtained precompiled QNN context binaries tailored to the target SoC, enabling efficient local inference. This method maximizes hardware utilization and minimizes setup complexity.
We used a Surface Laptop 13-inch with the Snapdragon X Plus processor as our test device. The steps below apply to the Snapdragon X Plus processor, but the process is similar for other Snapdragon X Series processors and devices; for other processors you may need to download different variants of the desired models from Qualcomm AI Hub. Before you begin to follow along, please check the make and model of your NPU by navigating to Device Manager -> Neural processors.
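If you prefer the command line, the PowerShell sketch below lists NPU-related devices; the name filter is an assumption about how the Hexagon NPU driver is labeled on your machine, so broaden it if nothing matches.

```powershell
# List present PnP devices whose friendly name suggests an NPU / Hexagon accelerator.
# The -match pattern is a guess about driver naming; adjust it if you get no results.
Get-PnpDevice -PresentOnly |
    Where-Object { $_.FriendlyName -match "NPU|Hexagon|Neural" } |
    Select-Object FriendlyName, Status, Class
```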
We also used Visual Studio Code and Python (3.10, 3.11, or 3.12). We ran the steps below with Python 3.11 and recommend using the same version, although a newer Python version should not make a difference.
Before starting, let’s create a new virtual environment in Python as a best practice. Follow the steps to create a new virtual environment here:
https://code.visualstudio.com/docs/python/environments?from=20423#_creating-environments
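For reference, here is a minimal sketch of creating and activating a virtual environment from a PowerShell terminal, assuming Python is installed and on your PATH:

```powershell
# Create a virtual environment in the .venv folder and activate it
python -m venv .venv
.\.venv\Scripts\Activate.ps1

# Confirm which interpreter is now in use
python --version
```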
At a high level, the workflow is: create a folder named 'genie_bundle' to store the config and bin files; download the QNN context binaries specific to your NPU and place them, along with the config files, into the genie_bundle folder; copy the required .dll files from the QNN SDK into genie_bundle; and finally, execute a test prompt through the Genie SDK in the prompt format required by Phi-3.5.
Setup steps in detail
Step 1: Set up the local development environment
Download the QNN SDK: Go to the Qualcomm Software Center (Qualcomm Neural Processing SDK | Qualcomm Developer) and download the QNN SDK by clicking Get Software (the latest version of the SDK is downloaded by default). For the purpose of this demo, we used the latest version available (2.34). You may need to create an account on the Qualcomm website to access it.
Step 2: Download QNN Context Binaries from Qualcomm AI Hub Models
- Download Binaries: Download the context binaries (.bin files) for the Phi-3.5-mini-instruct model from (Link to Download Phi-3.5 context binaries).
- Clone the AI Hub Apps repo: Use the Genie SDK (a generative runtime built on top of Qualcomm AI Engine Direct) and leverage the sample provided in https://github.com/quic/ai-hub-apps
- Set up the folder structure to follow along with the code: Create a folder named “genie_bundle” outside of the folder where the AI Hub Apps repo was cloned, then selectively copy configuration files from the AI Hub sample repo into ‘genie_bundle’ (the copy itself is scripted in Step 3; a PowerShell sketch of the clone and folder setup follows below).
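A minimal sketch of this step in PowerShell, assuming git is installed and you are working from the parent directory that will hold both the cloned repo and genie_bundle:

```powershell
# Clone the AI Hub Apps repo (contains the llm_on_genie tutorial and config templates)
git clone https://github.com/quic/ai-hub-apps.git

# Create the working folder next to (not inside) the cloned repo
New-Item -ItemType Directory -Path "genie_bundle" -Force
```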
Step 3: Copy and edit the config files
- Copy the config files from ai-hub-apps into the genie_bundle folder. You will need two files: the HTP backend config file and the Genie config file. The PowerShell script below copies both from the repo into the local genie_bundle folder created in the previous steps.
```powershell
# Define the source paths
$sourceFile1 = "ai-hub-apps/tutorials/llm_on_genie/configs/htp/htp_backend_ext_config.json.template"
$sourceFile2 = "ai-hub-apps/tutorials/llm_on_genie/configs/genie/phi_3_5_mini_instruct.json"

# Define the local folder path
$localFolder = "genie_bundle"

# Define the destination file paths using the local folder
$destinationFile1 = Join-Path -Path $localFolder -ChildPath "htp_backend_ext_config.json"
$destinationFile2 = Join-Path -Path $localFolder -ChildPath "genie_config.json"

# Create the local folder if it doesn't exist
if (-not (Test-Path -Path $localFolder)) {
    New-Item -ItemType Directory -Path $localFolder
}

# Copy the files to the local folder
Copy-Item -Path $sourceFile1 -Destination $destinationFile1 -Force
Copy-Item -Path $sourceFile2 -Destination $destinationFile2 -Force

Write-Host "Files have been successfully copied to the genie_bundle folder with updated names."
```
- After copying the files, make sure to change the default values of the parameters provided in the copied template files.
- Edit the HTP backend config file (htp_backend_ext_config.json) in its new location: update soc_model and dsp_arch to match your device's configuration.
- Edit genie_config.json to reference the Phi-3.5 context binaries downloaded in the previous steps: update the file paths to match the location of your downloaded .bin files, and explicitly set the ‘use-mmap’ parameter to false.
Step 4: Download the tokenizer file from Hugging Face
- Visit the Hugging Face Website: Open your web browser and go to https://huggingface.co/microsoft/Phi-3.5-mini-instruct/tree/main
- Locate the Tokenizer File: On the Hugging Face page, find the tokenizer file for the Phi-3.5-mini-instruct model
- Download the File: Click on the download button to save the tokenizer file to your computer
- Save the File: Move the downloaded tokenizer.json into your genie_bundle folder (or download it directly from the command line, as sketched below).
Note: There is an issue with the tokenizer.json file for the Phi 3.5 mini instruct model, where the output does not break words using spaces. To resolve this, you need to delete lines #192-197 in the tokenizer.json file.
Download the tokenizer file from the Hugging Face repo (Image source: Hugging Face)
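Alternatively, a short PowerShell sketch to fetch the tokenizer directly into genie_bundle; this assumes the standard Hugging Face "resolve" download URL for the file and that you run it from the folder containing genie_bundle:

```powershell
# Download tokenizer.json for Phi-3.5-mini-instruct straight into genie_bundle
Invoke-WebRequest `
    -Uri "https://huggingface.co/microsoft/Phi-3.5-mini-instruct/resolve/main/tokenizer.json" `
    -OutFile "genie_bundle\tokenizer.json"
```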
Step 5: Copy files from QNN SDK
- Locate the QNN SDK folder: Open the folder where you installed the QNN SDK in Step 1 and identify the required files. You need to copy the files from the folders listed below (exact folder names may change with the SDK version); a PowerShell sketch follows this list.
- /qairt/2.34.0.250424/lib/hexagon-v75/unsigned
- /qairt/2.34.0.250424/lib/aarch64-windows-msvc
- /qairt/2.34.0.250424/bin/aarch64-windows-msvc
- Navigate to your genie_bundle folder and paste the copied files there.
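A minimal sketch of the copy in PowerShell, assuming the SDK was extracted under C:\Qualcomm; adjust $sdkRoot, the version folder, and the hexagon-v75 architecture folder to match your install and SoC:

```powershell
# Root of the extracted QNN SDK -- adjust to your install location and SDK version
$sdkRoot = "C:\Qualcomm\qairt\2.34.0.250424"
$dest    = "genie_bundle"

# Hexagon (NPU) skel libraries plus the Windows-on-ARM64 runtime libraries and tools,
# matching the three folders listed above
Copy-Item "$sdkRoot\lib\hexagon-v75\unsigned\*" -Destination $dest -Force
Copy-Item "$sdkRoot\lib\aarch64-windows-msvc\*" -Destination $dest -Force
Copy-Item "$sdkRoot\bin\aarch64-windows-msvc\*" -Destination $dest -Force
```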
Step 6: Execute the Test Prompt
- Open Your Terminal: Navigate to your genie_bundle folder using your terminal or command prompt.
- Run the Command: Copy and paste the following command into your terminal:
./genie-t2t-run.exe -c genie_config.json -p "<|system|>\nYou are an assistant. Provide helpful and brief responses.<|end|>\n<|user|>\nWhat is an NPU?<|end|>\n<|assistant|>\n"
- Check the Output: After running the command, you should see the response from the assistant in your terminal.
This case study demonstrates the process of deploying a small language model (SLM) like Phi-3.5 on a Copilot+ PC using the Hexagon NPU and Qualcomm AI Hub. It outlines the setup steps, tooling, and configuration required for local inference using hardware-specific binaries. As deployment methods mature, this approach highlights a viable path toward efficient, scalable Gen AI execution directly on edge devices.
Snapdragon® and Qualcomm® branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries. Qualcomm, Snapdragon and Hexagon™ are trademarks or registered trademarks of Qualcomm Incorporated.