Azure OpenAI has expanded its speech recognition capabilities with two powerful models: GPT-4o-transcribe and GPT-4o-mini-transcribe. These models leverage WebSocket connections to enable real-time transcription of audio streams, giving developers cutting-edge tools for speech-to-text applications. In this technical blog, we’ll explore how these models work and demonstrate a practical implementation in Python.
Understanding OpenAI’s Realtime Transcription API
Unlike the regular REST API for audio transcription, Azure OpenAI’s Realtime API enables continuous streaming of audio data through WebSockets or WebRTC connections. This approach is particularly valuable for applications requiring immediate transcription feedback, such as live captioning, meeting transcription, or voice assistants.
The key difference from a standard Realtime conversation session is that a transcription session typically doesn’t contain responses from the model; it focuses exclusively on converting speech to text in real time.
GPT-4o-transcribe and GPT-4o-mini-transcribe: Feature Overview
Azure OpenAI has introduced two specialized transcription models:
- GPT-4o-transcribe: The full-featured transcription model with high accuracy
- GPT-4o-mini-transcribe: A lighter, faster model with slightly reduced accuracy but lower latency
Both models connect through WebSockets, enabling developers to stream audio directly from microphones or other sources for immediate transcription. These models are designed specifically for the Realtime API infrastructure.
Setting Up the Environment
First, we need to set up our Python environment with the necessary libraries:
import os
import json
import base64
import threading
import pyaudio
import websocket
from dotenv import load_dotenv
load_dotenv('azure.env')  # Load environment variables from azure.env

OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_STT_TTS_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("❌ AZURE_OPENAI_STT_TTS_KEY is missing!")

# WebSocket endpoint for the Azure OpenAI Realtime API (transcription intent)
url = f"{os.environ.get('AZURE_OPENAI_STT_TTS_ENDPOINT').replace('https', 'wss')}/openai/realtime?api-version=2025-04-01-preview&intent=transcription"
headers = {"api-key": OPENAI_API_KEY}
# Audio stream parameters (16-bit PCM, 24 kHz mono)
RATE = 24000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK = 1024

audio_interface = pyaudio.PyAudio()
stream = audio_interface.open(format=FORMAT,
                              channels=CHANNELS,
                              rate=RATE,
                              input=True,
                              frames_per_buffer=CHUNK)
Establishing the WebSocket Connection
The following callback runs once the connection to the Realtime API is open: it configures the transcription session and starts streaming microphone audio.
def on_open(ws):
    print("Connected! Start speaking…")

    # Configure the transcription session as soon as the socket opens
    session_config = {
        "type": "transcription_session.update",
        "session": {
            "input_audio_format": "pcm16",
            "input_audio_transcription": {
                "model": "gpt-4o-mini-transcribe",
                "prompt": "Respond in English."
            },
            "input_audio_noise_reduction": {"type": "near_field"},
            "turn_detection": {"type": "server_vad", "threshold": 0.5, "prefix_padding_ms": 300, "silence_duration_ms": 200}
        }
    }
    ws.send(json.dumps(session_config))

    # Stream microphone audio to the service in a background thread
    def stream_microphone():
        try:
            while ws.keep_running:
                audio_data = stream.read(CHUNK, exception_on_overflow=False)
                audio_base64 = base64.b64encode(audio_data).decode('utf-8')
                ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": audio_base64
                }))
        except Exception as e:
            print("Audio streaming error:", e)
            ws.close()

    threading.Thread(target=stream_microphone, daemon=True).start()
Processing Transcription Results
This section handles the incoming WebSocket messages containing the transcription results:
def on_message(ws, message):
    try:
        data = json.loads(message)
        event_type = data.get("type", "")
        print("Event type:", event_type)
        # print(data)

        # Stream live incremental transcripts
        if event_type == "conversation.item.input_audio_transcription.delta":
            transcript_piece = data.get("delta", "")
            if transcript_piece:
                print(transcript_piece, end=' ', flush=True)

        # Print the completed transcript for the current speech segment
        if event_type == "conversation.item.input_audio_transcription.completed":
            print(data["transcript"])

        if event_type == "item":
            transcript = data.get("item", "")
            if transcript:
                print("\nFinal transcript:", transcript)
    except Exception:
        pass  # Ignore unrelated events
Error Handling and Cleanup
To ensure proper resource management, we implement handlers for errors and connection closing:
def on_error(ws, error):
    print("WebSocket error:", error)

def on_close(ws, close_status_code, close_msg):
    print("Disconnected from server.")
    stream.stop_stream()
    stream.close()
    audio_interface.terminate()
Running the WebSocket Client
Finally, this code initiates the WebSocket connection and starts the transcription process:
print("Connecting to OpenAI Realtime API…")
ws_app = websocket.WebSocketApp(
    url,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close
)
ws_app.run_forever()
Analyzing the Implementation Details
Session Configuration
Let’s break down the key components of the session configuration:
- input_audio_format: Specifies "pcm16" for 16-bit PCM audio
- input_audio_transcription:
  - model: Specifies "gpt-4o-mini-transcribe" (replace with "gpt-4o-transcribe" for higher accuracy)
  - prompt: Provides guidance to the model ("Respond in English.")
  - language: Optionally pins the transcription language with a code such as "hi" (Hindi); leave it null to let the model detect the language automatically
- input_audio_noise_reduction: Type of noise reduction. near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference-room microphones
- turn_detection: Configures how speech turns are detected, either Server VAD or Semantic VAD. This can be set to null to turn turn detection off, in which case the client must commit the audio buffer manually. Server VAD detects the start and end of speech based on audio volume and finalizes the transcript at the end of user speech. Semantic VAD is more advanced: it uses a turn-detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on that probability. For example, if the user’s audio trails off with "uhhm", the model scores a low probability of turn end and waits longer for the user to continue speaking. This can make conversations feel more natural but may add latency. A variant session configuration using these options is sketched below.
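For example, a variant of the payload above could switch to the higher-accuracy model, far-field noise reduction, a pinned language, and semantic VAD. The field names follow the transcription_session.update message shown earlier; the semantic_vad "eagerness" setting is taken from the public Realtime API documentation and may vary by API version, so treat this as a sketch rather than a verified configuration.

# Alternative session configuration (sketch): higher-accuracy model,
# far-field microphone, Hindi transcription, semantic turn detection.
# "eagerness" is assumed from the public Realtime API docs; drop or adjust
# it if your API version does not accept the field.
alt_session_config = {
    "type": "transcription_session.update",
    "session": {
        "input_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "prompt": "Transcribe exactly what is spoken.",
            "language": "hi"
        },
        "input_audio_noise_reduction": {"type": "far_field"},
        "turn_detection": {"type": "semantic_vad", "eagerness": "auto"}
    }
}

# Send it inside on_open instead of session_config:
# ws.send(json.dumps(alt_session_config))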
Audio Streaming
The implementation uses a threaded approach to continuously stream audio data from the microphone to the WebSocket connection. Each chunk of audio is:
- Read from the microphone
- Encoded to base64
- Sent as a JSON message with the "input_audio_buffer.append" event type (a file-based variant of this loop is sketched below)
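The same append loop can also be driven from pre-recorded audio. The sketch below assumes a raw 16-bit, 24 kHz, mono PCM file (the file name and chunk pacing are illustrative) and uses the input_audio_buffer.commit client event to signal the end of input when turn detection is disabled.

import time

def stream_pcm_file(ws, path="sample_24khz_mono.pcm", chunk_bytes=4800):
    """Send a raw PCM16 (24 kHz, mono) file to the transcription session."""
    with open(path, "rb") as f:
        while ws.keep_running:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode("utf-8")
            }))
            time.sleep(0.1)  # roughly real-time pacing for 4800-byte chunks
    # With turn detection disabled, explicitly commit the buffered audio
    ws.send(json.dumps({"type": "input_audio_buffer.commit"}))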
Transcription Events
The system processes several types of events from the WebSocket connection:
- conversation.item.input_audio_transcription.delta: Incremental updates to the transcription
- conversation.item.input_audio_transcription.completed: Complete transcripts for a segment
- item: Final transcription results (a handler that accumulates these events is sketched below)
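If you need to keep the transcript rather than just print it, the delta and completed events can be accumulated and persisted. The following is a minimal sketch of an alternative on_message handler under that assumption; the transcript.txt file name is illustrative.

completed_segments = []

def on_message(ws, message):
    try:
        data = json.loads(message)
        event_type = data.get("type", "")

        # Show live partial text as it arrives
        if event_type == "conversation.item.input_audio_transcription.delta":
            print(data.get("delta", ""), end=' ', flush=True)

        # Store each finished segment and append it to a file
        elif event_type == "conversation.item.input_audio_transcription.completed":
            segment = data.get("transcript", "")
            completed_segments.append(segment)
            with open("transcript.txt", "a", encoding="utf-8") as f:
                f.write(segment + "\n")
    except Exception:
        pass  # Ignore unrelated events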
Customization Options
The example code can be customized in several ways:
- Switch between models (gpt-4o-transcribe or gpt-4o-mini-transcribe)
- Adjust audio parameters (sample rate, channels, chunk size)
- Modify the prompt to provide context or language preferences
- Configure noise reduction for different environments
- Adjust turn detection for different speaking patterns (a helper that bundles these choices is sketched after this list)
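One convenient way to expose these options is a small configuration builder. The build_session_config function below is a hypothetical helper (not part of any SDK) that assembles the same transcription_session.update payload used earlier from a few parameters.

def build_session_config(model="gpt-4o-mini-transcribe",
                         prompt="Respond in English.",
                         language=None,
                         noise_reduction="near_field",
                         vad_type="server_vad"):
    """Hypothetical helper that assembles a transcription_session.update payload."""
    transcription = {"model": model, "prompt": prompt}
    if language:
        transcription["language"] = language

    turn_detection = {"type": vad_type}
    if vad_type == "server_vad":
        # Volume-based VAD accepts threshold and padding settings
        turn_detection.update({"threshold": 0.5, "prefix_padding_ms": 300,
                               "silence_duration_ms": 200})

    return {
        "type": "transcription_session.update",
        "session": {
            "input_audio_format": "pcm16",
            "input_audio_transcription": transcription,
            "input_audio_noise_reduction": {"type": noise_reduction},
            "turn_detection": turn_detection
        }
    }

# Example: higher-accuracy model with far-field noise reduction
# ws.send(json.dumps(build_session_config(model="gpt-4o-transcribe",
#                                         noise_reduction="far_field")))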
Deployment Considerations
When deploying this solution in production, consider:
- Authentication: Securely store and retrieve API keys
- Error handling: Implement robust reconnection logic (see the reconnection sketch after this list)
- Performance: Optimize audio parameters for your use case
- Rate limits: Be aware of Azure OpenAI’s rate limits for the Realtime API
- Fallback strategies: Implement fallbacks for connection drops
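As a starting point for the reconnection item above, here is a minimal sketch that wraps run_forever in a retry loop with exponential backoff. It reuses the url, headers, and callbacks defined earlier and assumes that simply reopening the socket is acceptable; production code would also need to reopen the PyAudio stream, since on_close releases it.

import time

MAX_RETRIES = 5

def run_with_reconnect():
    """Retry the WebSocket connection with exponential backoff (sketch)."""
    for attempt in range(MAX_RETRIES):
        ws_app = websocket.WebSocketApp(
            url,
            header=headers,
            on_open=on_open,
            on_message=on_message,
            on_error=on_error,
            on_close=on_close
        )
        ws_app.run_forever()      # Blocks until the connection drops
        backoff = 2 ** attempt    # 1s, 2s, 4s, 8s, 16s
        print(f"Connection closed; retrying in {backoff}s "
              f"({attempt + 1}/{MAX_RETRIES})…")
        time.sleep(backoff)

run_with_reconnect()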
Conclusion
GPT-4o-transcribe and GPT-4o-mini-transcribe represent significant advances in real-time speech recognition technology. By leveraging WebSockets for continuous audio streaming, these models enable developers to build responsive speech-to-text applications with minimal latency.
The implementation showcased in this blog demonstrates how to quickly set up a real-time transcription system using Python. This foundation can be extended for various applications, from live captioning and meeting transcription to voice-controlled interfaces and accessibility tools.
As these models continue to evolve, we can expect even better accuracy and performance, opening up new possibilities for speech recognition applications across industries.
Remember that when implementing these APIs in production environments, you should follow Azure OpenAI’s best practices for API security, including proper authentication and keeping your API keys secure.
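One way to avoid handling raw API keys is Microsoft Entra ID authentication via the azure-identity package. The sketch below assumes the realtime WebSocket endpoint accepts bearer tokens scoped to https://cognitiveservices.azure.com/.default, as Azure OpenAI’s REST endpoints do; verify this for your API version before relying on it.

from azure.identity import DefaultAzureCredential

# Acquire a short-lived Entra ID token instead of using an API key
credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")

# Pass the token as a bearer header when opening the WebSocket
auth_headers = {"Authorization": f"Bearer {token.token}"}
ws_app = websocket.WebSocketApp(
    url,
    header=auth_headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close
)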
Here is the link to the end-to-end code.
Thanks
Manoranjan Rajguru
https://www.linkedin.com/in/manoranjan-rajguru/