Top models for speech recognition, text-to-speech, and audio.
Last updated: April 2026
ElevenLabs
Industry-leading text-to-speech with ultra-realistic voice cloning, multilingual support, and emotion control.
OpenAI
Latest speech recognition model with improved accuracy across 100+ languages, real-time streaming, and speaker diarization.
A cost-efficient version of GPT Audio. The new snapshot features an upgraded decoder for more natural-sounding voices and maintains better voice consistency. Pricing is $0.60 per million input tokens and $2.40 per million output tokens.
Battle-tested speech recognition. Widely deployed, well-supported, excellent accuracy.
The gpt-4o-audio-preview model adds support for audio inputs as prompts. This allows the model to detect nuances within audio recordings and add depth to generated user experiences. Audio outputs are currently not supported. Audio tokens are priced at $40 per million input tokens and $80 per million output tokens.
The gpt-audio model is OpenAI's first generally available audio model. The new snapshot features an upgraded decoder for more natural-sounding voices and maintains better voice consistency. Audio is priced at $32 per million input tokens and $64 per million output tokens.
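The per-million-token prices quoted above can be turned into a rough cost estimate for a given workload. The following is a minimal sketch, not an official calculator: the function name `estimate_cost` and the dictionary key "gpt-audio-cost-efficient" are hypothetical labels introduced here (the listing does not give that model's identifier), and the sketch assumes the quoted rates apply uniformly to every token counted, which may not match actual billing.

```python
from decimal import Decimal

# USD per million tokens as (input, output), copied from the listings above.
# "gpt-audio-cost-efficient" is a placeholder label, not an official model ID.
PRICES_PER_MILLION = {
    "gpt-audio-cost-efficient": (Decimal("0.60"), Decimal("2.40")),
    "gpt-4o-audio-preview": (Decimal("40"), Decimal("80")),
    "gpt-audio": (Decimal("32"), Decimal("64")),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return an estimated cost in USD for the given token counts."""
    in_price, out_price = PRICES_PER_MILLION[model]
    million = Decimal(1_000_000)
    # Scale each per-million rate by the number of tokens used.
    return (in_price * input_tokens + out_price * output_tokens) / million


if __name__ == "__main__":
    # 500,000 input and 250,000 output audio tokens on gpt-audio.
    print(estimate_cost("gpt-audio", 500_000, 250_000))
```

Decimal is used instead of float so that small per-token amounts do not accumulate rounding error when many requests are summed.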