OpenAI unveils new audio models to redefine voice AI with real-time speech capabilities

OpenAI has unveiled a new suite of audio models to power voice agents, and they are now available to developers around the world. The latest updates mark a major step in voice AI: the company has introduced new tools and models that let developers build voice agents, AI-driven systems capable of real-time speech interactions.

Even though voice is a natural human interface, it remains largely underutilised in today's AI applications. With this slew of updates, OpenAI aims to change that, enabling businesses and developers to build more sophisticated voice agents. These systems can operate on their own, assisting users through spoken interactions across use cases ranging from customer support to language learning.

What’s new?

OpenAI has introduced three main advancements in audio AI: two state-of-the-art speech-to-text models, a new text-to-speech model, and enhancements to the Agents SDK. The new speech-to-text models outperform OpenAI's previous Whisper models in almost all tested languages, with significant improvements in transcription accuracy and efficiency.
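To give a sense of the developer-facing change, the sketch below shows a minimal transcription call with OpenAI's official Python SDK. The model name gpt-4o-transcribe comes from OpenAI's announcement rather than this article, and the audio file name is a placeholder:

```python
# Minimal speech-to-text sketch using OpenAI's Python SDK.
# Assumption: the new model is exposed as "gpt-4o-transcribe"
# (per OpenAI's announcement); "meeting.mp3" is a placeholder file.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
```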

The new text-to-speech model, meanwhile, gives developers precise control over not just what is said but how it is said, making AI-generated speech more expressive. The updated Agents SDK, in turn, makes it easier to convert existing text-based agents into voice-based assistants that support seamless spoken interactions.
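As a rough illustration of that steerability, a text-to-speech call might look like the sketch below. The model name gpt-4o-mini-tts and the instructions field for controlling delivery are taken from OpenAI's announcement, not this article, so treat them as assumptions:

```python
# Minimal text-to-speech sketch with steerable delivery.
# Assumptions: the model is exposed as "gpt-4o-mini-tts" and the
# speech endpoint accepts an "instructions" field for tone and pacing.
from openai import OpenAI

client = OpenAI()

speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Thank you for calling. How can I help you today?",
    instructions="Speak warmly, at a calm and unhurried pace.",
)

# Save the generated audio to disk.
speech.write_to_file("greeting.mp3")
```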

What do voice agents do?

Voice agents function much like text-based AI assistants, except that they operate through speech rather than text. Use cases include customer support, where an AI answers calls and handles queries; language learning, where an AI-powered coach helps users with pronunciation and conversation practice; and accessibility, where voice-controlled assistants serve users with disabilities.

How do you build voice AI?

There are essentially two approaches to building voice AI: speech-to-speech (S2S) and speech-to-text-to-speech (S2T2S). S2S models take spoken input and produce spoken output without an intermediate transcription step; this approach preserves nuances like intonation, emotion, and emphasis. S2T2S models, by contrast, first transcribe speech to text, process the text, and then convert the response back into speech. They are easier to implement, but they can lose those nuances and add latency at each step. OpenAI's latest updates emphasise the advantages of speech-to-speech processing, making AI interactions more natural and fluid.
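To make the trade-off concrete, the chained S2T2S approach can be sketched as three separate calls with OpenAI's Python SDK: transcribe the input, generate a reply, then synthesise it. The model names are taken from OpenAI's announcement and the file names are placeholders; note that each step is its own network round trip, which is where the added latency comes from:

```python
# Sketch of the chained S2T2S approach: transcribe speech, process
# the text with a language model, then synthesise the reply.
# Assumptions: model names per OpenAI's announcement; file names
# are placeholders.
from openai import OpenAI

client = OpenAI()

# 1. Speech to text: transcribe the user's spoken input.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

# 2. Text processing: generate a reply with a text model.
chat = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": transcript.text}],
)
reply = chat.choices[0].message.content

# 3. Text to speech: speak the reply back to the user.
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=reply,
)
speech.write_to_file("reply.mp3")
```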

Source: https://indianexpress.com/