By any conventional startup logic, raising a $70 million seed round sounds improbable. Double that improbability for a business just emerging from stealth after being founded in September.
But Gradium, a new foundational voice-AI startup co-founded by former DeepMind and Meta researchers, has never operated by conventional logic. In fact, its founders have spent the past several years building algorithms that now underpin much of voice technology.
“There are a lot of businesses around voice AI now, but developing very strong models for transcription, synthesis, the technological layer of AI, is very difficult,” Neil Zeghidour, co-founder of Gradium, said during a discussion hosted by PYMNTS CEO Karen Webster. “Only a few people in the world know how to do it properly. In our case, we have invented most of the technological steps and algorithms that are powering current technology.”
And now, they are entering the market as a challenger.
“We have to create our space in this very fast,” Zeghidour said, to “make everyone realize how serious we are about being a challenger.”
Still, the company has a mountain to climb when it comes to commercialization. Consumer sentiment about voice assistants has long been ambivalent. Voice is the most intuitive interface of all, yet it all too often feels frustratingly brittle.
“Voice [assistants] have been around for a long time, and I think we’re all frustrated with voice because it’s impossible to have a conversation. … It’s very keyword-driven,” Webster said.
“One of our main theses is that the potential of voice AI is mostly unrealized today. And one reason is because the interaction is too brutal,” Zeghidour agreed.
He described a familiar litany of shortcomings: systems that interrupt users mid-sentence, models that misjudge when someone has finished speaking, synthetic voices that respond with wildly inappropriate emotional tones. Even tasks as simple as making an appointment break down under the weight of latency and inaccuracy.
But with major funding and a polished commercial architecture, Gradium believes that solving these problems demands not an incremental improvement in user experience but a reengineering of the science behind voice AI models, built on four pillars: accuracy, latency, conversational flow and expressive synthesis.
Why Voice AI Feels Broken and How to Fix It
While Gradium’s $70 million seed round is impressive, it still pales in comparison to the billions that tech giants like Amazon, Apple and others have poured into their own, often underwhelming, voice AI systems.
Zeghidour didn’t speculate about the internal models of those companies, but he highlighted the stagnation. “Look at something like transcription — it’s been around for 30 years and it’s still not there.”
This is not an indictment of manpower but of architecture, Zeghidour says: voice assistants are built not only on voice models but on the intelligence of the underlying large language models.
“We are at a point where we can accelerate the progress significantly,” Zeghidour said, stressing that Gradium’s own fundamental breakthrough is algorithmic: a more efficient and powerful audio-language modeling approach, solving for the “voice” layer, which he emphasized the company “invented and is the best at.”
“We got our first revenue in six weeks,” Zeghidour added. “The models were still training and already judged by beta testers as superior to competitors.”
One of the most important questions facing the industry is how voice and visual context will converge. As Webster framed it: People may speak their instructions, but they often need to see something to verify or approve it.
To meet this end-user need and deliver a scalable voice experience, Gradium is taking a pragmatic approach: a cascaded system, in which real-time transcription and synthesis wrap around any text-based or visual-language model.
“You can take a VLM [vision-language model] that understands images,” Zeghidour said, “and we just add our real-time transcription and real-time synthesis. Now you can have a conversation about images.”
This approach uses the text model as the “central processing unit,” with voice as the input and output layers. It is not the most philosophically elegant solution, and Gradium has already demonstrated systems that bypass text entirely, doing speech-to-speech or vision-to-speech modeling. But for customers building commercial systems today, the cascaded method can offer maximal compatibility and speed.
The goal, Zeghidour emphasized, is simple: “A voice layer that turns any text or vision model into a commercial AI.”
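In concrete terms, the cascaded design chains three components: speech-to-text, a language model, and text-to-speech. The Python sketch below illustrates the pattern; every function in it is a hypothetical stand-in, since Gradium has not published its API, and the placeholder bodies exist only to make the flow runnable.

    def transcribe(audio: bytes) -> str:
        # Stand-in for a real-time speech-to-text model.
        return "what is in this picture?"  # placeholder transcript

    def generate(prompt: str, image: bytes | None = None) -> str:
        # Stand-in for the text- or vision-language model acting as the "CPU".
        return "It looks like a receipt for a medical appointment."

    def synthesize(text: str) -> bytes:
        # Stand-in for a real-time text-to-speech model.
        return text.encode("utf-8")  # placeholder audio

    def cascaded_turn(audio_in: bytes, image: bytes | None = None) -> bytes:
        # Voice wraps the language model: speech in, reasoning in text, speech out.
        user_text = transcribe(audio_in)         # speech -> text
        reply_text = generate(user_text, image)  # text (+ optional image) -> text
        return synthesize(reply_text)            # text -> speech

Because the middle step is ordinary text in and text out, any existing text or vision model can be dropped in without retraining, which is the compatibility advantage Zeghidour describes.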
Future of AI Runs Through Voice
Gradium’s go-to-market strategy is unapologetically B2B, targeting customers that include companies building customer-support agents, medical-appointment systems, coaching platforms, e-learning tools, and any enterprise workflow that depends on conversation.
“We sell API access for transcription and synthesis to people building voice agents,” Zeghidour said.
Just as importantly, Gradium believes it can break a long-standing market dichotomy: the choice between high-quality but expensive voice systems and affordable but low-quality ones. The company intends to collapse that tradeoff entirely: quality on par with best-in-class systems, priced like commodity infrastructure.
Webster framed the stakes succinctly. “Voice is the ubiquitous interface … everyone can use it.” And because it’s the most human interface, expectations are higher. “People will use their voice to express themselves, to buy, to ask for help … and it’s profoundly important to get it right.”
The company sees 2026 as the horizon for tackling the deeper technological limitations plaguing voice AI. But the message of its launch is unmistakable: Voice may be about to become the interface layer for the entire AI economy, and Gradium wants to be the infrastructure powering it.
“It was very clear during the fundraising that there are clear expectations about, you know, having a shorter path to revenue for AI companies, even if they’re a foundational model company, even if they’re founded by research scientists from DeepMind and Meta,” Zeghidour said. “There will be things to improve, things to fix with our models, but this is the trajectory that we are on. It’s a path to tackle technological limitations of voice AI that the others have not been able to tackle.”
Source: https://www.pymnts.com/
