Latest Developments in Speech Technology


Summary

Speech technology refers to the systems and tools that allow machines to understand, produce, and interact using human language, including voice recognition, synthesis, translation, and conversational AI. The latest developments have made it possible for machines to handle multiple languages, interpret emotions, and enable real-time, natural conversations, opening doors for accessibility, global communication, and personalized interactions.

  • Expand language reach: New voice AI models are now able to support hundreds or even thousands of languages, making communication easier for people from diverse backgrounds.
  • Personalize interactions: Modern speech technology can mimic emotion and tone, allowing conversations to feel more human, and even helping users with disabilities regain their voice.
  • Streamline workflow: Real-time voice transcription and translation tools help businesses automate tasks and improve customer support, saving time and increasing productivity.
Summarized by AI based on LinkedIn member posts
  • Last week in voice AI🔥 The stack is getting deeper, faster, and more operationally critical. Here’s what stood out 👇
    - Krisp launches VIVA 2.0 with Turn Prediction v3 and a first-of-its-kind Interrupt Prediction model, all running on CPU with no transcription required.
    - OpenAI launches three real-time audio models for its API: GPT-Realtime-2 with GPT-5-class reasoning, GPT-Realtime-Translate for live translation across 70+ languages, and GPT-Realtime-Whisper for streaming speech-to-text.
    - Twilio unveils a Conversation Layer at SIGNAL 2026 with persistent Memory, Orchestrator, Intelligence, and open-source Agent Connect for plugging in any AI provider.
    - Inworld AI ships Realtime TTS-2, a frontier voice model that reads user emotion and tone in real time and adapts pacing, softness, and empathy mid-conversation.
    - ServiceNow unveils Otto, a unified conversational AI layer combining Now Assist, Moveworks, and voice agents across every department and system (via Ken Y. for The AI Economy).
    - SoundHound AI launches OASYS, a self-learning agentic platform that auto-builds, orchestrates, and improves voice AI agents from documentation and transcripts.
    - ElevenLabs adds BlackRock, NVIDIA, and Jamie Foxx to its $550M+ Series D as annualized revenue crosses $500M, up from $350M at the end of 2025 (via Ivan Mehta for TechCrunch).
    - Greenhouse Software acquires Ezra AI Labs to bring voice AI interviewing into its ATS as applications per recruiter have spiked over 400% since 2023.
    - Ethos raises $22.75M from a16z for an expert network that onboards 35K people per week through voice AI interviews.
    - 8x8 launches AI Studio in early availability, letting teams describe needs in plain language and deploy voice and digital AI agents without adding vendors.
    - Wispr Flow bets on India as its fastest-growing market with Hinglish dictation support, 2.5M downloads, and 100% month-over-month growth (via Jagmeet Singh for TechCrunch).
    - ElevenLabs powers SpoonLabs’ audio novels, cutting production time from months to hours and launching PodNovel across Korea, Japan, and Taiwan.
    - eGain Corporation launches AI Agent IVA, a knowledge-powered virtual agent that replaces IVR dial trees with natural conversation and 24/7 voice support.
    - gnani.ai hires eight senior execs after its $10M Series B, processing over 30M voice AI calls daily for 200+ enterprise customers in India.
    - Vobiz AI.ai raises $1M seed to build AI-native telephony infrastructure in India with DID provisioning, low-latency SIP trunking, and LLM audio streaming.
    - Twinnin targets $3M seed round for its voice and face cloning marketplace where actors license digital likenesses to studios, backed by Google and NVIDIA.
    - BCM One partners with TD SYNNEX to bring Pure IP voice services and SkySwitch UCaaS to the MSP channel through the distributor’s partner network.
    - AI note-taking earbuds go mainstream as Viaim and Mobvoi ship wireless earbuds that record, transcribe, and summarize meetings.

  • Armand Ruiz (LinkedIn Influencer)

    building AI systems @meta

    207,041 followers

    Most voice AI systems ignore 90% of the world’s languages. Why? Because data is scarce. Meta’s new Omnilingual Speech Recognition suite breaks that cycle. Existing models are trained on internet-rich languages, and that dominates the research loop. Omnilingual can transcribe speech in over 1,600 languages, including 500 that no speech AI has ever supported. This is a glimpse into the next wave of AI: models that don’t assume the internet is the world.

    Highlights:
    – Transcription error rate under 10% for 78% of supported languages
    – In-context learning: adapt to new languages with just a few audio clips
    – Fully open-source: models, data, and the 7B Omnilingual w2v 2.0 foundation

    This isn’t about just recognizing speech. It’s about who gets included. If we can build models that work across dialects, cultures, and scarce data, the future of voice AI in enterprise, customer service, and global markets changes fast.

    - Announcement blog: https://go.meta.me/ff13fa
    - Download Omnilingual ASR: https://lnkd.in/g3w4FqY3
    - Try the Language Exploration Demo: https://lnkd.in/gVzrcdbd
    - Try the Transcription Tool: https://lnkd.in/gRdZuZqP
    - Read the Paper: https://lnkd.in/giKrvniC

  • Gary Monk (LinkedIn Influencer)

    LinkedIn ‘Top Voice’ >> Follow for the Latest Trends, Insights, and Expert Analysis in Digital Health & AI

    47,202 followers

    Brain Implant and AI Let Man with ALS Speak and Sing in Real Time Using His Own Voice:

    🧠 A brain implant and AI decoder have enabled Casey Harrell, a man with ALS, to speak and sing again using a voice that sounds like his own, with near-zero lag.
    🧠 The system captures brain signals from four implanted electrode arrays as Harrell attempts to speak, decoding them into real-time speech with intonation, emphasis, and emotional nuance, down to interjections like “hmm” and “eww.”
    🧠 Unlike earlier BCIs that needed users to mime full sentences, this one works continuously, decoding signals every 10 milliseconds. That allows users to interrupt, express emotion, and feel more included in natural conversation.
    🧠 It even lets Harrell modulate pitch to sing basic melodies and change meaning through intonation, like distinguishing a question from a statement or stressing different words in a sentence.
    🧠 The synthetic voice was trained on recordings of Harrell’s real voice before ALS progressed, making the output feel deeply personal and familiar to him.
    🧠 While listener comprehension is around 60%, the system’s ability to express tone, emotion, and even made-up words marks a major leap beyond monotone speech, and it could adapt to other languages, including tonal ones.

    #healthtech #ai

  • Allys Parsons

    Co-Founder at techire ai. Hiring in AI since ’19 ✌️ Speech AI, TTS, Audio, Multimodal AI & more! Top 200 Women Leaders in Conversational AI ‘23 | No.1 Conversational AI Leader ‘21

    18,158 followers

    2025 was a big year for speech and audio. We saw much more research on full-duplex modeling. SpeechLLMs started to really take off and codec models proved they could stream at low latency. But what does 2026 have in store?

    - Unified any-to-any models: We're seeing a clear shift from pipeline architectures (ASR→LLM→TTS) to single end-to-end models that handle multiple modalities. We've seen this with Qwen2.5-Omni processing text/audio/image/video in one model, and LFM2-Audio handling both ASR and TTS. But the trade-off is real: you gain convenience and lower latency, but specialised models still win on pure accuracy. The question for 2026 is whether unified models can close that quality gap.
    - Production-ready full-duplex: 2025 proved full-duplex works. Now it needs to work at scale. We saw big players pushing systems that can handle simultaneous speech with proper turn-taking, interruption handling, and real-world network conditions. But getting latency consistently under 200ms across millions of users is the engineering headache a lot of companies will be tackling.
    - Evaluation that actually measures conversational quality: We're still measuring success mostly by WER. But that doesn't tell you if a conversation feels natural. New benchmarks like UltraEval-Audio, VoiceBench, and OmniBench are starting to measure turn-taking, emotional appropriateness, and multimodal understanding.

    What research trends do you think we'll see this year? #speech #audio
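
The post above notes that success is still measured mostly by WER (word error rate). As a minimal sketch of what that metric computes, assuming plain whitespace tokenization and no text normalization (both simplifications), WER is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` is illustrative and not taken from any library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the score only counts word substitutions, deletions, and insertions, two systems with the same WER can differ widely in turn-taking or tone, which is the gap the benchmarks named in the post aim to cover.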

  • Aishwarya Srinivasan (LinkedIn Influencer)
    631,886 followers

    Cartesia Sonic-3 is the first AI voice model I’ve seen that nails Hindi perfectly. For years, even the best text-to-speech (TTS) models struggled with Hindi. The rhythm, tonality, and emotional micro-expressions just didn’t sound human and the accent was inaccurate. This model doesn’t just translate Hindi. It is specially trained for it, with precise control over pacing, expressions, and tonality, all rendered in real time.

    Under the hood, Sonic-3 is engineered for low-latency voice generation optimized for conversational AI agents, clocking in at 3–5x faster than OpenAI’s TTS while maintaining superior transcript fidelity. What makes it stand out technically:
    → 𝗚𝗿𝗮𝗻𝘂𝗹𝗮𝗿 𝗰𝗼𝗻𝘁𝗿𝗼𝗹 𝘁𝗮𝗴𝘀 let developers dynamically modulate speed, volume, and emotion inside the transcript itself. ("Can you repeat that slower?" now works in production.)
    → 𝟰𝟮-𝗹𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗺𝘂𝗹𝘁𝗶𝗹𝗶𝗻𝗴𝘂𝗮𝗹 𝗺𝗼𝗱𝗲𝗹 built on a single unified speaker embedding, so one voice can switch between languages like Hindi, Tamil, and English natively while maintaining accent continuity.
    → 𝟯-𝘀𝗲𝗰𝗼𝗻𝗱 𝘃𝗼𝗶𝗰𝗲 𝗰𝗹𝗼𝗻𝗶𝗻𝗴 powered by a low-sample adaptive cloning pipeline that enables instant personalization at scale.
    → 𝗥𝗲𝗮𝗹-𝘁𝗶𝗺𝗲 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝘀𝘁𝗮𝗰𝗸 achieving sub-300 ms end-to-end latency at p90, tuned for live interactions like support agents, NPCs, and healthcare assistants.
    → 𝗙𝗶𝗻𝗲-𝗴𝗿𝗮𝗶𝗻𝗲𝗱 𝘁𝗿𝗮𝗻𝘀𝗰𝗿𝗶𝗽𝘁 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 that handles heteronyms, acronyms, and structured text (emails, IDs, phone numbers), which usually break realism in production systems.

    🎧 Here is an example of me trying Sonic-3’s Hindi. You have to hear it to believe it. If you’re building voice agents, conversational AI, or multimodal assistants, keep an eye on Cartesia. They’ve raised $100M to build the most human-sounding voice models in the world, and Sonic-3 just set a new benchmark for multilingual voice AI. #CartesiaPartner
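
The sub-300 ms figure above is quoted at p90, meaning 90% of requests finish within that time. Below is a minimal sketch of checking such a budget, assuming the nearest-rank percentile definition (one of several in common use). The latency samples and the names `percentile` and `meets_budget` are hypothetical; they are not Cartesia's measurements or API.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


def meets_budget(samples: Sequence[float], p: float, budget_ms: float) -> bool:
    """True if the p-th percentile latency is strictly under the budget."""
    return percentile(samples, p) < budget_ms


# Hypothetical end-to-end latencies in milliseconds (illustrative only).
latencies_ms = [120, 180, 210, 240, 250, 260, 270, 280, 290, 900]

if __name__ == "__main__":
    print(percentile(latencies_ms, 90), meets_budget(latencies_ms, 90, 300))
```

In these made-up samples a single slow request pulls the mean up to the budget itself while the p90 stays under it, which is one reason latency targets tend to be stated as tail percentiles.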

  • Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    42,018 followers

    Voice is the next frontier for AI Agents, but most builders struggle to navigate this rapidly evolving ecosystem. After seeing the challenges firsthand, I've created a comprehensive guide to building voice agents in 2024.

    Three key developments are accelerating this revolution:
    -> Speech-native models: OpenAI's 60% price cut on their Realtime API last week and Google's Gemini 2.0 Realtime release mark a shift from clunky cascading architectures to fluid, natural interactions
    -> Reduced complexity: small teams are now building specialized voice agents reaching substantial ARR, from restaurant order-taking to sales qualification
    -> Mature infrastructure: new developer platforms handle the hard parts (latency, error handling, conversation management), letting builders focus on unique experiences

    For the first time, we have god-like AI systems that truly converse like humans. For builders, this moment is huge. Unlike web or mobile development, voice AI is still being defined, offering fertile ground for those who understand both the technical stack and real-world use cases. With voice agents that can be interrupted and can handle emotional context, we’re leaving behind the era of rule-based, rigid experiences and ushering in a future where AI feels truly conversational.

    This toolkit breaks down:
    -> Foundation layers (speech-to-text, text-to-speech)
    -> Voice AI middleware (speech-to-speech models, agent frameworks)
    -> End-to-end platforms
    -> Evaluation tools and best practices

    Plus, a detailed framework for choosing between full-stack platforms vs. custom builds based on your latency, cost, and control requirements. Post with the full list of packages and tools as well as my framework for choosing your voice agent architecture: https://lnkd.in/g9ebbfX3 Also available as a NotebookLM-powered podcast episode. Go build.

    P.S. I plan to publish concrete guides so follow here and subscribe to my newsletter.

  • Vignesh Kumar (LinkedIn Influencer)

    AI Product & Engineering | Start-up Mentor & Advisor | TEDx & Keynote Speaker | LinkedIn Top Voice ’24 | Building AI Community Pair.AI | Director - Orange Business, Cisco, VMware | Cloud - SaaS & IaaS | kumarvignesh.com

    21,250 followers

    🚀 Can AI generate a podcast or audiobook just by listening to your voice, without ever converting it to text? That’s the promise of a new kind of AI model called Spoken Language Models (SLMs). We might have to call them SpLMs, as we already know SLM as Small Language Models :) Unlike today’s chat-based tools, these models don’t rely on written text at all. They learn directly from raw audio, the way we humans actually speak.

    Let me try to simplify why this could be another significant milestone in the AI model evolution. Today, most AI tools that we use, like ChatGPT or NotebookLM, work with text. Even if you speak to them, your voice is first converted to text. The AI understands the text, generates a response in text, and then reads it aloud. It's not really "understanding" your voice, just translating it. That’s where SLMs are different. They skip the middle step. They take in your speech directly and generate responses as speech: tone, pauses, speaker identity, and all. No text in between.

    By retaining the underlying tone, style, and other key elements of the speech, this opens up a wider scope of application in (to name a few):
    1) Podcasts that sound like real conversations
    2) Audiobooks where the narration flows without glitches
    3) Voice assistants that sound more human and less robotic

    One hurdle was that older models could only generate short clips, maybe 10 seconds. Beyond that, they’d lose track of the story, forget who’s speaking, or just start repeating themselves. The new model, called #SpeechSSM, tackles this head-on. It combines two techniques:
    1) Short-term attention to handle the current sentence
    2) Long-term memory to keep the full story coherent over time
    It also processes audio in parallel chunks, which means it can generate long speech, like 16-minute stories, faster and without losing flow.

    You might be wondering whether NotebookLM already does this. Tools like it are great for summarizing notes, answering questions, and working with structured knowledge, but they still rely on text. Spoken Language Models, on the other hand, are being trained to understand and generate audio directly. Not just "talk", but really "speak". This shift can reshape how we use voice in tech, especially for content creation, education, and even real-time conversations. I believe that we are entering a phase where AI doesn’t just read or write; it talks and listens like us.

    #VoiceAI #SpeechGeneration #ProductThinking #AIForGood

    I write about #artificialintelligence | #technology | #startups | #mentoring | #leadership | #financialindependence

    PS: All views are personal. Vignesh Kumar
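
The text-based flow described in the post (speech is converted to text, a text response is generated, and the text is read aloud) can be sketched as three chained stages. This is an illustrative toy only: `Utterance`, `CascadedAgent`, and the stand-in lambdas are hypothetical and perform no real recognition or synthesis. They exist to show where tone drops out when text is the only thing passed between stages.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Utterance:
    """Toy stand-in for audio: the words plus one paralinguistic attribute."""
    words: str
    tone: str


@dataclass
class CascadedAgent:
    transcribe: Callable[[Utterance], str]   # speech -> text
    respond: Callable[[str], str]            # text -> text
    synthesize: Callable[[str], Utterance]   # text -> speech

    def reply(self, spoken: Utterance) -> Utterance:
        text_in = self.transcribe(spoken)    # only the words survive this step
        text_out = self.respond(text_in)
        return self.synthesize(text_out)


# Hypothetical stages: each one is a placeholder for a real model.
agent = CascadedAgent(
    transcribe=lambda utterance: utterance.words,
    respond=lambda text: f"You said: {text}",
    synthesize=lambda text: Utterance(words=text, tone="neutral"),
)

if __name__ == "__main__":
    print(agent.reply(Utterance(words="hello", tone="excited")))
```

The reply's tone is whatever the synthesis stage chooses, since the input tone never reaches it. A speech-to-speech model of the kind the post describes would take the audio in and produce audio out without that text-only hand-off.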

  • Henry Ajder (LinkedIn Influencer)

    AI and Deepfake Cartographer

    17,061 followers

    The speed of progress in open voice cloning is astonishing. Last week, Kyutai open sourced their impressive text to speech model, TTS (https://lnkd.in/e9Nxd4BF). As the video shows, it's very good at replicating a speaker's unique voice, intonation, and mannerisms from only a 10s sample of voice audio. This comes just weeks after Resemble AI released Chatterbox (https://lnkd.in/exUrgpbd), another powerful open source tool for zero shot voice cloning that has seen rapid adoption.

    The gulf in quality between open source voice cloning models and closed ones used to be the most dramatic of all AI content types. Open tools like Tortoise TTS and Tacotron2 did a decent job, but had nothing on the closed voice cloning tools in terms of speech controllability and expressiveness. Since the beginning of 2025, radical jumps in realism and data efficiency from projects like Kyutai and Chatterbox have reshaped this landscape; the moat is starting to drain.

    A few other open source projects in the voice cloning space also offer notable quality:
    - Nari Labs' Dia TTS (https://lnkd.in/eHKPJmMt)
    - Canopy Labs's Orpheus Speech (https://lnkd.in/evZuKzUp)
    - MiniMax's Speech02 (https://lnkd.in/egf3faSa)
    - CAMB.AI's MARS5 (https://lnkd.in/ecpnxehK)

    These advances being made 'in the open' will no doubt lead to further research and democratised access to voice cloning applications, but also heighten concerns about weaponisation. Identifying the right balance between openness and safety has never been easy, but the growth of powerful open source AI models means the stakes will only climb higher.

  • Dr. Dinesh Chandrasekar DC

    CEO & Founder @ Dinwins Intelligence 1st Consulting | Frontier AI Strategist | Investor | Board Advisor| Nasscom DeepTech ,Telangana AI Mission & HYSEA - Mentor| Alumni of Hitachi, GE, Citigroup & Centific AI | Billion $

    36,581 followers

    #VoiceAI just crossed a line most of us didn’t see coming. Alibaba’s #Qwen3-TTS-1.7B isn’t another “better robot voice.” It sounds… human. Uncomfortably so. Natural tone. Emotional range. Accent control. And it runs in real time on everyday hardware. This isn’t a lab demo locked behind enterprise pricing. It’s fully open-source. Real-time. Usable. What stands out isn’t just the feature list, but what it signals. With a few seconds of reference audio, a voice can be recreated. Emotion is no longer implied; it’s instructed. Latency is low enough for live conversations. Languages are handled with consistency, not patchwork fixes. And the license removes the meter that used to tick with every word spoken. The quiet shock is this: Benchmarks show speaker similarity that rivals, and in some cases exceeds, well-known proprietary voice platforms—on a single GPU. That changes the economics overnight. Voice once meant studios, contracts, and per-minute costs. Now it means open models, local deployment, and fully owned voice systems. For builders, this opens doors that were previously bolted shut: Real-time agents that don’t sound synthetic. Accessibility tools that feel respectful, not mechanical. Learning, gaming, storytelling, and support systems where voice is no longer the bottleneck. The interface just became more human. And that’s exactly where the unease begins. When voices can be copied this easily, sound loses its authority. Audio can no longer stand alone as proof. Impersonation, fraud, and social engineering don’t need better scripts anymore. They just need a familiar voice. This is why risk, verification, and trust systems can no longer be optional layers. They are fast becoming core infrastructure. We are stepping into a phase where: Seeing was already questionable. Now hearing is too. Technology taught machines how to speak with us. The harder task ahead is teaching ourselves how to listen—carefully, critically, and with context. Progress didn’t slow down. 
    It just got a voice.

  • Abhijeet Satani

    Research Scientist | Inventor of Cognitively Operated Systems 🧠 | Neuroscience | Brain Computer Interface (BCI) | Published Author with a BCI patent and several other Patents (mentioned below🔻) and IPRs

    8,888 followers

    A high-performance speech neuroprosthesis, developed by Stanford researchers, decodes attempted speech directly from brain activity, restoring a voice to individuals who have lost the ability to speak.

    Key Findings:
    📍 Rapid and naturalistic decoding: The system translated neural signals into real-time text at 62 words per minute, nearly 3.5× faster than prior BCI systems. This speed brings decoded communication closer to everyday conversation, offering a major leap in usability and responsiveness.
    📍 Robust phoneme mapping and vocabulary range: Impressively, the neuroprosthesis operated with a 125,000-word vocabulary, the largest ever used in speech BCI, while maintaining semantic accuracy. Neural representations of phonemes remained intact even years after speech loss, suggesting the brain’s motor-speech pathways are more persistent than previously assumed.
    📍 Rethinking the neural basis of speech: While traditional models emphasize Broca’s area, this study found that area 6v was more predictive of speech intention. Furthermore, the system successfully decoded both spoken and silently mouthed words, demonstrating that silent articulation retains a reliable neural signature, which is crucial for fatigue-free, discreet communication.

    By Willett et al., Nature, 2023 https://rdcu.be/eyFkC

    Implication: This work marks a major milestone for brain–computer interfaces, bridging neuroscience and assistive technology to restore speech and reshaping our understanding of the brain’s language architecture.

    #BrainComputerInterface #Neuroprosthetics #SpeechNeuroprosthesis #Neuroscience #Stanford #ALS #Neurotech #BCI
