Why ASR Benchmarks Fail in Real-World Scenarios

1,067,348 followers

SOTA ASR models report near-human accuracy on public test sets. So why do they still fail users in the real world?

The answer isn't the models, it's the benchmarks. Widely used benchmarks like LibriSpeech are built on clean, scripted, accent-narrow speech. Real production speech is spontaneous, noisy, accented, and multi-speaker. That gap doesn't show up on a leaderboard. It shows up when your product ships.

There's also a compounding problem: benchmaxxing. Models optimised to climb public leaderboards don't generalise; they're tuned to the test, not the task.

Our new whitepaper lays out how to fix this. We cover:
→ Why current benchmarks systematically overstate real-world ASR performance
→ The evidence: WERs that jump from ~12% on read speech to 42% on casual conversation
→ Our 5-stage methodology for building production-representative speech benchmarks (scoping → contributor sourcing → speech design → recording → transcription)
→ How private, held-out benchmark sets resist benchmaxxing
→ Our partnership with Hugging Face to make the Open ASR Leaderboard a more trustworthy signal

If you're building or evaluating ASR systems, this is the benchmarking gap your eval stack may not be surfacing.

Read the full whitepaper [link in comments]

#SpeechRecognition #ASR #AIBenchmarking #NLP #SpeechAI #MachineLearning #DataQuality
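For readers newer to the metric behind those numbers: word error rate (WER) is the number of word-level substitutions, deletions and insertions needed to turn the ASR output into the reference transcript, divided by the number of words in the reference. Below is a minimal sketch of that calculation. The function name and the example sentences are made up for illustration, and the text normalisation (lowercasing and splitting on whitespace) is an assumption; real scoring pipelines usually normalise more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()  # assumed normalisation: lowercase + whitespace split
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # a reference word is missing from the hypothesis
            insertion = curr[j - 1] + 1  # the hypothesis contains an extra word
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented sentences, not taken from any benchmark.
reference = "the cat sat on the mat"
print(word_error_rate(reference, "the cat sat on mat"))             # one dropped word
print(word_error_rate(reference, "uh the cat sit on the the mat"))  # filler, mishearing, repeat
```

Because insertions count as errors, WER can exceed 100% on disfluent audio, which is one reason spontaneous speech is scored so much more harshly than read speech.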

2 Comments

Appen 1w

https://www.appen.com/whitepapers/production-representative-speech-benchmarks-asr-model-performance


Harsh Gupta 1w

Training a model on clean, scripted speech and then testing it on real users is like studying for the wrong exam. Evaluation quality matters just as much as model quality.


More Relevant Posts

Tammy Ann Haskins
1w
This gets to the heart of something I see come up time and again in Speech AI conversations. Near-human accuracy on a benchmark means very little if that benchmark was built on clean, scripted audio — and your users are speaking spontaneously, with accents, in noisy environments. The jump from ~12% WER on read speech to 42% on casual conversation isn't a model problem. It's a benchmarking problem. And it only shows up after you ship. Really proud of the work Appen has done here — both the whitepaper laying out a practical methodology for production-representative benchmarks, and the partnership with Hugging Face to make the Open ASR Leaderboard a more trustworthy signal. If you're building or evaluating ASR systems, worth a read. 👇 #SpeechAI #ASR #AIBenchmarking #MachineLearning
Tobi Shofodun
3w Edited
During annotation, most people read guidelines. A few people understand them. There's a difference.

Now walk with me 🚶⤵️

I've been doing audio annotation work, and early in the project something happened that made this distinction very real. An edge case kept appearing. Tricky audio. Unusual pattern. Nothing in the documentation.

People did what most people do when they hit a gap: They stopped. Guessed. Or skipped it.

I went looking for the rule behind the rule. Not in the guidelines, which obviously didn't have it. In the project itself. During my research, I asked a few questions 👇

What is this model being trained to do?
What does it need to understand about this type of audio?
If the guidelines had covered this, what would they have said, and why?

That line of thinking got me to an answer I felt confident in. I shared it. Others applied it. A week later, the official guidelines caught up. And word for word, my answer matched.

And here's what I took from that: Rules tell you what to do. Understanding tells you why.

"What" expires at the edge of the guidelines. "Why" works anywhere, including exactly where the rules run out.

In annotation, the guidelines will always have gaps. Real projects move faster than documentation. Edge cases appear before anyone could anticipate them. The annotators who only know the rules are only safe inside them. The annotators who understand the purpose? They're functional everywhere. That's the skill worth building.

#AudioAnnotation #DataAnnotation #AIAnnotation #AITraining #MachineLearning #DataQuality #ArtificialIntelligence #NLP
Md Fahim Sarker
3w
Most people think YouTube captions are just speech-to-text. They're not. There's a 7-step pipeline running behind every video you watch:

1. Audio gets extracted and cleaned.
2. ASR models analyze sound patterns and predict words.
3. A language model then corrects grammar and context.
4. Timestamps sync each word to the audio.
5. Punctuation and formatting are added automatically.
6. Then, optionally, machine translation kicks in.
7. And every user correction feeds back into the system to make it smarter.

Transformer-based ASR, Conformer models, and large language models are all working together just so you can read "Welcome to my channel" at the right second.

Strong accents, noisy backgrounds, overlapping speakers, slang, and music are still the biggest failure points.

The system is really impressive. Understanding the pipeline changes how you think about building for accessibility.

#MachineLearning #NLP #SpeechRecognition #AIEngineering #Accessibility #DeepLearning #YouTubeAI #ASR #ArtificialIntelligence #TechInsights #MLEngineering #DataScience #Google #STT #LLM #Transformers
AIxpanse

15 followers
2w
You'll waste three debugging sessions before you realize your transformer isn't actually looking at what you think it's looking at.

Multi-head attention is just running multiple attention operations in parallel, then concatenating their outputs. That's it. Each head gets its own set of weights to learn different patterns.

But here's what actually matters: a single attention head produces one softmax distribution per token, so every relationship it tracks gets blended into a single set of attention weights. It is hard to cleanly capture "this word relates to the subject" AND "this word relates to the previous sentence's context" with one head, because the two patterns compete for the same probability mass.

So BERT-base uses 12 heads per layer. The largest GPT-3 model uses 96. You're essentially betting that different heads will specialize in different linguistic patterns without you explicitly telling them to.

The honest part? When you actually visualize what each head learned, you might find that head 1 focuses on syntax, head 2 captures positional info, head 3 does something with sentence boundaries, and head 4 is doing... something. Maybe. Nobody really knows if we need 12 or 47 or 6. It's hyperparameter archaeology at this point.

Has anyone actually found a use for per-head analysis in production, or is it all just research theater?

#MultiHeadAttention #Transformers #MachineLearning #NLP
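To make the "parallel heads, then concatenate" description concrete, here is a minimal pure-Python sketch. Everything in it is an assumption for illustration: the helper names, the tiny dimensions, and the random weights stand in for learned parameters. It also stops at the concatenation step described above and leaves out the final output projection that real transformer layers apply afterwards.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """One head: project the tokens, then apply scaled dot-product attention."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, heads):
    """Run every head on the same input and concatenate the per-token outputs."""
    results = [attention_head(x, *head) for head in heads]
    concat = [[value for out, _ in results for value in out[t]] for t in range(len(x))]
    patterns = [weights for _, weights in results]
    return concat, patterns


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy setup: 3 tokens, model width 4, 2 heads of width 2.
rng = random.Random(0)
d_model, d_head, n_heads = 4, 2, 2
tokens = random_matrix(rng, 3, d_model)
heads = [tuple(random_matrix(rng, d_model, d_head) for _ in range(3)) for _ in range(n_heads)]

output, patterns = multi_head_attention(tokens, heads)
for index, weights in enumerate(patterns):
    # Each head has its own attention pattern over the same three tokens.
    print(f"head {index}:", [[round(w, 2) for w in row] for row in weights])
```

Returning the per-head weight matrices alongside the output is a deliberate choice here: those matrices are what you would plot when checking which tokens each head actually attends to.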
Aishwarya sankaranarayanan
4w
How GloVe Captures Word Similarity (and Why CBOW Struggles)

📌 Why does GloVe capture word similarity better than CBOW?

Context: GloVe (Global Vectors for Word Representation) is a word embedding technique that, like CBOW, learns word vectors from a corpus. The key difference lies in what information each one uses: GloVe is fitted to global word co‑occurrence statistics, while CBOW learns from one local context at a time.

👉 CBOW
Looks at local context
Uses a sliding window
Learns word meaning from nearby words only
Think of CBOW like: Learning a person’s personality by talking to them for 5 minutes at a time

👉 GloVe
Looks at global co‑occurrence statistics
Considers how words co‑occur across the entire corpus
Think of GloVe like: Understanding a person by observing all their interactions over years

✅ This global view lets GloVe capture stronger semantic relationships (king–queen, Paris–France).

📌 That’s why similar words end up closer in GloVe embeddings.

#NLP #GloVe #WordEmbeddings #MachineLearning
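The local-versus-global distinction is easier to see in the data each method consumes. The sketch below is illustrative only: the function names and the two-sentence corpus are invented, and it shows just the inputs, not the training of either model. It also uses raw, unweighted counts, which is a simplification.

```python
from collections import Counter


def cbow_examples(tokens, window=2):
    """(context words, target) pairs, one per position: the local view CBOW trains on."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples


def cooccurrence_counts(sentences, window=2):
    """Counts of (word, neighbour) pairs summed over the whole corpus: the global view."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Invented toy corpus.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]

# CBOW sees one window at a time.
print(cbow_examples(corpus[0])[1])

# The co-occurrence table aggregates every window in the corpus before any fitting happens.
counts = cooccurrence_counts(corpus)
print(counts[("king", "rules")], counts[("queen", "rules")], counts[("rules", "the")])
```

In this toy corpus "king" and "queen" end up with identical co-occurrence rows, which is the kind of corpus-wide regularity the post credits for pulling similar words together.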
Naman Lazarus
3w
Are interpretability methods actually useful for downstream tasks? Sparse Autoencoders (SAEs) have become one of the most popular tools in mechanistic interpretability, but recent work by Kantamneni et al. showed that SAE probes consistently fail to outperform simple baselines on real-world probing tasks. This raised a natural question for us: is this an SAE-specific limitation, or does it apply to the broader family of Sparse Dictionary Learning methods? We extended their evaluation framework to two additional architectures (transcoders and crosscoders) and tested all three against baseline probes across 9 datasets and multiple challenging regimes (data scarcity, class imbalance, label noise). The short version: the baseline still wins. But the full picture is more nuanced — transcoders show some interesting properties, and there are narrow settings where SDL methods have a genuine edge. Full paper, blog post, and code coming soon. Stay tuned. #MechanisticInterpretability #AIResearch #NLP #LLMs #Interpretability #MachineLearning
StatQuestions.com

30 followers
4w
Stop fighting for a 2% survey response rate. While you’re waiting for users to fill out a form, you’re ignoring the "Ambient Feedback" already sitting in your support inbox. Traditional surveys often suffer from selection bias: you typically hear from the extremely happy or the extremely frustrated. Your support tickets, however, represent a 100% response rate from the users actively engaging with your product. At StatQuestions, we move beyond simple keyword counting. We utilize advanced NLP and AI-powered theme extraction to mine this data goldmine. By applying machine learning to your existing data flows, we can calculate confusion scores, identify escalation risks, and pinpoint root causes with statistical rigor. We turn a chaotic inbox into a structured, longitudinal feedback loop that tracks changes over time and correlates insights directly to your KPIs. Stop guessing what your customers want based on a tiny sample size. Start analyzing the volume of data you already own. Transform your support inbox at statquestions.com #DataScience #CustomerSuccess #NLP #BusinessIntelligence
Chetan There
3w
🧠 Understanding Q, K, V: The Heart of Attention in Transformers

At the core of every modern LLM lies a simple but powerful idea: attention. And attention is built on three matrices:
👉 Q (Query) – What I’m looking for
👉 K (Key) – What I contain
👉 V (Value) – What I offer

Think of it like this: When a word in a sentence tries to understand its context, it doesn’t look at everything equally. It asks a question (Q) and compares it with all other words' keys (K). The similarity between Q and K determines how much attention to pay.

Then comes the real output: Those attention scores are applied to V (Values), giving a weighted understanding of context.

⚙️ In one line: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, where d_k is the dimension of the key vectors.

💡 Why this matters:
Captures relationships between words (even far apart)
Enables context-aware understanding
Powers models like GPT, BERT, and all modern LLMs

🚀 Real intuition: In the sentence "The animal didn’t cross the street because it was too tired."
👉 “it” attends more to “animal” than “street”
That’s attention in action.

🔍 Key takeaway: Attention is not about looking at everything. It’s about learning what matters, when it matters.

#AI #MachineLearning #DeepLearning #Transformers #LLM #NLP #AttentionMechanism #ArtificialIntelligence #NeuralNetworks #AIModels #GenerativeAI #LLMs #TransformerModels #NaturalLanguageProcessing
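The one-line formula can be traced by hand with a tiny example. What follows is a sketch under loud assumptions: the two-dimensional query, key and value vectors are hand-picked so that "it" lines up with "animal". They are not taken from any trained model, and the function names are made up.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V, computed one query row at a time."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical vectors for three tokens; the numbers are invented.
words = ["animal", "street", "it"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
query_for_it = [[2.0, 0.0]]  # chosen to point in the same direction as the "animal" key

outputs, weights = attention(query_for_it, keys, values)
for word, weight in zip(words, weights[0]):
    print(f"it -> {word}: {weight:.2f}")
```

With these made-up vectors the largest weight lands on "animal", mirroring the example sentence; in a trained model the same arithmetic runs on learned projections of the token embeddings.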
Nikita Gushchin
1w
Happy to share that our new paper, IDLM: Inverse-distilled Diffusion Language Models, has been accepted to ICML 2026! Recently, we presented our ICLR oral paper on Universal Inverse Distillation, introducing a general framework for accelerating matching-based generative models. In IDLM, we extend this idea to discrete diffusion language models. Huge thanks to all co-authors: David Li, Eric Moulines, Ivan Oseledets, Maxim Panov, and Alexander Korotin. Code, checkpoints, and paper are in the reposted announcement.
David Li

PhD in Machine Learning @ MBZUAI | Researcher in GenAI
1w

🚀 IDLM (ICML 2026) is now open-source!

Diffusion Language Models have shown strong text generation quality, but their iterative sampling can make inference slow. In this work, we introduce IDLM, a framework for accelerating discrete diffusion language models through inverse distillation, turning many-step teachers into efficient few-step generators.

🔥 In our experiments, IDLM reduces inference steps by 4×–64× while preserving key generation quality metrics such as entropy and generative perplexity.

We are happy to release everything needed to reproduce and build on the work:
💻 Code: https://lnkd.in/eAnZux4J
🤗 Checkpoints: https://lnkd.in/e8AZ5zBh
📄 Paper: https://lnkd.in/dZCcqEPJ

Huge thanks to my amazing co-authors and collaborators: Nikita Gushchin, Dmitry Abulkhanov, Eric Moulines, Ivan Oseledets, Maxim Panov, and Alexander Korotin.

We hope this helps push diffusion-based language generation toward faster, more practical, and more reproducible systems. Feedback, issues, stars, and experiments are very welcome ⭐

#ICML2026 #MachineLearning #DiffusionModels #LanguageModels #GenerativeAI #OpenSource #DeepLearning #NLP #Reproducibility
HULAT (Human Language and Accessibility Technologies Group)-UC3M

234 followers
1w Edited
Following the #LREC2026 article "A Human-in/on-the-Loop Framework for Accessible Text Generation" by Lourdes Moreno López and Paloma Martínez Fernández, the proposed framework has been implemented in LangGraph as an operational multi-agent pipeline.

🎥 Simulated demo: https://lnkd.in/eynNPnyp

The video below shows a simulated demo of the execution trace. It illustrates how the system:
• analyzes complexity,
• generates alternative simplifications,
• evaluates semantic risk,
• decides whether to retry or escalate,
• and keeps the process traceable and auditable.

For that reason, the framework prioritizes controlled generation, transparency, and governance rather than fully automatic simplification.

🔹 Paper: https://lnkd.in/eb9pWBMR
🔹 Video of the paper presented at #LREC2026: https://lnkd.in/ev4kBEBF

ELRA Language Resources Association HULAT (Human Language and Accessibility Technologies Group)-UC3M Universidad Carlos III de Madrid

#LREC2026 #LangGraph #Accessibility #CognitiveAccessibility #PlainLanguage #EasyToRead #NLP #AI #HumanInTheLoop #HumanOnTheLoop #ExplainableAI #TextSimplification