Appen

IT Services and IT Consulting

Kirkland, Washington · 1,067,404 followers

Appen is your trusted data partner, powering cutting-edge AI applications for the world's most innovative companies.

About us

Appen has been a leader in AI training data for over 25 years, providing high-quality, diverse datasets that power the world's leading AI models. Our end-to-end platform, deep expertise, and scalable human-in-the-loop services enable AI innovators to build and optimize cutting-edge models.

We specialize in creating bespoke, human-generated data to train, fine-tune, and evaluate AI models across multiple domains, including generative AI, large language models (LLMs), computer vision, speech recognition, and more. Our solutions support critical AI functions such as supervised fine-tuning, reinforcement learning from human feedback (RLHF), model evaluation, and bias mitigation. Our advanced AI-assisted data annotation platform, combined with a global crowd of more than 1M contributors in over 200 countries, ensures the delivery of accurate and diverse datasets. Our commitment to quality, scalability, and ethical AI practices makes Appen a trusted partner for enterprises aiming to develop and deploy effective AI solutions.

At Appen, we foster a culture of innovation, collaboration, and excellence. We value curiosity, accountability, and a commitment to delivering the highest-quality AI solutions. We support work-life balance with flexible work arrangements and a dynamic, results-driven environment. Employees have access to competitive pay, comprehensive benefits, and opportunities for continuous learning and career growth. Our team works closely with the world’s top technology companies and enterprises, tackling exciting challenges and shaping the future of artificial intelligence.

Website
http://appen.com
Industry
IT Services and IT Consulting
Company size
501-1,000 employees
Headquarters
Kirkland, Washington
Type
Public Company
Founded
1996
Specialties
Search, Annotation, Evaluation, Personalization, Transcription, Spam Detection, Translation and Localization, Data Collection, training data, artificial intelligence, machine learning, data preparation, model evaluation, datasets, computer vision, natural language processing, LLM, and generative AI

Updates

  • Appen

    Great discussion at SlatorCon London 2026 around the evolving role of human expertise in AI development.

    One theme that came up repeatedly: as models become more capable, the bottleneck is no longer just data volume. It’s evaluation quality, domain expertise, and the ability to generate meaningful feedback signals for increasingly complex systems.

    A few areas Sergio Bruccoleri, VP, Delivery, Appen, touched on during the panel:

    • Evaluation is shifting beyond static benchmarks toward more dynamic, environment-based assessment
    • Domains like coding, STEM, legal, and finance require deeper subject matter expertise to properly evaluate reasoning quality and edge cases
    • As agentic systems evolve, high-quality human feedback becomes increasingly important for alignment, reliability, and model behavior in production settings

    Interesting conversations across the broader language AI ecosystem on where training, evaluation, and human-in-the-loop systems are headed next.

  • Appen

    Appen recently completed an independent third-party evaluation of Subquadratic's SSA (Sparse Self-Attention) kernel.

    The core architectural claim: replacing O(n²) full self-attention with a learned sparse formulation that routes computation only to the most relevant key-value blocks, enabling near-linear scaling as context length increases.

    What we measured:

    Efficiency (NVIDIA B200, bfloat16, PyTorch 2.11.0)
    - 56.2× wall-clock speedup vs. FlashAttention-2 at 1M tokens
    - 62.8× FLOP reduction vs. dense attention at 1M tokens
    - FLOP counts independently validated via torch.profiler (within 0.7–3.9% of theoretical)

    Long-context retrieval: RULER at 128K tokens
    - 95.6% average score across all evaluated tasks (LLM-judged via Claude Opus 4.6)
    - Perfect retrieval on all single-needle tasks

    Ultra-long context: MRCR at 512K–1M token context lengths
    - 86.2% average score on the hardest 8-needle retrieval bucket

    Coding: SWE-Bench Verified
    - 81.8% resolved rate with extended thinking enabled

    Evaluation was conducted independently with access scoped to API endpoints only. No model weights, training data, fine-tuning configurations, or ground-truth labels were provided in advance.

    The efficiency scaling results are particularly notable.

    Full report: https://lnkd.in/e-Q4FrEt

    #LongContextLLM #AttentionMechanism #AIBenchmarking #SparseAttention #NLP #MachineLearning
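
To make the O(n²) versus near-linear comparison concrete, here is a back-of-the-envelope sketch in Python. It is not Subquadratic's kernel or Appen's measurement code. It assumes a simplified cost model (only the two attention matrix multiplications are counted, with softmax and any routing overhead ignored), and `keys_per_query`, `head_dim`, and all function names are hypothetical, chosen only for illustration.

```python
def dense_attention_flops(seq_len: int, head_dim: int, num_heads: int = 1) -> int:
    """Approximate forward FLOPs of full self-attention.

    Q @ K^T costs about 2*n*n*d FLOPs and the weighted sum over V costs
    another 2*n*n*d (one multiply-add counted as 2 FLOPs). Softmax is ignored.
    """
    return 4 * seq_len * seq_len * head_dim * num_heads


def sparse_attention_flops(seq_len: int, keys_per_query: int,
                           head_dim: int, num_heads: int = 1) -> int:
    """Approximate forward FLOPs when each query attends to a fixed key budget.

    keys_per_query is a hypothetical budget (for example, a few selected
    key-value blocks). The cost of choosing those blocks is NOT modeled.
    """
    attended = min(keys_per_query, seq_len)  # cannot attend to more keys than exist
    return 4 * seq_len * attended * head_dim * num_heads


def flop_reduction(seq_len: int, keys_per_query: int, head_dim: int = 128) -> float:
    """Ratio of dense to sparse attention FLOPs under the model above."""
    dense = dense_attention_flops(seq_len, head_dim)
    sparse = sparse_attention_flops(seq_len, keys_per_query, head_dim)
    return dense / sparse


if __name__ == "__main__":
    # Illustrative context lengths with an assumed budget of 8,192 keys per query.
    for n in (131_072, 1_048_576):
        print(f"{n:>9} tokens: {flop_reduction(n, 8_192):.0f}x fewer attention FLOPs")
```

Under this simplified model the reduction equals `seq_len / keys_per_query`, so with a fixed budget it grows linearly with context length. That is the general shape behind a "near-linear scaling" claim; the measured figures in the post come from the evaluation itself, not from this sketch.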

  • Appen

    SOTA ASR models report near-human accuracy on public test sets. So why do they still fail users in the real world?

    The answer isn't the models, it's the benchmarks.

    Widely used benchmarks like LibriSpeech are built on clean, scripted, accent-narrow speech. Real production speech is spontaneous, noisy, accented, and multi-speaker. That gap doesn't show up on a leaderboard. It shows up when your product ships.

    There's also a compounding problem: benchmaxxing. Models optimised to climb public leaderboards don't generalise; they're tuned to the test, not the task.

    Our new whitepaper lays out how to fix this. We cover:

    → Why current benchmarks systematically overstate real-world ASR performance
    → The evidence: WERs that jump from ~12% on read speech to 42% on casual conversation
    → Our 5-stage methodology for building production-representative speech benchmarks (scoping → contributor sourcing → speech design → recording → transcription)
    → How private, held-out benchmark sets resist benchmaxxing
    → Our partnership with Hugging Face to make the Open ASR Leaderboard a more trustworthy signal

    If you're building or evaluating ASR systems, this is the benchmarking gap your eval stack may not be surfacing.

    Read the full whitepaper [link in comments]

    #SpeechRecognition #ASR #AIBenchmarking #NLP #SpeechAI #MachineLearning #DataQuality
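
For readers less familiar with the metric, here is a minimal Python sketch of how word error rate (WER) is commonly computed: the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. This is an illustration only, not the whitepaper's or the leaderboard's scoring code. The function name and the example sentences are made up, and the text normalization that real evaluations apply (casing, punctuation, number formatting) is omitted.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    # prev_row[j] is the distance between the reference words processed
    # so far and the first j hypothesis words.
    prev_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr_row = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr_row[j] = min(
                prev_row[j] + 1,                      # deletion
                curr_row[j - 1] + 1,                  # insertion
                prev_row[j - 1] + substitution_cost,  # match or substitution
            )
        prev_row = curr_row

    return prev_row[-1] / len(ref_words)


if __name__ == "__main__":
    # One dropped word and one misrecognized word against a 4-word reference.
    print(word_error_rate("turn the lights off", "turn lights of"))
```

Because the denominator is the reference length, WER can exceed 100% when a hypothesis contains many insertions, which is one reason scores on noisy conversational audio can look so much worse than on read speech.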

  • Appen reposted this

    Good to see Zoom Scribe API is competitive for real world speech recognition benchmarks.

    Steven Zheng

    Machine Learning Engineer @Hugging Face 🤗 | MVA @ENS Paris-Saclay

    Big announcement for speech AI

    Benchmarks get gamed. So we added a repellent.

    The Open ASR Leaderboard now includes private evaluation data from Appen and DataoceanAI, making speech recognition benchmarks more robust against test-set contamination and “benchmaxxing.”

    Better signal. Less overfitting. More real-world ASR.

    Read more 👇 https://lnkd.in/dwTZheD2

  • Appen reposted this

    Data makes all the difference. Especially when it reflects real production conditions.

    One of the biggest challenges in AI today is what we call “benchmaxxing”: training and testing models on the same public datasets, only to see performance drop once models hit the real world.

    That’s exactly why Appen partnered with Hugging Face on the #OpenASR Dashboard initiative.

    As AI solutions become more productized and customer-facing, production readiness matters more than leaderboard scores. Models need to perform under shifting datasets, noisy environments, and real operational constraints, not just in controlled benchmarks.

    In this article, we share:

    • How Appen contributes to the OpenASR Dashboard
    • Our methodology for building evaluation datasets
    • Why production-grade data is critical for trustworthy AI benchmarking
    • How we help ensure models are ranked based on true real-world performance

    Because the future of AI evaluation isn’t just about benchmark accuracy. It’s about reliability in production.

    Link to the article in the comments!

  • Appen reposted this

    Things are shaking up on the Open ASR Leaderboard 🪇

    We added 11 dataset splits to the leaderboard, so what changed?

    Well if you’re just looking at the average WER: nothing. We’ve kept it as the average over standard public benchmarks (AMI, Earnings22, LibriSpeech, etc).

    Public ASR benchmarks are incredibly valuable, but they also come with known limitations:

    • transcription inconsistencies and errors
    • "benchmaxxing", namely optimizing for leaderboard performance rather than real-world robustness

    Some of this can even happen unintentionally, e.g. if pre-trained LLMs used benchmark transcripts or metadata in their training corpora.

    Reliable ASR evaluation is hard. The truth is there’s no single dataset or metric that fully captures real-world performance, and there is no one model to rule them all.

    To improve robustness and better capture these nuances in the Open ASR Leaderboard, we’ve worked with Appen and Dataocean AI to add 11 high-quality, private English datasets spanning:

    • scripted + conversational speech
    • multiple accents (American, Australian, British, Canadian, Indian)

    We added these in a new "Private data" tab on the Open ASR Leaderboard. From the main "Leaderboard" tab, the average remains across the public datasets, and you can toggle on the private sets (or toggle off public sets) to see how it affects the average WER.

    This highlights how dependent ASR rankings are on the choice of evaluation data.

    Trustworthy transcription matters. ASR is often the first component in conversational systems, and transcription failures propagate downstream into LLMs and user experience, which is why evaluation choices matter so much.

    These new datasets are only a step, not a final solution. Relevant and trustworthy evaluation cannot stay static. The leaderboard remains community-driven, and we’d love your feedback, suggestions for improving evaluation, and additional datasets (public or private).

    📝 Blog on the new private sets: https://lnkd.in/eJCeTcsp
    🧑‍💻 GitHub for contributions and suggestions: https://lnkd.in/enNrqUke
    💡 Also highly recommend Dylan Fox’s piece on the limitations of widely-used datasets: https://lnkd.in/eWx3uUc2

    Let’s build more reliable ASR evaluation 🤗
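
The toggle behavior described in this post can be illustrated with a small Python sketch. It is not the leaderboard's actual code: the model names, dataset names, and WER values below are entirely made up, and `rank_models` is a hypothetical helper. It only shows how averaging over a different set of datasets can reorder the same models.

```python
from statistics import mean

# Hypothetical per-dataset WERs (as fractions). None of these are real results.
WER_BY_MODEL = {
    "model_a": {"public_read": 0.05, "public_meetings": 0.10, "private_conversational": 0.30},
    "model_b": {"public_read": 0.06, "public_meetings": 0.12, "private_conversational": 0.15},
}

PUBLIC_SETS = ["public_read", "public_meetings"]
ALL_SETS = PUBLIC_SETS + ["private_conversational"]


def rank_models(wer_by_model: dict[str, dict[str, float]], datasets: list[str]) -> list[str]:
    """Return model names ordered by average WER over the chosen datasets (best first)."""
    averages = {
        model: mean(scores[name] for name in datasets)
        for model, scores in wer_by_model.items()
    }
    return sorted(averages, key=averages.get)


if __name__ == "__main__":
    print("public only:   ", rank_models(WER_BY_MODEL, PUBLIC_SETS))
    print("public+private:", rank_models(WER_BY_MODEL, ALL_SETS))
```

In this invented example, the model that leads on the public sets falls behind once the private conversational set is included in the average. That is the kind of shift the "Private data" tab is meant to surface.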

  • Appen

    Seven private English ASR datasets by Appen are now powering a new evaluation track on the Hugging Face Open ASR Leaderboard. Better signal, harder to game. 📖 https://lnkd.in/eFibXcES https://lnkd.in/eXr54bcd

  • Appen

    Appen partnered with Hugging Face to bring private, benchmarking-resistant ASR evaluation datasets to the Open ASR Leaderboard.

    The leaderboard has been visited over 710,000 times since 2023. But public benchmarks have a problem: models can be optimized to climb rankings without actually performing better in the real world. That's benchmaxxing.

    Our contribution: seven private English ASR datasets spanning scripted and conversational speech across American, Australian, Canadian, and Indian accents. Because they're kept private, they can't be gamed, making leaderboard results more trustworthy.

    The data speaks for itself. When our datasets are included, model rankings shift. That's the signal a public-only benchmark can't give you.

    Read the full story → https://lnkd.in/eFibXcES

    #SpeechAI #ASR #AIEvaluation #HuggingFace #Benchmarking

  • Appen

    If you’re at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) this week, join us for an evening with Appen and Hugging Face.

    May 6 (Wednesday) | 6:00-9:00 PM

    We’re bringing together researchers, engineers and practitioners working across speech, audio, NLP and multimodal AI for a casual happy hour. Whether you want to exchange ideas or just unwind after a full day of sessions, this is the space for it.

    No pitches. No demos. Just people, drinks and real conversations.

    If you're around, come by and say hi.

    Register here: https://luma.com/7j9nagtx
