Freeplay

Software Development

Boulder, Colorado · 1,545 followers

The ops platform for AI engineering teams.

About us

A better way to build with LLMs. Bridge the gap between domain experts & developers. Prompt engineering, testing & evaluation tools for your whole team.

Website
http://freeplay.ai
Industry
Software Development
Company size
11-50 employees
Headquarters
Boulder, Colorado
Type
Privately Held
Founded
2022
Specialties
Artificial Intelligence, Developer Tools, and Evals

Updates

  • Freeplay reposted this

    Boulder AI Builders is back for Boulder Startup Week -- Thursday May 7th! RSVP link in the comments. Past BSW events have been our biggest meetups of the year, and we expect this one will be again. Come see what local companies are building in AI (and sign up to demo if you have something to show off!). We've been so proud to play a role in helping build this community alongside our partners at Ombud, Matchstick Ventures, Silicon Valley Bank and Technical Integrity. Thanks to everyone who comes out to these for making it so much fun to build AI products in Colorado. Y'all are the best. ⛰️

  • Freeplay reposted this

    🎙️ New episode of Deployed is out with Loïc Houssier, CTO at Superhuman Mail. Superhuman charges $30/month for email when Gmail is free. That’s always forced them to maintain a different quality bar from most products, and it shapes everything about how they build AI features too. Full disclosure: I've been a paying customer for ~5 years and recommend it constantly. It's one of those rare software products I’d fight to keep using. Loïc and I go back a while, and he’s one of the most fun and energized engineering leaders I’ve gotten to work with, so it was great to catch up on what he's building. A few things that stood out about his mindset and the way the Superhuman Mail team approaches building:
    ✅ "Make it work is not enough. You need to make it great." Loïc talks about how things other SaaS companies might treat as non-urgent feedback, Superhuman treats like a critical bug. This is part of how they maintain the quality bar.
    ✅ Focusing on the hardest examples for their AI evals. They build evals starting from crazy internal queries - like their CEO Rahul trying to find what type of wood he discussed with a contractor three years ago, buried in thousands of other emails.
    ✅ Engineering quality week. The first week of every quarter, the whole engineering team focuses on bugs *and* personal workflow improvement. It's dedicated time to experiment with new AI tools and improve their personal setups.
    ✅ Removing all the blockers for AI tools, e.g. a 24-hour security approval SLA, or giving engineers unlimited budgets for coding agent subscriptions. They didn't want to waste time on procurement when engineers should be experimenting.
    We also got into other topics like how they apply game design principles to building an email client (*not* tacky “gamification”), how they're thinking about AI quality for high-dimensional email problems, and why this is such a fun moment to be building software products. PS: they’re hiring! Reach out to Loïc. Link in comments to see the full episode. #AI #ProductDevelopment #Engineering

  • Freeplay reposted this

    🎙️ The next episode of Deployed is out! This one with Lijuan Qin, Head of Product for Zoom AI. It's a great listen, especially for PMs working on AI products. Link in comments. Lijuan has a PhD in AI and spent 20 years at Microsoft on NLP and video understanding before joining Zoom. She's shipping agents into a 300M+ user base, with both consumer and enterprise considerations. What I liked about this conversation: Lijuan talks about how their team has moved past the "did the AI give the right answer?" framing for evaluation. As they've shifted from Q&A chatbots to agents that complete workflows, how they measure quality has changed too. A few things that stood out:
    😬 High engagement can be a bad sign. If users keep going back and forth with your agent, the product might be failing them. Her team measures weekly retention and task completion instead of interaction volume (see the toy sketch below).
    ✅ Zoom's "conversation to completion" bet. Action items from meetings are broken today. Most AI note-takers make a to-do list and then nothing happens. Zoom wants to build agents that actually do the follow-through.
    🎯 How she scopes failure: define assumptions and success criteria upfront, box the blast radius, and then teams can experiment without approval queues.
    That first point is the one I've thought about the most. Most teams celebrate engagement numbers. If you're building an agent, it's worth asking yourself what you're optimizing for. #AIProducts #ProductManagement #ZoomAI #Agents
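    A toy sketch of that metric shift: score the agent on task completion rather than interaction volume. The session fields here are hypothetical instrumentation, not Zoom's actual telemetry.

        # Hypothetical session records: high turn counts with no completed task
        # are exactly the failure signal the episode warns about.
        sessions = [
            {"user": "a", "turns": 12, "task_completed": False},  # lots of back-and-forth
            {"user": "b", "turns": 2,  "task_completed": True},   # quick and successful
        ]

        avg_turns = sum(s["turns"] for s in sessions) / len(sessions)
        completion_rate = sum(s["task_completed"] for s in sessions) / len(sessions)

        # Celebrating avg_turns alone would reward friction; completion_rate
        # (plus weekly retention) is the signal her team optimizes instead.
        print(f"avg turns: {avg_turns:.1f}, completion rate: {completion_rate:.0%}")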

  • We're building the ops infrastructure for AI engineering. Our new Insights features combined with our MCP server totally change what's possible for debugging and improving agents fast. Check it out. 👇

    We've been thinking a lot about a workflow problem that every serious AI team faces now: How do you use agents to improve agents efficiently? Right now, when a coding agent needs to debug an AI product, the default in most cases is to try to pull down thousands of production traces and pattern-match its way to a root cause. That's slow and expensive at best. Kind of like handing someone a phone book and asking them to find the interesting people... Our new Insights agent was built to fix that. Rather than wait for a request, Freeplay's agent runs in the background, looking at production evaluator scores and reasoning traces on a schedule. It then clusters patterns, ranks issues by impact, and links each one to specific traces that show the problem. This is the kind of infrastructure coding agents need to work efficiently -- pre-computed analysis and signal so they can debug fast, instead of running expensive on-demand queries. Now when you ask Claude Code "what's wrong with our agent?", the Freeplay MCP server doesn't hand back 10,000 traces. It returns the top insights that matter this week, with the 50 traces that demonstrate them. The agent knows immediately what to look for before it reads a single log line. The old way: dig through lots of logs and hope you spot something. The new way: start with the diagnosis, then dig deeper to decide how to fix it. This same workflow also helps every human user who logs into Freeplay. Dashboards and metrics help, but Insights tell you where to look much faster. We wrote up how it works and where it fits in the broader data flywheel to continuously improve an agent. Check out the video (shoutout to Jeremy Silva), and the link in the comments.
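    For engineers curious what "insights-first, not trace dumps" could look like, here's a minimal hypothetical sketch. None of these names are Freeplay's actual MCP API; they only illustrate the pattern of returning pre-computed, ranked insights with linked example traces.

        from dataclasses import dataclass, field

        @dataclass
        class Insight:
            summary: str      # clustered failure pattern from background analysis
            impact_rank: int  # 1 = highest estimated impact this week
            trace_ids: list[str] = field(default_factory=list)  # traces that show it

        # Stand-in for analysis a background job computed on a schedule
        # (evaluator scores -> clustering -> impact ranking).
        PRECOMPUTED = [
            Insight("Retrieval returns stale docs for pricing questions", 2, ["t-104", "t-221"]),
            Insight("Tool-call timeouts on long documents", 1, ["t-007", "t-313"]),
        ]

        def get_top_insights(limit: int = 5) -> list[Insight]:
            """What a tool call might hand a coding agent: this week's top-ranked
            issues with example traces, instead of 10,000 raw logs."""
            return sorted(PRECOMPUTED, key=lambda i: i.impact_rank)[:limit]

        if __name__ == "__main__":
            for insight in get_top_insights():
                print(f"#{insight.impact_rank} {insight.summary} -> {insight.trace_ids}")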

  • Freeplay reposted this

    🎙️ New episode of Deployed is out! This one with Kevin Stanton, Distinguished Engineer at Sprout Social, Inc. Link in comments. Kevin's spent 13 years at Sprout (NASDAQ: SPT), where they build infrastructure that processes billions of social posts and turns them into signal for companies of all shapes and sizes. Now he's part of the team building Trellis, their AI agent that "turns social data into instant enterprise intelligence." What I liked about this conversation: Kevin's team didn't have deep generative AI experience going in. He's refreshingly honest about what worked, what broke, and what they learned the hard way. Lots of experienced engineering and product leaders can relate. A few insights that stood out:
    ✅ Why MCP felt more natural than RAG for their system of record. Their existing (pre-LLM) classification models became a superpower.
    ✅ "LLMs are the most expensive switch statement on the planet," aka when to collapse MCP tools and use deterministic code instead (see the sketch after this post).
    ✅ Why they pulled evals out of CI/CD after it broke things ("non-deterministic tests are a nightmare")
    ✅ The strategic reason they started with chat (see clip below)
    That last one is a good example of the practical thinking Kevin and his team bring. Chat isn't just about UX. He points out that it's also the fastest way to seed your evals with real traces, and he calls it the "product manager's holy grail" to understand what customers really want to do with your product. If you're building agents or shifting a traditional engineering team toward AI, this one's worth your time. #AIAgents #LLMs #ProductDevelopment #Engineering
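    The "expensive switch statement" point lends itself to a tiny sketch. This is purely illustrative (the intents and handlers are made up, not Sprout's code): route unambiguous requests with cheap deterministic code and save the model for cases that genuinely need it.

        # Deterministic handlers for intents with an unambiguous mapping:
        # fast, cheap, and unit-testable, unlike an LLM round-trip.
        INTENT_HANDLERS = {
            "fetch_post_metrics": lambda q: f"metrics for {q}",
            "list_scheduled_posts": lambda q: "scheduled posts",
        }

        def route(intent: str, query: str) -> str:
            handler = INTENT_HANDLERS.get(intent)
            if handler is not None:
                return handler(query)              # plain code path
            return route_with_llm(intent, query)   # model only for the long tail

        def route_with_llm(intent: str, query: str) -> str:
            # Placeholder for a real model call; only ambiguous traffic lands here.
            return f"LLM handles: {intent} / {query}"

        print(route("fetch_post_metrics", "last week"))         # no model call needed
        print(route("summarize_brand_sentiment", "last week"))  # falls back to the LLM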

  • A year into running their AI Assistant, Cisco Duo Security has built an AI evals and quality practice worth studying. Here's how they got designers, PMs, and engineers into production data together. 👇 We're proud to support their team.

    New case study out today about eval and AI optimization workflows with Cisco Duo Security, and this one's a bit different: A big part of their success involves Design. 👀 Most of the conversation about AI quality focuses on engineering. Duo deeply involves designers in AI quality work, alongside engineering and product. They shared what their process looks like, and why they do things this way. For the past year, the Duo AI Assistant team has been running what they call a "communal" quality practice. Every week, designers, PMs, data scientists, and engineers each review 15-20 real assistant conversations. They calibrate together on Fridays. They've turned cross-functional collaboration into a cadence. What stood out to me: the Design team doesn't see this as a chore. They see it as part of their user research process. Jillian Haller, Design Manager: "This is the next best thing to a contextual inquiry... We're actually able to see how the interaction unfolds." A year in, the results speak for themselves: Duo is expanding their AI Assistant to global customers, their team and leadership have clear visibility into quality, and the team has built the operational muscle to keep improving week over week. Huge thanks to Brianna Penney, Jillian Haller, Laura Cole, and Shakeel Ahamed for sharing their story with us. They built this practice without a blueprint, and now it's become a model for other teams to learn from. We're proud Freeplay is the shared surface where their cross-functional team collaborates on AI quality. Full case study in the comments. 👇

  • It's time for another Boulder AI Builders meetup! Join us Feb 11. 🙌

    IT'S TIME AGAIN! Colorado AI Builders kicks off for the new year on February 11 in Boulder. Now officially the start of our 4th year doing this. 🏔️🧠 RSVP link in the comments. The last Boulder meetup had nearly 300 people. We're excited to get back together again and show off some of the latest cool AI projects being built in CO. We've had a ton of demand to do demos at these -- way more than we have space for. If you're interested in demoing, please tell us in the RSVP form and share a quick Loom. We're excited to keep building this community together in 2026. Thanks as always to SVB, Technical Integrity, Ombud, Matchstick Ventures, Drive Capital and Freeplay for making this happen.

  • Check out the latest from our team! Automate all the tedious parts of your AI ops workflows: building queues for data review, updating datasets, triggering evals... and sending Slack notifications whenever you need them. 🙌

    If you're building production AI agents, you've probably learned the secret to AI quality is to "look at lots of data." You don't want to spend all your time searching for data to look at. We built Freeplay's new Automations feature so you can trigger evals, update datasets, curate review queues, or send Slack notifications based on custom filters for your logs. The basic idea: Build custom filters for production logs ⏩ pick an action ⏩ set a schedule (a hypothetical config sketch is below). Then...
    * Review queues populate themselves for human reviewers.
    * Complex eval logic triggers for the edge cases or issues you care about most.
    * Test datasets stay fresh with production examples.
    * Slack pings your team when something's off.
    Our team's been focused on speeding up the Ops workflow for AI engineering teams, and this is a small (but very cool!) new tool to help. Shoutout especially to Aditya Pandey, Rob Rhyne, and Nico Tonozzi for bringing this to life!
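    For a feel of the shape, here's a hypothetical filter -> action -> schedule definition in plain Python. The field names and action strings are invented for illustration, not Freeplay's actual Automations API.

        from dataclasses import dataclass

        @dataclass
        class Automation:
            name: str
            log_filter: dict  # which production logs match
            action: str       # e.g. "add_to_review_queue", "run_eval", "notify_slack"
            schedule: str     # cron-style cadence

        # Nightly: pull low-scoring production answers into a human review queue.
        nightly_triage = Automation(
            name="Flag low-scoring support answers",
            log_filter={"eval.helpfulness": {"lt": 3}, "env": "production"},
            action="add_to_review_queue",
            schedule="0 2 * * *",  # every night at 2am
        )
        print(nightly_triage)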

  • We just published a new case study with Chime. In it, they share lessons learned scaling evals and AI quality practices across their teams. Lots of teams are now at the stage of trying to figure out how to formalize operations for their AI products. Chime saw the opportunity early to involve domain experts, and what they've done now looks like a best practice in the industry. If you've heard about the importance of getting PMs or domain experts involved in evaluation and prompt engineering, this is a great case study in how to do it. Our whole team at Freeplay has been proud to support them on their journey.

    We're excited to share our latest customer case study with Chime. It talks about how they've scaled production "AI engineering" by empowering domain experts and engineers to collaborate. Chime is the #1 most-loved banking app, serving millions of members with a mission to deliver helpful, transparent, and fair financial services. They went public last year as CHYM. Going back to 2024, their AI team set a clear goal: scale production AI use cases across the company without compromising quality. This was especially important in a regulated environment where trust and accuracy aren't negotiable. Most teams hit a wall here. Domain experts know what "good" looks like, but all changes go through engineering. Iteration often slows to a crawl as a result. Chime broke through by rethinking who owns what. Engineering builds the pipelines, and domain experts own prompt performance, evals, and ground truth. The two groups work in parallel, not in sequence. And as a result, they've been able to do more with AI to better serve their members. Chime has now scaled production AI across a range of use cases. In just one financial crimes example described in the case study, they achieved:
    ✅ 40% efficiency gains
    ✅ 99%+ quality scores, outperforming human agents
    ✅ Millions of $$ in OpEx savings
    ✅ Domain experts running evals and shipping prompt improvements independently
    Huge credit to the Chime team for building a true cross-functional AI practice -- and for showing others how it can be done. We're proud that Freeplay serves as the ops layer where their Engineering, Product, and Operations teams collaborate. If you're scaling AI in a complex domain and want to see what's possible when you empower domain experts alongside engineering, the full case study is in the comments. 👇

  • Freeplay reposted this

    The AI engineering teams moving fastest on building high-quality agents have all figured out the same thing: They're not building one system. They're building two. After three years of talking to hundreds of product & engineering teams, we've realized it's essential to make this distinction explicit. There's the App Stack: models, prompts, tools, retrieval, orchestration. These are what most people think about as the components of an agent. They're the obvious parts that produce outputs. And then there's the Ops Stack: evals, test datasets, production traces, human annotations. The things that make up your testing harness for an agent, and that help you understand how it actually behaves in production. These are the parts that tell you if those agent outputs are any good. Most teams over-invest in the first and cobble together the second -- until they hit a breaking point, and quality becomes the most important thing to fix. By contrast, the teams that invest in the Ops Stack as a core part of their success early on don't just avoid the plateau. They discover they can compound improvements: Production traces surface failures. Failures become future test cases. Evals score every change or new version on those test cases before they ship. Each iteration cycle produces better data, better evals, and a faster cycle the next time (a toy sketch of the loop is below). For any engineers & PMs thinking about these dynamics, we wrote down some more thoughts on our blog -- link in the comments. Let me know what you think.
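    A toy sketch of that flywheel, where every detail is a stand-in (the point is the loop's shape, not any real API):

        # One iteration of the Ops Stack loop: production traces surface failures,
        # failures become test cases, and the growing dataset scores the next version.
        def improvement_cycle(test_dataset: list[dict], traces: list[dict]) -> list[dict]:
            failures = [t for t in traces if t["eval_score"] < 0.5]  # 1. surface failures
            new_cases = [{"input": f["input"]} for f in failures]    # 2. turn them into tests
            return test_dataset + new_cases                          # 3. evals run on this set
                                                                     #    before the next ship

        dataset: list[dict] = []
        week_1 = [{"input": "refund policy?", "eval_score": 0.3},
                  {"input": "reset my password", "eval_score": 0.9}]
        dataset = improvement_cycle(dataset, week_1)
        print(len(dataset))  # 1 new test case harvested; the harness compounds weekly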




Funding

Total rounds
2
Last Round
Seed
US$5.6M
