Dima Galat
New York, New York, United States
3K followers
500+ connections
About
Dynamic technical leader with strong quantitative research skills and a proven track…
Experience
Explore more posts
-
Paul McDonald
Aligned Labs • 11K followers
Just-in-Time UI: At Aligned Labs we frequently share reports about the data collection, fine-tuning and eval work that we do with clients. They often need different views or refined analysis, which is why we built Whiz-bang Boom! Our new tool takes a CSV or JSON file and generates a shareable, interactive report. You can easily modify the report or focus the analysis with a simple prompt, and it creates a new revision. You can organize reports into spaces and customize those spaces with a simple prompt. Try it out. https://wizbangboom.com #businessIntelligence #AI
25
2 Comments -
Vasilije Markovic
cognee • 5K followers
Many of our users have been asking for stronger guarantees around structured outputs and less prompt drift as projects grow. We listened. Today we’re shipping BAML (by Boundary (YC W23)) support in cognee: type-safe, schema-aligned LLM calls with dynamic Pydantic→BAML mapping and validation, without having to change the design of your cognee pipelines. Flip a flag to switch between BAML and Instructor. With BAML in cognee, you get production hygiene: fewer parsing errors, clearer versioning, safer changes, faster iteration, and more. Read about the full benefits at the link below. Huge thanks to the Boundary (YC W23) team for their support during integration. We deeply appreciate their work and we’re excited about what this update unlocks for cognee users worldwide.
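To make the "schema-aligned" idea concrete, here is a minimal sketch of validating LLM output against a Pydantic model. The `call_llm` stub and the `Entity` schema are illustrative assumptions, not cognee's or BAML's actual API:

```python
# A minimal sketch of schema-aligned LLM output with Pydantic validation.
# `call_llm` is a hypothetical stand-in for whatever client you use; it is
# NOT cognee's or BAML's API, just an illustration of the validation pattern.
from pydantic import BaseModel, ValidationError


class Entity(BaseModel):
    name: str
    type: str
    confidence: float


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call that returns a JSON string."""
    raise NotImplementedError


def extract_entity(text: str) -> Entity | None:
    raw = call_llm(
        f"Extract one entity from: {text}. "
        "Reply as JSON with keys name, type, confidence."
    )
    try:
        # Parse and validate against the schema instead of trusting raw JSON.
        return Entity.model_validate_json(raw)
    except ValidationError:
        # Schema-aligned calls surface parsing errors here instead of
        # letting malformed output flow into downstream pipelines.
        return None
```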
62
12 Comments -
Ritvik Pandey
Pulse • 15K followers
Today the Pulse team published a deep dive on why a single “accuracy” score doesn’t tell you if a document extraction system will survive in production. The goal here is to lay out an introductory but still rigorous evaluation methodology - we have an exciting open-source benchmark building on this research coming out very soon. Let’s do the math: take 1,000 pages, each with 200 data elements. A model that’s 98% “accurate” on paper still produces 4,000 incorrect values. Now suppose some of those errors are:
1/ Broken reading order that scrambles multi-column layouts
2/ Tables with shifted columns or missing headers
3/ Cross-page context lost entirely
That’s enough to silently corrupt an entire dataset without throwing a single error. We’ve processed hundreds of millions of pages and built a multi-axis evaluation framework to measure what actually matters: reading order validation, region-level ANLS, reading order accuracy, TEDS for table structure, and continuity checks across page boundaries. The result? Fewer silent data corruptions, more predictable performance, and pipelines that keep working on the next million documents you haven’t seen yet. Full technical write-up in the comments!
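The back-of-envelope arithmetic behind those numbers, as a quick sketch:

```python
# Back-of-envelope arithmetic from the post: a "98% accurate" extractor
# still produces thousands of wrong values at corpus scale.
pages = 1_000
elements_per_page = 200
accuracy = 0.98

total_elements = pages * elements_per_page          # 200,000 extracted values
expected_errors = total_elements * (1 - accuracy)   # 4,000 wrong values

print(f"{total_elements:,} extracted values -> "
      f"~{expected_errors:,.0f} expected to be wrong")
```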
36
2 Comments -
Jeff Huber
Chroma • 7K followers
We just open-sourced an eval harness for retrieval — built for the community. Accurate retrieval is hard to build. Not because embeddings are complicated — but because there's no easy way to test what's actually working. Run a sweep across chunking strategies, embedding models, and search parameters. Find what actually moves your recall. What's your current approach to evaluating retrieval? Vibes-based testing or something more rigorous? Link to the repo in the comments 👇 🎁 Day 4 of 12
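As a rough illustration of what such a sweep measures, here is a minimal recall@k grid in Python. The `search` callable and the parameter grid are hypothetical assumptions, not the harness's actual API:

```python
# A minimal sketch of a retrieval eval sweep: grid over chunk sizes and top-k,
# measure recall against labeled (query -> relevant doc ids) pairs.
# `search` is a hypothetical stand-in for your indexing + retrieval stack.
from itertools import product
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / max(len(relevant), 1)


def sweep(queries: dict[str, set[str]],
          search: Callable[[str, int, int], list[str]]) -> None:
    # search(query, chunk_size, k) -> ranked doc ids under a given index config
    for chunk_size, k in product([256, 512, 1024], [5, 10, 20]):
        scores = [recall_at_k(search(q, chunk_size, k), relevant, k)
                  for q, relevant in queries.items()]
        avg = sum(scores) / len(scores)
        print(f"chunk={chunk_size:<5} k={k:<3} recall@k={avg:.3f}")
```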
59
1 Comment -
Daniel Walsh
The AI Automators • 1K followers
Vector search is not a silver bullet for RAG. If your agent relies on similarity alone, you're leaving massive gaps in your retrieval. Vector search is brilliant for semantic queries. But entire categories of questions need exact match, structure, aggregation, graphs, and multimodal retrieval. In my latest video, I walk through 9 real-world cases, showing where vector search breaks and the retrieval engineering patterns that make agents reliable in production with n8n. We also share the hybrid retrieval n8n workflows we use. What you will learn:
🔹 Why similarity is not the same as relevance, and how to detect it
🔹 When to use keyword and pattern matching for IDs and error codes
🔹 Structured lookups and SQL for tabular questions
🔹 Reranking and metadata filtering for recency and specificity
🔹 Aggregation by counting and computing across records
🔹 GraphRAG for global patterns and multi-hop reasoning
🔹 Multimodal RAG to retrieve images and diagrams
🔹 Post-processing with tools for calculations and trend analysis
🔹 Handling false premise questions with verification and evals using DeepEval
Generic chat with your data is easy. Production retrieval demands engineering. What retrieval challenges are breaking your RAG agents? Link to the full video in the comments 👇
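As a rough sketch of the hybrid idea (not the n8n workflows from the video), here is score fusion of vector similarity with exact-match boosting and a recency filter. The `vector_search` callable and the document fields are illustrative assumptions:

```python
# A minimal sketch of hybrid retrieval: exact-match boosting for IDs and error
# codes, a metadata filter for recency, and simple score fusion with vector
# similarity. `vector_search` is a hypothetical stand-in for your vector store.
import re
from datetime import datetime, timedelta


def hybrid_search(query: str, vector_search, docs_by_id: dict, top_k: int = 5):
    # 1. Semantic candidates from the vector store: [(doc_id, similarity), ...]
    candidates = dict(vector_search(query, k=50))

    # 2. Exact-match boost for tokens similarity often misses, e.g. "ERR-4012".
    for token in re.findall(r"\b[A-Z]{2,}-\d+\b", query):
        for doc_id, doc in docs_by_id.items():
            if token in doc["text"]:
                candidates[doc_id] = candidates.get(doc_id, 0.0) + 1.0

    # 3. Metadata filter: drop stale documents before ranking.
    cutoff = datetime.now() - timedelta(days=365)
    ranked = sorted(
        ((doc_id, score) for doc_id, score in candidates.items()
         if docs_by_id[doc_id]["updated_at"] >= cutoff),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]
```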
23
5 Comments -
Victor Okolie
Aether Grid • 433 followers
The Rise of Autonomous Research Architectures
The release of Andrej Karpathy’s autoresearch tool marks a pivotal shift in how we conceptualize AI-driven workflows. We are moving beyond simple RAG pipelines into the era of Autonomous Research Platforms—systems capable of generating hypotheses, running simulations, and producing reproducible code without human intervention. For architects and founders, this opens several high-value frontiers:
1. Research-as-a-Service (RaaS): By exposing simulation environments via MCP servers or APIs, startups can offer automated backtesting for everything from trading strategies to pricing models. This transforms static data into dynamic intelligence.
2. Autonomous Venture Studios: We are seeing the emergence of "Strategy Simulation Engines." These agents don't just brainstorm; they simulate market demand and rank startup concepts by success probability. This is the industrialization of the EPD lifecycle.
3. Self-Optimizing Systems: Perhaps the most technical opportunity lies in Prompt Evolution and Model Architecture agents. By implementing a "Generate-Test-Score-Repeat" loop, we can automate the optimization of machine learning architectures and prompt performance, effectively creating a Continuous Software Optimization Engine.
The core takeaway is clear: the value is migrating from the model itself to the orchestration of the experimentation loop. Whether you are building for biotech (Drug Discovery) or finance (Autonomous Trading Labs), the goal is to minimize the cost of curiosity through agentic automation. We are no longer just building tools; we are building systems that build insights. Check out autoresearch here: https://lnkd.in/dueYAXwv #AI #Autoresearch
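A minimal sketch of such a "Generate-Test-Score-Repeat" loop, assuming hypothetical `mutate_prompt` and `score_prompt` helpers (say, an LLM proposing rewrites and an eval suite returning a metric):

```python
# A minimal sketch of a "Generate-Test-Score-Repeat" loop for prompt
# optimization. `mutate_prompt` and `score_prompt` are hypothetical stand-ins,
# not part of any specific tool mentioned in the post.
def optimize_prompt(seed: str, mutate_prompt, score_prompt,
                    generations: int = 10, population: int = 8) -> str:
    best, best_score = seed, score_prompt(seed)
    for _ in range(generations):
        # Generate: propose variants of the current best prompt.
        variants = [mutate_prompt(best) for _ in range(population)]
        # Test and score: evaluate each variant on a fixed task suite.
        scored = [(score_prompt(v), v) for v in variants]
        top_score, top_variant = max(scored)
        # Repeat: keep the best candidate found so far.
        if top_score > best_score:
            best, best_score = top_variant, top_score
    return best
```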
2
-
Nishantha Ruwan
IWROBOTX Software Inc. • 2K followers
The authors present WIMHF, a method based on sparse autoencoders to uncover interpretable features in human preference feedback datasets. They apply it across seven datasets to reveal both what kinds of preferences the datasets are capable of measuring (e.g., length, tone, refusal behavior) and what annotators actually express. They find human feedback is far more diverse than commonly assumed, and reveal surprising patterns (for instance, some communities disfavor refusals and even favor toxic content). By identifying these features, WIMHF can support data curation and personalization: when harmful feedback examples were re-labelled, significant safety gains resulted (a 37% improvement) with no loss in general performance. Fine-grained weights over subjective features also enable better per-annotator preference prediction. The method thus provides a human-centred way for practitioners to better understand and leverage preference data. Demo: https://lnkd.in/g7Zdbk6s Code: https://lnkd.in/gWCi4Jxf https://lnkd.in/g8btyw5d
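For readers unfamiliar with the underlying technique, here is a minimal sparse autoencoder sketch of the kind WIMHF builds on. It illustrates the general recipe (overcomplete ReLU features with an L1 penalty), not the paper's implementation:

```python
# A minimal sparse autoencoder sketch: encode embeddings of preference data
# into an overcomplete, L1-penalized feature space so individual features
# become interpretable. Illustration only, not the WIMHF codebase.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_input: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_input, d_features)  # d_features >> d_input
        self.decoder = nn.Linear(d_features, d_input)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # sparse, non-negative codes
        recon = self.decoder(features)
        return recon, features


def sae_loss(x, recon, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that drives feature sparsity.
    return ((recon - x) ** 2).mean() + l1_coeff * features.abs().mean()
```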
-
Huy Nhat Phan
Drylab AI • 2K followers
Reimagining RL for Long-Running, Complex LLM Agent Tasks
Traditional RL methods like GRPO + tool masking (e.g., Retool-RL) work fine—if your agent only needs to make 3–5 tool calls. Think toy tasks: simple calculator math or RAG for basic QA. But let’s be real: real-world agent tasks—like debugging code or developing features—are messy. They involve 20+ tool calls, long reasoning chains, and cascading errors that even top-tier models (like Claude Sonnet-4 or GPT-4.1) struggle to recover from.
🧠 So why not just train with RL on these long tasks? Some teams tried (e.g., SkyRL)… but RL barely moved the needle: SWE-bench went from 11% → 14.6%. That’s not enough. Here’s the problem:
🔁 RL on long-horizon tasks is bottlenecked at rollout time. In my experiments training vision agents at LandingAI, rollout (generating experience traces) took 75%+ of the training time. That’s not because backprop is slow—updating the model is cheap. It’s because each task takes hundreds of seconds to complete, and future tasks will take days.
💣 And worse: because the per-action success rate is low (say 20%), the probability of a fully successful N-step trace is 0.2^N, so you need on the order of 1 / (0.2^N) rollouts to get a single positive learning signal, and that signal quickly vanishes in the flood of negative gradients from all the failed traces.
So what actually works? Curriculum. Break down the task → Train specialists → Synthesize good traces → Bootstrap a generalist → RL fine-tune. This “specialist-to-generalist” recipe increases early success rates and makes RL training actually feasible.
🧪 This approach is gaining momentum:
ether0 (FutureHouse): Train RL specialists on chemistry subtasks, synthesize CoT + tool traces, and bootstrap a generalist with SFT before final RL tuning.
SWE-Swiss (https://lnkd.in/g4TCpgSz): Break SWE-bench into Localization, Repair, and Unit Test Generation → train specialists → synthesize traces → build a generalist → RL on full tasks. Result: state-of-the-art 60.2% on SWE-bench Verified.
🧩 The New Recipe (summarized):
Decompose long tasks into easier, verifiable subtasks
Train specialist models with RL for each subtask
Use them to generate clean traces (Chain-of-Thought + tool use)
Train a new generalist model on these synthetic traces
Apply RL to the generalist on full, long-running tasks
This boosts early success rate → stronger learning signal → higher sampling efficiency. If you’re building RL agents for real-world workflows, this curriculum-based approach is worth a serious look. Happy to share ideas or diagrams if you’re working on something similar. Let’s push this frontier forward 🚀 We're building the next generation of bioinformatic agents at https://www.thedrylab.com/
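The rollout arithmetic in that paragraph, as a tiny sketch:

```python
# The rollout arithmetic from the post: with a low per-action success rate p,
# the probability of a fully successful N-step trace is p**N, so you need on
# the order of 1 / p**N rollouts to see one positive learning signal.
def expected_rollouts(p_success: float, n_steps: int) -> float:
    return 1.0 / (p_success ** n_steps)


for n in (3, 5, 10, 20):
    print(f"N={n:<3} expected rollouts per success ≈ "
          f"{expected_rollouts(0.2, n):,.0f}")
```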
25
-
Nicholas Arcolano, Ph.D.
Jellyfish • 2K followers
When talking with folks about evaluating AI coding tools, one metric that comes up frequently is "acceptance rate". It sounds straightforward: track all of the code that AI writes, and use how often a human developer accepts or rejects proposed changes as a measure of tool efficacy. But if you stop for a moment to think about the difference between, say, inline autocomplete in Copilot versus interactive agentic coding across multiple files with Claude Code, things start to get... confusing. For example, Cursor suggestions get accepted 81% of the time, but developers only end up keeping about half the lines. "Accepted" and "kept" are very different things — and that gap is exactly why this is harder to measure than it looks. Are you as confused as I am? Check out this Jellyfish Research post by my colleague Tomas Pardiñas where he breaks down the different coding tools and the different ways we use them and talks about what "acceptance rate" means across various types of AI development workflows. Worth a read, especially if you're trying to make sense of your own team's numbers: https://lnkd.in/eRGK5TPZ
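A toy illustration of that gap, using the rough rates from the post (the raw counts below are made up for the example; only the 81% acceptance and roughly-half retention figures come from the post):

```python
# Acceptance rate vs. line retention: a suggestion can be "accepted" in the
# editor while most of its lines are later rewritten or deleted.
suggestions = 1_000
accepted = 810                 # ~81% acceptance rate (figure from the post)
lines_accepted = 10_000
lines_kept = 5_000             # ~half the accepted lines survive

acceptance_rate = accepted / suggestions
retention_rate = lines_kept / lines_accepted
print(f"acceptance: {acceptance_rate:.0%}, line retention: {retention_rate:.0%}")
```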
35
2 Comments