We (my agents and I) deleted 2,000 lines of code yesterday. The system works better.

Our Control Tower - the brain that coordinates our AI agents - had grown to 2,700 lines across 3 files. 20+ config knobs. Observation backoff curves. Multi-tier escalation. Event debouncing. Cycle versioning.

It didn't work. Agents weren't receiving tasks. Session keys were overflowing. The observation loop was sleeping for minutes between cycles. Our review agent was rubber-stamping everything.

Every fix added more mechanism. Every mechanism added more failure modes. The codebase was a graveyard of solutions to problems caused by previous solutions.

So we deleted it all.

The rewrite: 650 lines. Three functions.

gather_state()
plan()
dispatch()

One loop, ticks every 60 seconds. The LLM still makes all the decisions - we just removed everything standing between "here's the state" and "do what it says."

Before: 2,700 lines, 50+ functions, 20+ config knobs
After: 650 lines, 4 core functions, 7 config values

Result: fully autonomous, dispatching tasks, agents building, dual review pipeline running, 1,294 tests passing.

The lesson we should have started with:

Complexity is not sophistication. It's a sign you stopped asking whether the simpler solution would work. Every mechanism you add is a mechanism that can break. Every config knob is a knob someone will set wrong.

The best version of our system is the one where we deleted 2,000 lines. If you can't name the simpler alternative you considered, you haven't thought hard enough.
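For readers who want to picture the shape of that loop, here is a minimal sketch. Only the three function names and the 60-second tick come from the post; the `State` and `Task` shapes, the `run` helper, and the stand-in planner are assumptions made for illustration, and the real planner is an LLM call rather than the simple pairing shown here.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    agent: str
    instruction: str


@dataclass
class State:
    # Hypothetical shape: which agents are idle and what work is waiting.
    idle_agents: list[str] = field(default_factory=list)
    backlog: list[str] = field(default_factory=list)


def gather_state(idle_agents: list[str], backlog: list[str]) -> State:
    """Snapshot what the planner needs. No caching, no backoff."""
    return State(idle_agents=list(idle_agents), backlog=list(backlog))


def plan(state: State, decide: Callable[[State], list[Task]]) -> list[Task]:
    """Hand the state to the decision-maker and return its tasks unchanged."""
    return decide(state)


def dispatch(tasks: list[Task], send: Callable[[Task], None]) -> int:
    """Send every planned task and report how many went out."""
    for task in tasks:
        send(task)
    return len(tasks)


def run(get_inputs, decide, send, tick_seconds=60.0, max_ticks=None, sleep=time.sleep):
    """One loop: gather, plan, dispatch, then wait for the next tick."""
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        idle_agents, backlog = get_inputs()
        dispatch(plan(gather_state(idle_agents, backlog), decide), send)
        ticks += 1
        if max_ticks is None or ticks < max_ticks:
            sleep(tick_seconds)
    return ticks


def pair_idle_agents_with_backlog(state: State) -> list[Task]:
    """Stand-in for the LLM planner: one backlog item per idle agent."""
    return [Task(agent, item) for agent, item in zip(state.idle_agents, state.backlog)]
```

Passing `decide`, `send`, and `sleep` in as arguments keeps the loop itself free of mechanism, which is the point of the rewrite: anything clever lives in the planner, not in the plumbing around it.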
Deleting 2000 lines of code simplifies AI system
More Relevant Posts
-
Three bugs. Decomposed to six tickets. Fully triaged. Prioritized. Organized into iterations.

I asked my machine to explain and describe itself and fed that markdown file to Gemini 3.1 Pro and asked it to find the bugs. It found 3 in ~2 minutes.

→ Paths that break on any machine but mine
→ No live query to the sprint board - the agent was reading a stale cache
→ The sprint plan being manually updated by the agent (race condition with truth)

Nice. Then I went for a walk. I talked to a different AI about the architecture the whole time. Saved the outputs as I went.

When I came back I dropped those docs into a shared folder and typed one command. Six more tickets appeared on the sprint board. Fully prioritized, parsed to the correct iterations. On rails. The machine read its own critique and scheduled the fixes.

That's the part nobody talks about: It's not the AI writing code that matters. It's the machine reviewing the system that runs the machine - and routing the findings back into its own memory.

The loop is closing.
-
Part 2 of my AgentOps series is live. This one is about observability - specifically what it actually takes to trace an agent run end to end, and why your existing stack probably isn't doing it.

The short version: if your spans only capture LLM calls, you're missing most of the story. Tool invocations, reasoning steps, memory state, and inter-agent handoffs all need to be in the trace. Without them you're not debugging - you're guessing.

A few things I cover:
- Why the three pillars (metrics, logs, traces) still apply but point at a completely different target
- The five layers a complete agent trace actually needs
- Why OTel is the right bet right now, and where it hits a ceiling as agent systems get more complex
- How to prioritize instrumentation if you're starting from zero

Platforms like DataRobot have already adopted OTel as the observability backbone for agentic workflows. The enterprise side is converging on this standard. Your stack should be there too.

That said, OTel was built for distributed systems, not autonomous agents. It doesn't have a native concept of memory state, belief, or governance policy. Worth watching that gap closely.

Link below!

#AIAgents #MLOps #Observability #SoftwareEngineering #DataEngineering
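To make "everything in one trace" concrete, here is a small standard-library sketch. It is deliberately not the OpenTelemetry API: `AgentTrace`, `Span`, and the five span kinds are names invented here, taken from the step types the post lists (LLM calls, tool invocations, reasoning steps, memory state, inter-agent handoffs). It only shows the idea that every kind of step gets a span with a parent link inside a single trace.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional

# Assumed span kinds, one per step type named in the post.
SPAN_KINDS = {"llm_call", "tool_call", "reasoning", "memory", "handoff"}


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    kind: str
    name: str
    attributes: dict = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0


class AgentTrace:
    """Collects every span of one agent run under a single trace id."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []
        self._stack = []  # currently open spans, innermost last

    @contextmanager
    def span(self, kind, name, **attributes):
        if kind not in SPAN_KINDS:
            raise ValueError(f"unknown span kind: {kind}")
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(
            trace_id=self.trace_id,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent_id,
            kind=kind,
            name=name,
            attributes=dict(attributes),
            start=time.monotonic(),
        )
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.monotonic()
            self._stack.pop()

    def kinds_seen(self):
        """Which step types actually showed up in this run."""
        return {span.kind for span in self.spans}
```

A quick coverage check on a finished run is `SPAN_KINDS - trace.kinds_seen()`: if that set is not empty, some layer of the agent is running without leaving anything in the trace.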
-
Your closed tickets probably contain most of the answers needed for new tickets.

We built a support bot on Inngest, for Inngest 🤝 that pulls full thread context, lookalike tickets, and docs to draft an internal response. A human support agent then reviews, and uses that information to respond to the customer faster.

Here's how it works, and why durable execution has to be the foundation:

1️⃣ The support bot runs as an Inngest function triggered by a thread event.
2️⃣ Each investigation step (fetch thread, extract question, search tickets, run docs agent, write draft) is a named step in an Inngest workflow.
3️⃣ The memory-building pipeline (the one that indexes historical tickets into pgvector) is also built as a set of Inngest functions: an orchestrator, an embedding processor, and a database writer.
4️⃣ Inngest then handles the different "execution shapes" cleanly: the live investigation flow vs. the batched indexing pipeline, with retries and rate-limiting built in.

Full details on the architecture, the retrieval design, and the guardrails that made it trustworthy enough to use in production in our latest blog by Jakob Evangelista! https://lnkd.in/ghDxJ_ct
-
MCP is not a faster way to do the same thing. Most teams connecting Claude to their CI pipeline are framing it as a time-saver. The regression investigation drops from 2–3 hours to 45 minutes and they call it a win. That framing misses what actually changed. What changed is visibility. Before MCP, AI in testing meant generation without observation: write the code, run the tests, read a report separately, manually correlate failures to changes. The feedback loop had humans at every seam. MCP collapses those seams. I started my PhD recently on exactly this problem. Agent quality is bounded by what the agent can observe. The AgentOccam results show a 161% improvement just from improving the observation space. Same model, better input. The same logic applies to test infrastructure. An agent that can read CI logs, pull Git diffs, query test management state, and capture browser snapshots in one workflow operates in a fundamentally different observation space than one that can’t. The teams treating MCP as a productivity shortcut will get faster regression investigations. The teams treating it as feedback architecture will get agents that can close the loop between what they generate and what breaks. That distinction will matter more than any model upgrade in 2026.
-
First eval run: 25/30 passing. I debugged the 5 failures. 3 of them were bugs in my judge, not the agent. The LLM returned scores as strings instead of numbers. My schema validator rejected them silently. The system I was measuring was fine. My ruler was broken. I spent 3 weeks building a clinical research agent as a way to learn the Claude API deeply. Three agents, three MCP servers, an eval suite with regression detection. The goal was production-grade architecture, not a demo. The part I didn't expect: most of the hard problems weren't in the agent logic. They were in measurement and failure modes. Adding RAG broke every adversarial test. Semantic search always returns something. There's no "not found." The agent searched for a fake trial, got back chunks about real trials, and answered confidently with wrong data. One line in the prompt fixed it. Repo is public with architecture diagram, eval results, and cost breakdown: https://lnkd.in/eegiARzz If you're building agent systems or healthcare AI, I'd love to trade notes.
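The judge bug described above (scores arriving as strings, a validator dropping them silently) is easy to reproduce, so here is one way it could be guarded against. This is a sketch, not the code from the linked repo: `parse_score`, `JudgeOutputError`, and the 0-to-1 score range are assumptions chosen for illustration.

```python
from typing import Any


class JudgeOutputError(ValueError):
    """Raised when the judge's score cannot be used, so failures are loud."""


def parse_score(raw: Any, low: float = 0.0, high: float = 1.0) -> float:
    """Accept numbers and numeric strings; reject everything else with an error."""
    if isinstance(raw, bool):
        # bool is a subclass of int in Python, so rule it out explicitly.
        raise JudgeOutputError(f"score must be numeric, got bool: {raw!r}")
    if isinstance(raw, (int, float)):
        score = float(raw)
    elif isinstance(raw, str):
        try:
            score = float(raw.strip())
        except ValueError as exc:
            raise JudgeOutputError(f"score is not numeric: {raw!r}") from exc
    else:
        raise JudgeOutputError(f"unsupported score type: {type(raw).__name__}")
    # NaN and infinities fail this comparison and are rejected as well.
    if not (low <= score <= high):
        raise JudgeOutputError(f"score {score} outside [{low}, {high}]")
    return score
```

The design choice that matters is the exception: a judge that raises on bad output shows up as a broken ruler in the eval report, instead of being counted as an agent failure.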
-
I've been routing my Claude Code tasks through GLM-5.1 for two weeks. $180/mo → ~$124/mo. Same workflow, same tool. The trick: not every task needs frontier reasoning. New article — full breakdown with code and cost math.
-
I asked Claude to audit my audit 😄

Not a summary. Not a pitch. The actual receipts. I handed it the findings and said: “what do you see?” And honestly… the answer was kind of wild.

Across five repos from the same ecosystem, the authority wasn’t even close:
• one = basically a tool (no identity, no escalations)
• another = 123 identity authority hits + 500+ escalations

Same lineage. Completely different vibe under the hood.

And this line stuck with me: “The authority is embedded in the architecture, not announced in the UI.” That’s the whole thing. HARS doesn’t care if a system sounds helpful. It looks at what it’s structurally capable of doing to you. No opinions. No theory. Just receipts.

If you’re curious (or skeptical - please be 😄)
→ https://lnkd.in/eiQCQrhU
→ https://lnkd.in/enujYmZs

The biggest takeaway? Not all AI tools are the same… even when they come from the same place. Some are tools. Some are… something else.

Tagging Anthropic because this actually highlights something important: Copying code ≠ copying judgment.

#ClaudeLeak #ResponsibleAI #AISafety #AIethics #AIGovernance
-
In practice, many teams run into the same limitation when working with LLM agents — the problem is often not the model, but the system design around it. When agents start working with logs, events, or other growing datasets, the context window stops being just a reasoning space. It becomes a mix of memory, transport, and partial state, which leads to stop-and-go execution and unstable results. The underlying issue is architectural: reasoning and data flow should not share the same channel. Large datasets need to be processed outside the model, while the LLM operates on small, targeted slices. → Full article: https://lnkd.in/dSVqaxZf
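As a rough illustration of keeping data flow out of the context window, the sketch below filters a log in ordinary code and hands the model only a bounded slice plus a couple of counts. The function name `select_slice`, the keyword filter, and the line and character budgets are all assumptions for the example; the linked article may structure this differently.

```python
def select_slice(lines: list[str], keyword: str, max_lines: int = 20, max_chars: int = 2000) -> dict:
    """Reduce a large log to a small, targeted slice before any LLM call."""
    matches = [line for line in lines if keyword in line]
    recent = matches[-max_lines:]  # newest matching lines only

    # Enforce a character budget, keeping the newest lines first.
    kept: list[str] = []
    used = 0
    for line in reversed(recent):
        if used + len(line) > max_chars:
            break
        kept.append(line)
        used += len(line)
    kept.reverse()  # restore chronological order

    return {
        "total_lines": len(lines),       # context for the model, not the data itself
        "matching_lines": len(matches),
        "slice": kept,
    }
```

The full dataset never enters the prompt. The model sees how much there was, how much matched, and the handful of lines it actually needs to reason about.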
-
We recently hit a strange issue in production. Accounts were getting debited twice. Not always. Randomly. And we couldn’t reproduce it.

Everything looked normal:
- validation passed
- idempotency logic in place
- no errors in logs

And yet… something felt off. We started digging. The code was clean. Tests were passing. Nothing obvious was broken. But under load, the system wasn’t behaving the way we expected.

It took a while to see it. Multiple retries were being processed… independently. Across different service instances. Before the idempotency key was fully persisted and visible.

The system didn’t fail inside the code. It failed in the gap between:
- instances
- timing
- state propagation

This is the pattern I’m seeing more often: We validate logic in isolation. But failures emerge in interaction. Especially under:
- concurrency
- network jitter
- delayed state propagation

Local correctness ≠ System correctness

Pyramid → validates code
Honeycomb → validates interactions

As systems become:
- distributed
- async
- stateful

Validation needs to move:
→ from functions
→ to boundaries
→ to behavior over time

And AI is making this sharper. We can generate “correct” code faster than ever. But correctness in distributed systems is not a property of code. It’s a property of behavior.

I broke this down in Part 6 of my Quality Architecture series: 👉 https://lnkd.in/gNiXSUEd

Curious: Have you seen bugs that only appear under concurrency or multi-instance scenarios? What caught them - tests or production?
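One common way to close that gap is to make claiming the idempotency key and applying the debit a single atomic step against a store every instance shares. The sketch below uses SQLite from the standard library purely for illustration; the table names, `open_store`, and `debit_once` are hypothetical, and it assumes all service instances write to the same transactional database, which may not match the system in the post.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN / COMMIT statements below control the transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("CREATE TABLE IF NOT EXISTS idempotency_keys (key TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS balances "
        "(account TEXT PRIMARY KEY, amount INTEGER NOT NULL)"
    )
    return conn


def debit_once(conn: sqlite3.Connection, key: str, account: str, amount: int) -> bool:
    """Apply the debit only if this call is the first to claim the key.

    Returns True when the debit was applied, False for a duplicate retry.
    """
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before checking anything
    try:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO idempotency_keys (key) VALUES (?)", (key,)
        )
        claimed = cursor.rowcount == 1  # 0 means the key already existed
        if claimed:
            conn.execute(
                "UPDATE balances SET amount = amount - ? WHERE account = ?",
                (amount, account),
            )
        conn.execute("COMMIT")
        return claimed
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Because the key insert and the balance update commit together, there is no window in which a retry can see "no key yet" while the first attempt is still in flight. A check-then-write done in two separate steps leaves exactly that window open.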
-
If you’ve been using LLMs and RAG systems for some time now and have a new baseline for your expectations, you are likely running into frustration like I am.

You’ve probably already figured out the "needle in a haystack" problem: finding the right fact buried inside a massive document or context window. It’s a hard problem and smart teams are solving it. But there's a second problem hiding inside the first one.

When an agent's context window grew past 100k tokens, something unexpected happened. It didn't hallucinate or fail. It started **repeating itself**, cycling through actions it had already taken, as if the current problem didn't exist, because its history had effectively drowned it out. The needle wasn't just hard to find. The model stopped looking for it.

If you've ever had a long conversation with an LLM and felt like it "forgot" something you said 20 messages ago, you've already experienced a mild version of this. Now scale that to an autonomous agent running hundreds of tool calls, making decisions worth real money or real risk. The failure mode doesn't just get bigger. It gets harder to detect.

The architecture in this image (from Galileo's Mastering RAG) shows exactly where the pressure point is. The LLM Router sits at the center of everything, interpreting queries, choosing between semantic search, structured data, live APIs, and direct response. When that router's context gets polluted by its own history, the whole system degrades silently.

For engineering and product leaders, the honest questions are:
- Are you testing your agents at realistic context lengths, not just short demos?
- Do you have visibility into *what's actually in context* at each reasoning step?
- Is your retrieval layer designed to surface what's relevant *now*, not just what exists?

Context management has become the difference between an agent that scales and one that confidently circles the drain.
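Since the failure described here is silent, one cheap guard is to check the agent's own action history for a repeating tail. The following is a minimal sketch under stated assumptions: actions are recorded as a list of comparable values (strings here), and `find_repeating_cycle` with its thresholds is a hypothetical helper, not something taken from the referenced material.

```python
from typing import Optional


def find_repeating_cycle(
    actions: list[str], max_cycle_len: int = 5, min_repeats: int = 3
) -> Optional[list[str]]:
    """Return the cycle the history currently ends with, or None.

    A cycle of length k counts when the last k * min_repeats actions
    consist of the same k actions repeated back to back.
    """
    for cycle_len in range(1, max_cycle_len + 1):
        needed = cycle_len * min_repeats
        if len(actions) < needed:
            break  # longer cycles need even more history
        tail = actions[-needed:]
        cycle = tail[-cycle_len:]
        chunks_match = all(
            tail[i:i + cycle_len] == cycle for i in range(0, needed, cycle_len)
        )
        if chunks_match:
            return cycle
    return None
```

Running this after every tool call turns "the agent is circling" from something noticed days later in a bill into a signal that can trigger trimming the context or stopping the run.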