Challenges of Using LLMs in Non-Declarative Systems


Summary

Using large language models (LLMs) in non-declarative systems—where tasks require step-by-step logic, memory, and decision-making—poses unique challenges, including unpredictability, hallucinations, and workflow breakdowns. While LLMs excel at generating natural language, their struggles with strategic reasoning, real-world safety, and durable task execution must be addressed to build reliable AI-driven applications.

  • Prioritize workflow structure: Design your system with clear, explicit states and logical rules to guide LLMs through multi-step tasks, reducing the risk of chaotic or unstable outcomes.
  • Build in validation layers: Always implement retrieval validation, downstream calculation checks, and version control to catch hallucinated citations, math errors, or blended policy answers before they reach users.
  • Choose adaptable infrastructure: Select platforms and frameworks that support long-running, resilient processes and can evolve as your agentic workflows grow and change over time.
Summarized by AI based on LinkedIn member posts
  • View profile for Ross Dawson
Ross Dawson is an Influencer

    Futurist | Board advisor | Global keynote speaker | Founder: AHT Group - Informivity - Bondi Innovation | Humans + AI Leader | Bestselling author | Podcaster | LinkedIn Top Voice

    35,593 followers

    LLMs struggle with rationality in complex game theory situations, which are very common in the real world. However, integrating structured game theory workflows into LLMs enables them to compute and execute optimal strategies such as Nash equilibria. This will be vital for bringing AI into real-world situations, especially with the rise of agentic AI. The paper "Game-theoretic LLM: Agent Workflow for Negotiation Games" (link in comments) examines the performance of LLMs in strategic games and how to improve it. Highlights from the paper:

    💡 Strategic Limitations of LLMs in Game Theory: LLMs struggle with rationality in complex game scenarios, particularly as game complexity increases. Despite their ability to process large amounts of data, LLMs often deviate from Nash equilibria in games with larger payoff matrices or sequential decision trees. This limitation suggests a need for structured guidance to improve their strategic reasoning capabilities.

    🔄 Workflow-Driven Rationality Improvements: Integrating game-theoretic workflows significantly enhances the performance of LLMs in strategic games. Guided by principles like Nash equilibria, Pareto optimality, and backward induction, LLMs showed an improved ability to identify optimal strategies and robust rationality even in negotiation scenarios.

    🤝 Negotiation as a Double-Edged Sword: Negotiation improved outcomes in coordination games but sometimes led LLMs away from Nash equilibria in scenarios where those equilibria were not Pareto optimal. This reflects a tendency for LLMs to prioritize fairness or trust over strict game-theoretic rationality when engaging in dialogue with other agents.

    🌐 Challenges with Incomplete Information: In incomplete-information games, LLMs demonstrated difficulty handling private valuations and uncertainty. Novel workflows incorporating Bayesian belief updating allowed agents to reason under uncertainty and propose envy-free, Pareto-optimal allocations. However, these scenarios highlighted the need for more nuanced algorithms that account for real-world negotiation dynamics.

    📊 Model Variance in Performance: Different LLM models displayed varying levels of rationality and susceptibility to negotiation-induced deviations. For instance, the o1 model consistently adhered more closely to Nash equilibria than others, underscoring the importance of model-specific optimization for strategic tasks.

    🚀 Practical Implications: The findings suggest LLMs can be optimized for strategic applications like automated negotiation, economic modeling, and collaborative problem-solving. However, careful design of workflows and prompts is essential to mitigate their inherent biases and enhance their utility in high-stakes, interactive environments.
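    For small games, the Nash equilibria such a workflow targets can be found mechanically. As a minimal sketch (the payoff matrix below is the standard Prisoner's Dilemma, chosen for illustration; it is not an example from the paper), pure-strategy equilibria can be enumerated by checking that neither player gains from a unilateral deviation:

```python
from itertools import product

def pure_nash_equilibria(payoffs_a, payoffs_b):
    """Find pure-strategy Nash equilibria by best-response enumeration."""
    n_rows, n_cols = len(payoffs_a), len(payoffs_a[0])
    equilibria = []
    for r, c in product(range(n_rows), range(n_cols)):
        # Row player cannot improve by deviating to another row
        row_best = all(payoffs_a[r][c] >= payoffs_a[r2][c] for r2 in range(n_rows))
        # Column player cannot improve by deviating to another column
        col_best = all(payoffs_b[r][c] >= payoffs_b[r][c2] for c2 in range(n_cols))
        if row_best and col_best:
            equilibria.append((r, c))
    return equilibria

# Prisoner's Dilemma: action 0 = cooperate, 1 = defect
A = [[3, 0], [5, 1]]  # row player's payoffs
B = [[3, 5], [0, 1]]  # column player's payoffs
print(pure_nash_equilibria(A, B))  # [(1, 1)] -- mutual defection
```

    A structured workflow can hand the LLM this kind of computed equilibrium as grounding, rather than asking the model to reason its way to one unaided.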

  • View profile for Sohrab Rahimi

    Director, AI/ML Lead @ Google

    23,565 followers

    One of the most promising directions in software engineering is merging stateful architectures with LLMs to handle complex, multi-step workflows. While LLMs excel at one-step answers, they struggle with multi-hop questions requiring sequential logic and memory. Recent advancements, like O1 Preview’s “chain-of-thought” reasoning, offer a structured approach to multi-step processes, reducing hallucination risks—yet scalability challenges persist. Configuring FSMs (finite state machines) to manage unique workflows remains labor-intensive, limiting scalability. Recent studies address this through various technical approaches:

    𝟏. 𝐒𝐭𝐚𝐭𝐞𝐅𝐥𝐨𝐰: This framework organizes multi-step tasks by defining each stage of a process as an FSM state, transitioning based on logical rules or model-driven decisions. For instance, in SQL-based benchmarks, StateFlow drives a linear progression through query parsing, optimization, and validation states. This configuration achieved success rates up to 28% higher on benchmarks like InterCode SQL and task-based datasets. Additionally, StateFlow’s structure delivered substantial cost savings—lowering computation by 5x in SQL tasks and 3x in ALFWorld task workflows—by reducing unnecessary iterations within states.

    𝟐. 𝐆𝐮𝐢𝐝𝐞𝐝 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐨𝐧 𝐅𝐫𝐚𝐦𝐞𝐰𝐨𝐫𝐤𝐬: This method constrains LLM output using regular expressions and context-free grammars (CFGs), enabling strict adherence to syntax rules with minimal overhead. By creating a token-level index for the constrained vocabulary, the framework brings token selection to O(1) complexity, allowing rapid selection of context-appropriate outputs while maintaining structural accuracy. For outputs requiring precision, like Python code or JSON, the framework demonstrated high retention of syntax accuracy without a drop in response speed.

    𝟑. 𝐋𝐋𝐌-𝐒𝐀𝐏 (𝐒𝐢𝐭𝐮𝐚𝐭𝐢𝐨𝐧𝐚𝐥 𝐀𝐰𝐚𝐫𝐞𝐧𝐞𝐬𝐬-𝐁𝐚𝐬𝐞𝐝 𝐏𝐥𝐚𝐧𝐧𝐢𝐧𝐠): This framework combines two LLM agents—LLMgen for FSM generation and LLMeval for iterative evaluation—to refine complex, safety-critical planning tasks. Each plan iteration incorporates feedback on situational awareness, allowing LLM-SAP to anticipate possible hazards and adjust plans accordingly. Tested across 24 hazardous scenarios (e.g., child safety scenarios around household hazards), LLM-SAP achieved an RBS score of 1.21, a notable improvement in handling real-world complexities where safety nuances and interaction dynamics are key.

    These studies mark progress, but gaps remain. Manual FSM configurations limit scalability, and real-time performance can lag in high-variance environments. LLM-SAP’s multi-agent cycles demand significant resources, limiting rapid adjustments. Yet the research focus on multi-step reasoning and context responsiveness provides a foundation for scalable LLM-driven architectures—if configuration and resource challenges are resolved.
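    The core StateFlow idea, driving a task through explicit states with rule-based transitions, can be sketched without any model at all. In this sketch the stub handlers stand in for LLM calls, and the state names and transition table are illustrative assumptions, not the paper's actual configuration:

```python
# Hypothetical FSM for an SQL task, in the spirit of StateFlow:
# each state names an action; the transition fires on the action's outcome.
TRANSITIONS = {
    ("parse",    "ok"):    "generate",
    ("parse",    "error"): "parse",      # retry parsing
    ("generate", "ok"):    "validate",
    ("validate", "ok"):    "done",
    ("validate", "error"): "generate",   # regenerate on validation failure
}

def run_workflow(handlers, start="parse", max_steps=10):
    """Drive the task through FSM states until 'done' or the step budget runs out."""
    state, context = start, {}
    for _ in range(max_steps):
        if state == "done":
            return context
        outcome = handlers[state](context)   # e.g. an LLM call or a validator
        state = TRANSITIONS[(state, outcome)]
    raise RuntimeError("workflow did not converge")

# Stub handlers stand in for model calls in this sketch.
handlers = {
    "parse":    lambda ctx: ctx.setdefault("query", "SELECT 1") and "ok",
    "generate": lambda ctx: ctx.setdefault("sql", "SELECT 1;") and "ok",
    "validate": lambda ctx: "ok" if ctx["sql"].endswith(";") else "error",
}
result = run_workflow(handlers)
```

    Bounding each state's retries (the step budget here) is one way to cap the "unnecessary iterations" the paper credits for its cost savings.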

  • View profile for Meagan Rose Gamache

    VP of Product at Render | ex-Figma, ex-Slack

    5,130 followers

    LLMs are non-deterministic, but your application can’t be. As teams move from simple inference calls to multi-agent systems, the tolerance for lost work or brittle background jobs drops to zero. In the old world, dropping an occasional task in a Redis queue was annoying but fine. Today, an interrupted agent sequence can corrupt state, misroute data, or break an entire workflow. This creates two real constraints in today’s infra landscape:

    1. Hard timeouts on compute. Many platforms force work into 15–30 minute boxes. That’s incompatible with agent pipelines that branch, retry, wait on external signals, or run unpredictable chains of reasoning.

    2. Overly opinionated execution models. You can scale fast, but only within a narrow shape. Great for content delivery; limiting for teams building evolving agent systems that need new task types, runtimes, and orchestration patterns over time.

    Render is taking a different approach. We’re making durable, non-lossy execution and flexible, long-running compute first-class primitives. Jobs can run for seconds or indefinitely. And our abstractions stay portable and adaptable so your architecture can evolve without rewriting your foundation. When evaluating AI infra, the question isn’t just “Will this scale?” It’s “Will this keep my workflows correct and let me grow without re-platforming?” If you’re building serious AI systems, you need:

    - Durable execution with clear success/failure semantics
    - Compute without arbitrary time limits
    - Infrastructure that adapts as your agents and workflows evolve

    Queues and time-boxed workers were good enough for web apps. They’re not good enough for AI-native systems.
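    The durable-execution property described above (completed steps survive a crash and are not re-run on replay) can be sketched with a simple checkpoint file. The step names and checkpoint format here are illustrative assumptions, not Render's API:

```python
import json
import os
import tempfile

def run_durably(steps, checkpoint_path):
    """Run named steps in order, checkpointing each result so a crash
    mid-sequence never loses finished work and replays resume cleanly."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)                 # resume from the last checkpoint
    for name, fn in steps:
        if name in done:
            continue                            # already completed; skip on replay
        done[name] = fn()                       # may raise; prior state stays intact
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)                  # persist before moving on
    return done

# Stub steps stand in for an agent's fetch/plan/act sequence.
steps = [
    ("fetch", lambda: "raw data"),
    ("plan",  lambda: "agent plan"),
    ("act",   lambda: "result"),
]
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
state = run_durably(steps, path)
```

    Real durable-execution systems add success/failure semantics, retries, and distributed storage on top of this idea; the point is that progress is persisted outside the worker, so a timeout or crash is a pause, not a loss.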

  • View profile for Amir Elkabir

    AI Transformation Executive | J.P. Morgan | MIT MBA | Author of ‘Lead With AI’ | Scaling AI Strategy to Execution

    4,126 followers

    Every #LLM pilot I’ve worked on has run into hallucinations. And I don’t mean the theoretical “AI makes mistakes.” I mean failures that put pilots at risk with regulators, auditors, and business owners. After building LLMs in the enterprise ecosystem, I’ve dealt with these through blood, sweat, and tears. Mainly tears. But the fact is that LLMs hallucinate, and leadership wants you to deal with it. So what do you do? You can pick up a textbook, but I’m here to disappoint you: the workarounds aren’t covered anywhere I’ve found useful. I’m writing this to share my main challenges and workarounds, in case it helps others sweat less than I have:

    1/ Invented citations are a nasty one, because the LLM will cite where it got its answer and still lie to you. So don’t assume a citation equals a trustworthy output; many people mistakenly do. I once had a pilot where the model confidently pointed users to “section 14B of the audit manual.” Problem: section 14B didn’t exist. In the demo room, everyone laughed it off. In production, risk pulled the plug on the spot. What you’ll need to build is a retrieval validation layer: every reference has to map to a real document ID, and if it doesn’t, strip it before the user sees it. That’s how you turn a hallucination into something auditable.

    2/ The second case is the model producing exposure figures. Bad math, simply put. The numbers look OK and the model sounds convincing, but the figures are off by millions. In another pilot, the model generated exposure figures off by millions, delivered with total confidence. That literally happened. MILLIONS. That’s when I learned (the hard way) that LLMs should never own math. What I’ve found to work well is moving all calculations into downstream systems your business already trusts. Let the model be the interface, not the calculator. It’s the only way risk teams will stay on board.

    3/ The third case is the most dangerous one I’ve seen: the model takes policies and creates a mashup. In a compliance pilot, the model blended two different versions of a policy into a single answer. It looked polished and was completely invalid at the same time. You’ll need to design chunk-level retrieval tied to version control. If the model can’t ground an answer in a single source, it should return “uncertain” instead of making something up. That slows things down, but it’s what gets you through audit.

    None of this shows up in textbooks. This is what you find out when you’re building real systems, with real regulators and auditors waiting for the first slip. So try not to slip. Hallucinations don’t go away. You design around them. 👉 I productionize AI in financial institutions: strategy to adoption, minus the hype.
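    The retrieval validation layer from point 1/ amounts to a post-processing pass over the model's output. In this sketch the `[doc:ID]` citation syntax and the set of known document IDs are assumptions for illustration, not any particular pilot's format:

```python
import re

# IDs of documents that actually exist in the retrieval store (illustrative).
KNOWN_DOC_IDS = {"audit-manual-14a", "policy-2024-v2"}
CITATION = re.compile(r"\[doc:([a-z0-9-]+)\]")

def validate_citations(answer: str):
    """Strip citations that don't map to a real document ID, and report them."""
    hallucinated = [doc for doc in CITATION.findall(answer)
                    if doc not in KNOWN_DOC_IDS]
    cleaned = CITATION.sub(
        lambda m: m.group(0) if m.group(1) in KNOWN_DOC_IDS
        else "[citation removed]",
        answer,
    )
    return cleaned, hallucinated

text = "See [doc:audit-manual-14a] and [doc:audit-manual-14b]."
cleaned, bad = validate_citations(text)
# "audit-manual-14b" does not exist, so its citation is stripped and logged
```

    Logging the stripped citations, rather than silently dropping them, is what makes the failure auditable: risk teams can see exactly which references the model invented.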

  • View profile for Martin Milani

    Pioneering & Transformative #CEO #CTO #BoardMember | #AIThoughtLeadership #DeepTech | Leading Innovation in #Cloud #Edge #AI #Energy & #DigitalTransformation | Driving Strategic Vision & Impact Across Multiple Industries

    15,567 followers

    Everyone thinks agents will magically give intelligence to LLMs. They don’t. They make the problem worse. Autonomy without intelligence is a dangerous combination. A new multi-institution study (attached below; MIT, Harvard, Northeastern, CMU, and others) is proving it. When these systems were tested in real environments, the results were telling. They leaked sensitive information, complied with unauthorized users, and were easily manipulated through simple social pressure. In one case, an agent couldn’t delete an email, so it reset its entire email system instead, while failing to remove the original data. This isn’t just a safety issue; it’s also an architecture issue. We’ve taken systems designed for pattern recognition and given them agency in the real world. That is a disaster waiting to happen. Intelligence doesn’t emerge from language. It requires structure:

    – explicit state and constraints
    – causal understanding and logical reasoning
    – the ability to track assumptions and verify outcomes
    – epistemic grounding

    These systems don’t just act without intelligence; they act without understanding the real-world consequences of their actions. We didn’t fix LLMs with agents. We gave non-intelligent systems the ability to act. Autonomy without causal understanding and logical reasoning is not intelligence. It’s instability and chaos. #AI #DeepTech #AgenticAI #AutonomousSystems #LogicBeforeLanguage
