Multihop Reasoning Using Small Language Models


Summary

Multihop reasoning using small language models is an approach where compact AI systems solve complex, multi-step problems by breaking them down into manageable parts and connecting information across multiple sources or logical steps. This method allows smaller models to perform sophisticated reasoning tasks that previously required much larger AI systems.

  • Divide complex questions: Break down challenging queries into smaller, connected steps so small language models can resolve each part and build towards a comprehensive answer.
  • Use structured data: Rely on knowledge graphs or external memory to help models trace relationships and ground their reasoning, reducing errors and improving reliability.
  • Reward intermediate progress: Encourage models to reflect and improve on each step in a reasoning chain rather than focusing only on the final answer, leading to more robust problem-solving.
Summarized by AI based on LinkedIn member posts
  • Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    Researchers from Oxford University just achieved a 14% performance boost in mathematical reasoning by making LLMs work together like specialists in a company. In their new MALT (Multi-Agent LLM Training) paper, they introduced a novel approach where three specialized LLMs - a generator, verifier, and refinement model - collaborate to solve complex problems, similar to how a programmer, tester, and supervisor work together.

    The breakthrough lies in their training method:
    (1) Tree-based exploration - generating thousands of reasoning trajectories by having models interact
    (2) Credit attribution - identifying which model is responsible for successes or failures
    (3) Specialized training - using both correct and incorrect examples to train each model for its specific role

    Using this approach on 8B parameter models, MALT achieved relative improvements of 14% on the MATH dataset, 9% on CommonsenseQA, and 7% on GSM8K. This represents a significant step toward more efficient and capable AI systems, showing that well-coordinated smaller models can match the performance of much larger ones.

    Paper https://lnkd.in/g6ag9rP4

    — Join thousands of world-class researchers and engineers from Google, Stanford, OpenAI, and Meta staying ahead on AI http://aitidbits.ai
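The generator → verifier → refinement loop described above can be sketched in a few lines. Everything below is a stand-in, not the paper's method: the three "models" are plain functions, and the generator simulates a flawed reasoner by evaluating arithmetic strictly left-to-right instead of respecting operator precedence.

```python
# Toy generator/verifier/refiner loop in the spirit of MALT.
# All three "models" are stubs standing in for fine-tuned LLMs.

def generate(question: str) -> float:
    # Draft answer from an error-prone "generator": evaluates
    # left-to-right, ignoring operator precedence.
    tokens = question.replace("+", " + ").replace("*", " * ").split()
    acc = float(tokens[0])
    for op, val in zip(tokens[1::2], tokens[2::2]):
        acc = acc + float(val) if op == "+" else acc * float(val)
    return acc

def verify(question: str, answer: float) -> bool:
    # "Verifier" independently re-derives the result.
    return float(eval(question)) == answer

def refine(question: str, bad_answer: float) -> float:
    # "Refinement model" repairs a rejected draft.
    return float(eval(question))

def malt_pipeline(question: str) -> float:
    draft = generate(question)
    return draft if verify(question, draft) else refine(question, draft)
```

The point of the structure is that the verifier and refiner only act when the draft fails, which is also where MALT's credit attribution assigns blame to the responsible role.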

  • Vignesh Kumar

    AI Product & Engineering | Start-up Mentor & Advisor | TEDx & Keynote Speaker | LinkedIn Top Voice ’24 | Building AI Community Pair.AI | Director - Orange Business, Cisco, VMware | Cloud - SaaS & IaaS | kumarvignesh.com

    🚀 Why RAG alone won’t get us there—and how Agentic RAG helps

    I've used RAG systems in multiple products—especially in knowledge-heavy contexts. They help LLMs stay grounded by retrieving supporting documents. But there’s a point where they stop being useful.

    Let me give you a simple example. Let’s say you ask:
    👉 “Which medical researchers have published on long COVID, what clinical trials they were part of, and what other conditions those trials studied?”

    A classical RAG system would:
    1️⃣ Look for text chunks that match “long COVID”
    2️⃣ Return some papers or abstracts
    3️⃣ And leave the LLM to guess or hallucinate the rest

    And here is the problem: you're not just looking for one passage. You're asking for a chain of connected facts:
    🔹 Authors → 🔹 Publications → 🔹 Clinical trials → 🔹 Other conditions

    RAG systems were never built to follow that trail. They do top-k lookup and feed static chunks to the LLM. No planning. No reasoning. No ability to explore relationships between entities.

    That’s where Agentic RAG with Knowledge Graphs comes in. Instead of dumping search results, the system:
    ✅ Breaks the question into steps
    ✅ Uses structured data to navigate relationships (e.g., author–trial–condition)
    ✅ Assembles the answer using small, verifiable hops
    ✅ Uses tools for hybrid search, graph queries, and concept mapping

    You can think of it like this: classical RAG is like searching through a pile of papers with a highlighter, while Agentic RAG is like giving the job to a smart analyst who understands the question, walks through your research database, and explains how each part connects.

    I am attaching a paper I read recently that demonstrated this well—they used a mix of Neo4j for knowledge graphs, vector stores for retrieval, and a lightweight LLM to orchestrate the steps. The key wasn’t the model size—it was the structure and reasoning behind it.

    I believe that this approach is far more suitable for domains where:
    💠 Information lives across connected sources
    💠 You need traceability
    💠 And you can’t afford vague or partial answers

    I see this as a practical next step for research, healthcare, compliance, and enterprise decision-support.

    #AI #LLM #AgenticRAG #KnowledgeGraph #productthinking #structureddata
    I write about #artificialintelligence | #technology | #startups | #mentoring | #leadership | #financialindependence
    PS: All views are personal
    Vignesh Kumar
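The author–publication–trial–condition chain above is exactly a sequence of graph hops. A minimal sketch, using an invented in-memory triple store in place of Neo4j (the entities, relations, and data are all illustrative):

```python
# Toy triple store: (subject, relation, object). A production system
# would hold these in a graph database such as Neo4j.
TRIPLES = [
    ("dr_lee",  "authored",  "paper_1"),
    ("paper_1", "topic",     "long_covid"),
    ("dr_lee",  "ran_trial", "trial_9"),
    ("trial_9", "studied",   "fatigue"),
    ("trial_9", "studied",   "long_covid"),
]

def objects(subject, relation):
    return [o for s, r, o in TRIPLES if s == subject and r == relation]

def subjects(relation, obj):
    return [s for s, r, o in TRIPLES if r == relation and o == obj]

def long_covid_chain():
    # Hop 1: papers about long COVID -> their authors
    authors = {a for p in subjects("topic", "long_covid")
                 for a in subjects("authored", p)}
    # Hop 2: authors -> trials they ran
    trials = {t for a in authors for t in objects(a, "ran_trial")}
    # Hop 3: trials -> other conditions those trials studied
    conditions = {c for t in trials
                    for c in objects(t, "studied")} - {"long_covid"}
    return authors, trials, conditions
```

Each hop is small and independently verifiable, which is the traceability property the post argues for: the system can show exactly which edge justified each step of the answer.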

  • Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    Exciting Breakthrough in AI Reasoning: MCTS-RAG Combines Tree Search with Retrieval

    I just came across a groundbreaking research paper that introduces MCTS-RAG, a novel approach that significantly enhances the reasoning capabilities of small language models on knowledge-intensive tasks. What makes MCTS-RAG special is how it dynamically integrates retrieval and reasoning through an iterative decision-making process. Unlike standard RAG methods that retrieve information independently from reasoning, or conventional MCTS reasoning that relies solely on internal model knowledge, MCTS-RAG combines structured reasoning with adaptive retrieval.

    >> How MCTS-RAG Works Under the Hood:

    The system employs a sophisticated action space with six distinct operations:
    - Direct Answer: Provides immediate responses for straightforward queries
    - Quick Reasoning: Executes rapid reasoning steps based on current context
    - Decompose Question: Breaks complex queries into manageable sub-questions
    - Retrieval Reasoning: Actively retrieves relevant knowledge before proceeding
    - Retrieval Decompose: Integrates decomposition with retrieval for complex problems
    - Summarized Answer: Synthesizes results from previous reasoning and retrieved information

    The retrieval process itself is incredibly nuanced, with four key steps:
    1. Query Generation: Detecting knowledge gaps and generating targeted search queries
    2. Query Execution: Using external retrieval tools to obtain relevant information
    3. Knowledge Reflection: Evaluating retrieved data for relevance and consistency
    4. Summary Answer: Integrating refined information to advance reasoning

    The researchers from Yale University and New York University demonstrated impressive results across multiple benchmarks, including ComplexWebQA, GPQA, and FoolMeTwice. Their approach enabled small-scale language models (7-8B parameters) to achieve performance comparable to frontier LLMs like GPT-4o.
Most impressively, MCTS-RAG achieved over 20% improvement with Llama 3.1-8B and 6% with Qwen2.5-7B on complex reasoning tasks, outperforming other methods like Standard RAG, ReAct, Self-Ask, and Search-O1. This research represents a significant step forward in making smaller language models more capable through intelligent reasoning and knowledge integration. The code is available on GitHub for those interested in exploring this approach further.
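One way a tree search can choose among the six operations listed above is a bandit-style selection rule at each node. The sketch below uses UCB1 over that action space with simulated visit/value statistics; the paper's actual selection policy and value estimates differ, so treat this purely as an illustration of balancing exploration and exploitation.

```python
import math

# The six MCTS-RAG operations form the action space at each tree node.
ACTIONS = ["direct_answer", "quick_reasoning", "decompose_question",
           "retrieval_reasoning", "retrieval_decompose", "summarized_answer"]

def ucb1(values, visits, c=1.4):
    """Pick the next action by UCB1: mean value plus an exploration bonus.

    values[a] = total reward accumulated by action a
    visits[a] = number of times a was tried
    """
    total = sum(visits.values())
    def score(a):
        if visits[a] == 0:
            return float("inf")   # always try unvisited actions first
        return values[a] / visits[a] + c * math.sqrt(math.log(total) / visits[a])
    return max(ACTIONS, key=score)
```

With equal visit counts the rule reduces to picking the action with the best average rollout value; early in the search the bonus term forces every operation, including the retrieval actions, to be explored at least once.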

  • Anthony Alcaraz

    GTM Agentic Engineering @AWS | Author of Agentic Graph RAG (O’Reilly) | Business Angel

    Agentic systems don't just benefit from Small Language Models. They architecturally require them, paired with knowledge graphs. Here's the technical reality most teams miss.

    🎯 The Workload Mismatch
    Agents execute 60-80% repetitive tasks: intent classification, parameter extraction, tool coordination. These need <100ms latency at millions of daily requests. Physics doesn't negotiate. Model size determines speed. But agents still need complex reasoning capability.

    🧠 The Graph Solution
    The breakthrough: separate knowledge storage from reasoning capability. LLMs store facts in parameters. Inefficient. Graph-augmented SLMs externalize knowledge to structured triples (entity-relationship-entity) and use 3-7B parameters purely for reasoning.
    Knowledge Graph of Thoughts: the same SLM solves 2x more tasks when querying graphs vs. processing raw text. Cost drops from $187 to $5 per task. Multi-hop reasoning becomes graph traversal, not token generation. Token consumption drops 18-30%. Hallucination reduces through fact grounding.

    💰 The Economics
    At 1B requests/year:
    - GPT-5 approach: $190K+
    - 7B SLM + graph infrastructure: $1.5-19K
    One production system: $13M annual savings, 80%→94% coverage by caching knowledge as graph operations.

    ⚡ The Threshold
    Below 3B parameters: models can't formulate effective graph queries.
    Above 3B: models excel at coordinating retrieval and synthesis over structured knowledge.
    Modern 7B models (Qwen2.5, DeepSeek-R1-Distill, Phi-3) now outperform 30-70B models from 2023 on graph-based reasoning benchmarks.

    🏗️ The Correct Architecture
    Production agents converge on this pattern:
    Query → Classifier SLM → Graph construction/update → Specialist SLMs query graph → Multi-hop traversal → Response synthesis → (5% escalate to LLM)
    The graph provides:
    - External memory across reasoning steps
    - Fact grounding to prevent hallucination
    - Reasoning scaffold for complex inference

    🔐 Why This Matters
    - Edge deployment: 5GB graph + 7B model runs locally on laptops
    - Privacy: medical/financial data never leaves premises
    - Latency: graph queries are deterministic <50ms operations
    - Updates: modify graph triples without model retraining
    Real case: a clinical diagnostic agent on a physician's laptop. Patient symptoms → graph traversal → diagnosis in 80ms. Zero external transmission.

    🎓 The Separation of Concerns
    Graphs handle: relationship queries, continuous updates, auditability.
    SLMs handle: query formulation, reasoning coordination, synthesis.
    LLMs conflate both functions in one monolith. This drives their size and cost.

    Agent tasks follow this pattern: understand intent → retrieve structured knowledge → reason over relationships → execute action → update knowledge state. Graphs make each step explicit. SLMs provide coordination intelligence. Together, they outperform larger models at 10-36x lower cost.

    Are you still processing agent tasks with 70B+ models on raw text, or have you separated knowledge (graphs) from reasoning (SLMs)?
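The "classify, query the graph, escalate only when needed" pattern can be shown with a few stubs. Everything here is invented for illustration: the classifier is a keyword check standing in for an SLM, the graph is a dict of triples, and the escalation path is a placeholder string rather than a real LLM call.

```python
# Toy routing pipeline: cheap intent classification, deterministic
# graph lookup, escalation to a large model only as a fallback.

# Invented knowledge graph keyed by (entity, relation).
GRAPH = {("aspirin", "interacts_with"): ["warfarin"],
         ("warfarin", "class"): ["anticoagulant"]}

def classify_intent(query):
    # Stand-in for the sub-100ms classifier SLM.
    return "relation_lookup" if "interact" in query else "open_ended"

def answer(query, entity, relation):
    if classify_intent(query) == "relation_lookup":
        hop1 = GRAPH.get((entity, relation), [])
        # Second hop: follow each result along a "class" edge.
        hop2 = {x: GRAPH.get((x, "class"), []) for x in hop1}
        return {"grounded": True, "result": hop2}
    # Open-ended queries fall through to the expensive model.
    return {"grounded": False, "result": "escalate_to_llm"}
```

The grounded path never generates facts; it only reads edges, which is why latency is deterministic and the answer is auditable hop by hop.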

  • Sohrab Rahimi

    Director, AI/ML Lead @ Google

    One of the biggest barriers to deploying LLM-based agents in real workflows is their poor performance on long-horizon reasoning. Agents often generate coherent short responses but struggle when a task requires planning, tool use, or multi-step decision-making. The issue is not just accuracy at the end, but the inability to reason through the middle. Without knowing which intermediate steps helped or hurt, agents cannot learn to improve. This makes long-horizon reasoning one of the hardest and most unsolved problems for LLM generalization. It is relatively easy for a model to retrieve a document, answer a factual question, or summarize a short email. It is much harder to solve a billing dispute that requires searching, interpreting policy rules, applying edge cases, and adjusting the recommendation based on prior steps. Today’s agents can generate answers, but they often fail to reflect, backtrack, or reconsider earlier assumptions. A new paper from Google DeepMind and Stanford addresses this gap with a method called SWiRL: Step-Wise Reinforcement Learning. Rather than training a model to get the final answer right, SWiRL trains the model to improve each step in a reasoning chain. It does this by generating synthetic multi-step problem-solving traces, scoring every individual step using a reward model (Gemini 1.5 Pro), and fine-tuning the base model to favor higher-quality intermediate steps. This approach fundamentally changes the way we train reasoning agents. Instead of optimizing for final outcomes, the model is updated based on how good each reasoning step was in context. For example, if the model generates a search query or a math step that is useful, even if the final answer is wrong, that step is rewarded and reinforced. Over time, the agent learns not just to answer, but to reason more reliably. This is a major departure from standard RLHF, which only gives feedback at the end. 
SWiRL improves performance by 9.2 percent on HotPotQA, 16.9 percent on GSM8K when trained on HotPotQA, and 11 to 15 percent on other multi-hop and math datasets like MuSiQue, BeerQA, and CofCA. It generalizes across domains, works without golden labels, and outperforms both supervised fine-tuning and single-step RL methods. The implications are substantial: we can now train models to reason better by scoring and optimizing their intermediate steps. Better reward models, iterative reflection, tool-assisted reasoning, and trajectory-level training will lead to more robust performance in multi-step tasks. This is not about mere performance improvement. It shows how we can begin to train agents not to mimic outputs, but to improve the quality of their thought process. That’s essential if we want to build agents that work through problems, adapt to new tasks, and operate autonomously in open-ended environments.
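The core mechanic, scoring each step independently and keeping good steps even from failed traces, can be sketched as follows. The reward model here is a crude keyword heuristic standing in for the Gemini 1.5 Pro scorer used in the paper; the trace data is invented.

```python
# Toy SWiRL-style step scoring: every intermediate step earns its own
# reward, and high-scoring steps are kept as training signal even when
# the trace's final answer was wrong.

def step_reward(step):
    # Stub reward model: favors steps that state supporting evidence.
    # (In SWiRL this judgment comes from a learned reward model.)
    return 1.0 if "because" in step else 0.2

def select_training_steps(trace, threshold=0.5):
    scored = [(s, step_reward(s)) for s in trace["steps"]]
    # Note: trace["final_correct"] is deliberately ignored here.
    return [s for s, r in scored if r >= threshold]

trace = {"steps": ["search for population of France",
                   "answer is 68M because INSEE reports it"],
         "final_correct": False}
```

Even though `final_correct` is False, the evidence-bearing step survives filtering, which is exactly the departure from outcome-only RLHF the post describes.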

  • Philipp Schmid

    AI Developer Experience at Google DeepMind 🔵 prev: Tech Lead at Hugging Face, AWS ML Hero 🤗 Sharing my own views and AI News

    How can we use tools during reasoning without multiple turns? Reinforcement Learning! The new ReSearch paper explores how we can train LLMs via RL to integrate search operations dynamically into their reasoning process. This significantly improves multi-hop question answering and shows strong generalization without needing supervised reasoning-step data.

    Implementation
    1️⃣ Define a reasoning+search format using special tags (e.g., <think>, <search>, <result>).
    2️⃣ Choose a base or instruction-tuned LLM (like Qwen2.5) as the policy model.
    3️⃣ Set up a Reinforcement Learning environment, e.g. with GRPO.
    4️⃣ Integrate an external search tool/retriever (like Google search or Wikipedia).
    5️⃣ Create prompt templates for the LLM to follow the format.
    6️⃣ Define a rule-based reward function on the final answer's correctness and format.
    7️⃣ Train the LLM policy using RL with GRPO, but mask retrieved search results during loss calculation.

    Insights
    💪 GRPO trained the LLMs to integrate search into multi-step reasoning.
    ✨ Didn’t require supervised data for intermediate reasoning or search steps.
    🏆 ReSearch outperforms naive RAG and advanced iterative RAG baselines (Iter-RetGen, IRCoT).
    🌍 Models trained solely on the MuSiQue dataset demonstrate strong generalization to other multi-hop QA benchmarks.
    🚀 Cold start from instruction-tuned models yielded better performance than from base models.
    🔗 Search operations (<search>, <result>) are treated as integral parts of the generated reasoning chain (<think>).
    🎯 Used a rule-based reward function based on final-answer F1 score and format correctness.
    🎭 Tokens within search results (<result>...</result>) are masked in the loss calculation.
    🤔 Masking search results focuses the model on learning when and what to search, not on predicting retrieved content.
    📜 Prompt templates guide the LLM to generate the required format (think/search/result/answer).
    📈 Models learn to use more <search> operations throughout training.
Paper: https://lnkd.in/ejJGPsaU Models: https://lnkd.in/ekvpWrEX
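The loss-masking step (7️⃣ above) is easy to illustrate: tokens inside `<result>...</result>` are retrieved text, so they get mask 0 and contribute nothing to the policy loss. In this sketch a "token" is just a whitespace-split word, and the `<result>` delimiters themselves are also masked; whether to train on the delimiters is a design choice not specified here.

```python
# Build a 0/1 loss mask over a ReSearch-style trajectory: 1 = trained on,
# 0 = retrieved content excluded from the loss.

def loss_mask(text):
    masks, in_result = [], False
    for tok in text.split():
        if tok == "<result>":
            in_result = True
            masks.append(0)
        elif tok == "</result>":
            in_result = False
            masks.append(0)
        else:
            masks.append(0 if in_result else 1)
    return masks
```

The model is still rewarded for deciding *when* to emit `<search>` and *what* query to write, but never for reproducing the retriever's output, which is the intuition behind the 🤔 bullet above.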

  • Pascal Biese

    AI Lead at PwC </> Daily AI highlights for 80k+ experts 📲🤗

    What if this is the next big step for LLMs? A new inference technique from Massachusetts Institute of Technology called Recursive Language Models (RLMs) is rethinking context windows entirely.

    The core problem is well-known: even frontier models like GPT-5 suffer from "context rot" - performance degrades quickly as prompts get longer, regardless of the technical context limit. Summarization helps but loses critical details. Retrieval misses complex reasoning patterns. Have we been trying to make models see more, when perhaps the answer is to make them see differently?

    RLMs treat the prompt not as direct neural network input, but as an external object the model interacts with programmatically. The prompt is loaded as a variable in a Python REPL environment, and the LLM writes code to peek into it, decompose it, and recursively call itself over smaller snippets. Same interface as a regular LLM, radically different execution.

    On information-dense tasks where GPT-5 scores below 0.1%, RLMs achieve 58%. On multi-hop research questions spanning 6-11M tokens, RLMs hit 91% accuracy while costing less than feeding the full context would. Crucially, performance degrades far more gracefully as complexity scales - the approach handles inputs two orders of magnitude beyond native context windows.

    This suggests that scaling context is not just an architecture problem but also an inference problem. RLMs demonstrate that letting models reason about their input symbolically rather than processing it neurally could be a promising new direction. If this approach generalizes, we may be looking at a new axis for scaling language model capabilities entirely.

    ↓ Want to keep up? Join my newsletter with 50k+ readers and be the first to learn about the latest AI research: llmwatch.com 💡
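The recursive pattern itself is simple to sketch: treat the context as data, split it, recurse on the pieces, and combine. Below, the "model call" is a stub that counts occurrences of a needle string; in a real RLM the model writes this decomposition code itself inside the REPL, so both the split strategy and the sub-task are illustrative assumptions.

```python
# Toy recursive-decomposition pattern: split a long context into chunks
# small enough to "fit", recurse, and aggregate the sub-answers.

def model_call(chunk, needle):
    # Stand-in for an LLM sub-call on a context that fits in its window.
    return chunk.count(needle)

def rlm(context, needle, max_len=50):
    if len(context) <= max_len:
        return model_call(context, needle)
    # Split on sentence boundaries so a match never spans a cut.
    sentences = context.split(". ")
    mid = len(sentences) // 2
    left = ". ".join(sentences[:mid])
    right = ". ".join(sentences[mid:])
    return rlm(left, needle, max_len) + rlm(right, needle, max_len)
```

Because aggregation here is a sum, the recursion handles contexts far larger than `max_len` with cost linear in the input, which mirrors the graceful-degradation claim above in miniature.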

  • Wlodzislaw Duch

    Head, Neurocognitive Laboratory, Center for Modern Interdisciplinary Technologies, Nicolaus Copernicus University.

    Do we really need nuclear power plants and supercomputers to build good AI systems? Neuro-inspired approaches are leading to new solutions. A group from Singapore has just shown how a tiny model (27 million parameters) trained on a small dataset (1,000 examples) handles hard problems better than the largest and most expensive models. This can be done even on a notebook.

    Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y., Lu, M., Song, S., Yadkori, Y. A. (2025). Hierarchical Reasoning Model. https://lnkd.in/d9c5j74U

    Abstract: Reasoning, the process of devising and executing complex goal-oriented action sequences, remains a critical challenge in AI. Current large language models (LLMs) primarily employ Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive data requirements, and high latency. Inspired by the hierarchical and multi-timescale processing in the human brain, we propose the Hierarchical Reasoning Model (HRM), a novel recurrent architecture that attains significant computational depth while maintaining both training stability and efficiency. HRM executes sequential reasoning tasks in a single forward pass without explicit supervision of the intermediate process, through two interdependent recurrent modules: a high-level module responsible for slow, abstract planning, and a low-level module handling rapid, detailed computations. With only 27 million parameters, HRM achieves exceptional performance on complex reasoning tasks using only 1000 training samples. The model operates without pre-training or CoT data, yet achieves nearly perfect performance on challenging tasks including complex Sudoku puzzles and optimal path finding in large mazes. Furthermore, HRM outperforms much larger models with significantly longer context windows on the Abstraction and Reasoning Corpus (ARC), a key benchmark for measuring artificial general intelligence capabilities.
These results underscore HRM’s potential as a transformative advancement toward universal computation and general-purpose reasoning systems.   Summary: Hierarchical Reasoning Model (HRM): A new way for AI to think https://lnkd.in/dMsQ-6Wu https://lnkd.in/dy3irpum
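The two-timescale control flow in the abstract (a slow high-level planner, a fast low-level worker) can be shown schematically. The real HRM modules are learned recurrent networks; the arithmetic stubs below only illustrate the nesting of updates, with all constants invented.

```python
# Schematic of HRM's coupled recurrences: the high-level state updates
# once per outer cycle, the low-level state iterates several fast steps
# per cycle while reading the current high-level "plan".

def hrm_forward(x, outer_steps=3, inner_steps=4):
    z_high, z_low = 0.0, 0.0
    for _ in range(outer_steps):              # slow, abstract planning
        for _ in range(inner_steps):          # fast, detailed computation
            z_low = 0.5 * z_low + z_high + x  # low level reads the plan
        z_high = z_high + z_low               # plan absorbs the result
        z_low = 0.0                           # low level reset each cycle
    return z_high
```

The useful observation is structural: with 3 outer and 4 inner steps the model performs 12 low-level updates in one forward pass, achieving depth without a 12-layer stack, which is the "computational depth" claim of the abstract in caricature.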

  • Ravena O

    AI Researcher and Data Leader | Healthcare Data | GenAI | Driving Business Growth | Data Science Consultant | Data Strategy

    Google is betting big on small LLMs — and they just made a clever breakthrough. 💡

    Training compact models (like 7B) on tough reasoning tasks has always been a balancing act:
    • SFT (Supervised Fine-Tuning) tends to overfit — models just mimic examples without truly “thinking.”
    • RL (Reinforcement Learning) often collapses — the model keeps guessing without ever landing on the right logic.

    So when tasks demand multi-step reasoning — like complex math or coding — even one wrong move derails the whole chain.

    Enter Google’s new method: Supervised Reinforcement Learning (SRL). Instead of treating reasoning as one big prediction, SRL breaks it down into logical steps. At each step, the model gets dense rewards for how closely its reasoning matches expert behavior — not just for the final answer. That means even when outputs are “wrong,” the model still learns why.

    The payoff?
    • +3% over RLVR on challenging math benchmarks (with just 1,000 examples!)
    • +3.7% when combined with RLVR
    • 74% relative boost on SWE-Bench coding tasks

    In short: SRL teaches smaller models to think in steps, not guesses — making them smarter, leaner, and far more capable than before. Small models, big brains. 🧠✨
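A toy version of the dense-reward idea: each generated step is scored against the expert's step at the same position, so a trajectory earns signal even when its final answer is wrong. Token-set Jaccard similarity below is a deliberately crude stand-in for however SRL actually measures step-level agreement.

```python
# Dense per-step rewards by comparing each model step with the
# corresponding expert step (toy similarity: Jaccard over tokens).

def similarity(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def dense_rewards(model_steps, expert_steps):
    # One reward per aligned step, independent of final-answer correctness.
    return [similarity(m, e) for m, e in zip(model_steps, expert_steps)]
```

Contrast this with outcome-only RL, which would assign this whole trace a single scalar; here the first (correct) step is rewarded even though the second one diverges.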

  • Fan Li

    R&D AI & Digital Consultant | Chemistry & Materials

    While large foundation models grow more powerful and dominate benchmarks, small specialized language models are quietly finding their place in industries. In industrial settings, practical constraints often favor smaller, locally deployed models: proprietary data must remain behind corporate firewalls, latency must be tightly controlled for manufacturing lines, and operating costs must stay manageable at scale.

    One strategy is to transfer capabilities from large models into smaller ones, as exemplified by #MolReasoner (https://lnkd.in/eUY7DDF9): a powerful teacher model generates chain-of-thought reasoning for molecule design tasks, which is then distilled into a smaller model through supervised fine-tuning and reinforcement learning with chemistry-specific rewards.

    A recent preprint introduces #Logos, a compact reasoning model for rational molecular design that follows the same three-step structure:

    🔹Step 1 Reasoning distillation: A larger teacher model generates chain-of-thought explanations linking molecular descriptions to structural decisions (e.g., scaffold identification, functional group placement, property-driven modifications). These reasoning traces are paired with the final SMILES structures to create a reasoning dataset.

    🔹Step 2 Supervised fine-tuning: The smaller model is trained on these samples to reproduce the full output format, including a reasoning block followed by a structured output. This teaches the model to translate natural-language descriptions into molecular structures while exposing the intermediate design logic.

    🔹Step 3 Reinforcement learning: The model is further optimized by generating multiple candidate molecules for each prompt and scoring them with chemistry-aware rewards. Candidates that satisfy chemical validity checks, match the ground-truth molecule in the benchmark, or show high fingerprint similarity receive higher rewards, gradually steering the model toward chemically valid and structurally accurate designs.
On the standard ChEBI-20 benchmark, the 4B version sets a new high on exact match and structural accuracy, outperforming both general-purpose LLMs and prior specialist models. It also achieves near-perfect chemical validity on SMILES. The visible reasoning traces support the kind of human-in-the-loop iteration that industrial applications often require. How do you view the balance between large foundation models and smaller domain-specialized models? Where should each be used in different industrial settings? 📄 Logos: An evolvable reasoning engine for rational molecular design, arXiv, March 10, 2026 🔗 https://lnkd.in/ecnujTCj
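The shape of the Step 3 reward (validity gate, exact-match bonus, similarity credit) can be sketched without any cheminformatics stack. Everything below is a crude stand-in: the "validity check" only balances parentheses rather than parsing SMILES, the "fingerprint" is character bigrams, and the weights are invented; a real implementation would use a toolkit such as RDKit for both.

```python
# Toy chemistry-aware reward in the shape described in Step 3:
# invalid candidates get 0, exact matches get 1, everything else gets
# partial credit proportional to fingerprint similarity.

def is_valid_smiles(s):
    # Stub validity check: non-empty with balanced parentheses.
    # NOT a real SMILES parser.
    depth = 0
    for ch in s:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0 and bool(s)

def fingerprint(s):
    # Character-bigram "fingerprint"; a real one would be e.g. Morgan bits.
    return {s[i:i + 2] for i in range(len(s) - 1)}

def tanimoto(a, b):
    fa, fb = fingerprint(a), fingerprint(b)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

def reward(candidate, target):
    if not is_valid_smiles(candidate):
        return 0.0
    if candidate == target:
        return 1.0
    return 0.1 + 0.5 * tanimoto(candidate, target)
```

The gating structure matters more than the numbers: invalid strings get no gradient signal at all, while near-misses still get graded credit, which is what "gradually steering the model toward chemically valid designs" means operationally.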
