AI systems in medical devices can produce errors that look plausible but are factually incorrect. The FDA now calls these “hallucinations,” and they’re not always harmless.
1️⃣ Hallucinations are defined as plausible but incorrect outputs, split into two types: impactful (clinically harmful) and benign (harmless in practice)
2️⃣ An example of an impactful hallucination: an AI adds a fake tracheoesophageal fistula to a reconstructed image, potentially altering diagnosis and treatment
3️⃣ Unlike conventional artifacts (such as aliasing), hallucinations can be subtle and hard to detect, and they are especially dangerous when they mimic clinical reality
4️⃣ Hallucinations have been observed across medical AI applications: image reconstruction, synthetic data generation, and LLMs for documentation or decision support
5️⃣ AI-based imaging methods (such as DL reconstructions) may add anatomically plausible but false features, which clinicians might trust because of the enhanced visual quality
6️⃣ In synthetic data or domain-transfer tasks, hallucinations arise when the model generates content not grounded in the input or the training distribution
7️⃣ Language models may insert incorrect but fluent statements, risking errors in summaries, radiology reports, or even patient management plans
8️⃣ Detection is tricky: plausibility is subjective and observer-specific, and what fools a patient might not fool a radiologist, or vice versa
9️⃣ Mitigation methods include enforcing data fidelity in image models, prompt engineering and retrieval-based methods for LLMs, and using uncertainty estimates or multi-model consensus
🔟 The FDA proposes evaluating hallucinations based on plausibility and impact, with task-specific thresholds and multi-reader studies to capture variability in clinical perception
✍🏻 Jason Granstedt, Prabhat KC, Rucha Deshpande, Victor Garcia, Aldo Badano. Hallucinations in medical devices. Artificial Intelligence in the Life Sciences. 2025. DOI: 10.1016/j.ailsci.2025.100145
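For the mitigation ideas in point 9, here is a minimal Python sketch of the multi-model consensus approach: the same question goes to several independent models (or repeated stochastic runs of one model), and disagreement is treated as a hallucination-risk flag for human review. The function names, model interface, and agreement threshold are illustrative assumptions, not part of the FDA paper.

```python
# Minimal sketch of multi-model consensus as a hallucination-risk signal:
# if independent models (or repeated runs) disagree, treat the output as
# low-confidence instead of reporting it outright.
# query_model and the 0.8 threshold are hypothetical placeholders.
from collections import Counter


def query_model(model_name: str, prompt: str) -> str:
    """Placeholder: call the given model and return its answer text."""
    raise NotImplementedError


def consensus_answer(prompt: str, models: list[str], min_agreement: float = 0.8) -> dict:
    answers = [query_model(m, prompt) for m in models]
    top_answer, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    if agreement < min_agreement:
        # Disagreement is only a flag for human review, not proof of error.
        return {"answer": None, "flag": "low consensus, route to clinician", "raw": answers}
    return {"answer": top_answer, "agreement": agreement}
```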
Causes of AI Hallucinations in Healthcare
Summary
AI hallucinations in healthcare refer to instances where artificial intelligence systems generate information that seems credible but is actually false or misleading, which can pose risks to patient care and medical decision-making. These errors arise from several factors, including gaps in medical knowledge, misinterpretation of data, and the influence of persuasive language.
- Prioritize fact checking: Always verify AI-generated medical claims and recommendations with trusted sources before using them in patient care.
- Monitor presentation styles: Be aware that authoritative language or clinical framing can make misinformation appear true, so evaluate content critically regardless of how it’s presented.
- Update and review: Regularly update AI systems with current medical research and maintain human oversight to catch subtle errors and context mismatches.
-
Medical Large Vision-Language Models (Med-LVLMs) hold great promise for healthcare, but how reliable are they? Our latest study introduces MedHEval, a benchmark designed to evaluate hallucinations and mitigation strategies in Med-LVLMs.
🔍 Key Findings:
✅ We defined three types of hallucinations that Med-LVLMs may exhibit:
1️⃣ Visual Misinterpretation – The model misreads or misidentifies visual elements in medical images, leading to incorrect conclusions. (Example: Mistaking a benign cyst for a malignant tumor on an MRI scan.)
2️⃣ Knowledge Deficiency – The model lacks the necessary medical expertise, causing it to generate factually incorrect responses. (Example: Stating that a condition is asthma when the correct diagnosis is lung cancer.)
3️⃣ Context Misalignment – The model produces responses that don’t fit the clinical context, even when interpreting images correctly. (Example: Suggesting an adult treatment protocol for a pediatric case.)
✅ Evaluating 11 popular Med-LVLMs and 7 mitigation strategies revealed significant challenges in reducing hallucinations, especially those rooted in knowledge and context issues.
✅ We constructed a diverse set of close-ended and open-ended medical VQA datasets to support the benchmark.
✅ Existing techniques are not yet sufficient; better alignment training and mitigation strategies are urgently needed.
🚀 MedHEval provides a standardized framework to guide the development of more trustworthy medical AI.
Check out the benchmark and our findings:
🔗 [Code] https://lnkd.in/gqcmUE6k
🔗 [Preprint] https://lnkd.in/gQ_k7gen
Great work led by Lena H. towards enhancing reliability and safety in medical AI!
#AI #MedicalAI #LLMs #MedHEval #HealthcareInnovation
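As a rough illustration of how a close-ended medical VQA benchmark of this kind could be scored per hallucination category, here is a small Python sketch. The record fields and category labels are assumptions for illustration; refer to the linked MedHEval code for the actual data format and metrics.

```python
# Sketch: accuracy of a Med-LVLM on close-ended (yes/no style) questions,
# broken out by hallucination category (visual / knowledge / context).
from collections import defaultdict


def score_closed_ended(records, predict):
    """records: iterable of dicts with 'question', 'image', 'answer', 'category'.
    predict: callable(question, image) -> short answer string."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        pred = predict(r["question"], r["image"]).strip().lower()
        total[r["category"]] += 1
        correct[r["category"]] += int(pred == r["answer"].strip().lower())
    # Per-category accuracy, e.g. {'visual': ..., 'knowledge': ..., 'context': ...}
    return {cat: correct[cat] / total[cat] for cat in total}
```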
-
Tell an AI that "a senior doctor recommends this" and it accepts misinformation 35% of the time. No verification. Just confident framing.
New Lancet Digital Health study tested 20 LLMs with 3.4 million prompts containing fabricated medical claims. Researchers embedded false recommendations in hospital discharge notes (MIMIC), Reddit health posts, and clinical vignettes. Models had to accept or reject the misinformation.
The critical finding: presentation style beats verification. The same false claim was accepted at vastly different rates depending on how it was written:
• MIMIC discharge notes (clinical prose): 46.1% susceptibility
• Reddit posts: 8.9%
• Simulated clinical vignettes: 5.1%
Wrap misinformation in authoritative clinical language, and models treat it like truth.
Rhetorical framing mattered enormously. Most logical fallacy templates reduced acceptance. Appeal to popularity dropped susceptibility from 31.7% to 11.9%. But two framings made things worse:
• Appeal to authority: 34.6% susceptibility
• Slippery slope: 33.9% susceptibility
Models aren't fact-checking. They're using discourse cues as heuristics. "A senior physician says..." or "everyone agrees..." signals credibility regardless of accuracy.
Medical fine-tuning didn't help. Medical fine-tuned models generally performed worse than general models. Higher baseline susceptibility and weaker fallacy detection, likely due to older base models and tuning that prioritized refusal patterns over robustness.
Here’s the practical implication: If you're deploying note summarization, chart review copilots, or after-visit summaries, the primary risk isn't just hallucination. It's authoritative-style fabrication acceptance. LLMs will absorb false medical recommendations if they're packaged in clinical prose. Mitigations need fact-grounding and context-aware guardrails, not just prompt engineering.
This is the opposite of how we train clinicians. We teach physicians to question authority, verify claims, and think independently. Then we deploy AI that accepts recommendations based on how professional they sound.
Does your clinical AI verify medical claims, or does it just predict what sounds authoritative?
— Source: Lancet Digital Health (2026)
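A hedged sketch of the kind of framing experiment described above: the same fabricated claim is wrapped in several presentation styles and the acceptance rate is measured per style. The prompt templates and the crude accept/reject heuristic below are illustrative assumptions, not the study's actual protocol.

```python
# Sketch: measure how often a model accepts the same false claim when it is
# wrapped in different presentation styles (clinical prose, forum post, appeal
# to authority). Templates and parser are placeholders.

FRAMINGS = {
    "clinical_note": "Discharge summary excerpt: {claim} Continue as documented.",
    "forum_post": "Saw this on a health forum: {claim} Is that right?",
    "authority": "A senior physician recommends the following: {claim}",
}


def accepts(model_reply: str) -> bool:
    """Crude heuristic: count the reply as acceptance unless it pushes back."""
    reply = model_reply.lower()
    return not any(k in reply for k in ("incorrect", "not recommended", "no evidence", "unsafe"))


def susceptibility(ask_model, false_claims):
    """ask_model: callable(prompt) -> reply text; false_claims: list of fabricated claims."""
    rates = {}
    for name, template in FRAMINGS.items():
        hits = sum(accepts(ask_model(template.format(claim=c))) for c in false_claims)
        rates[name] = hits / len(false_claims)
    # A higher rate for 'clinical_note' or 'authority' would mirror the study's finding.
    return rates
```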
-
AI Hallucination Risk in Oncology, CNS, and Rare Disease R&D
As artificial intelligence (AI) is adopted across biomedical research, drug development, and clinical care, its limitations must be addressed with scientific rigor. A central risk is hallucination: the generation of confident but incorrect or unsupported information. In oncology, central nervous system (CNS) disorders, and rare diseases, hallucination is not a technical flaw—it is a scientific, regulatory, and clinical liability.
Key Scientific Use Cases at Risk
1. Target Identification & Pathway Biology
Generative models may infer causality from co-occurrence rather than mechanism, leading to false target prioritization and increased early-phase attrition—already exceeding 90% across drug development.¹
2. Biomarker Discovery & Patient Stratification
AI can overstate or fabricate predictive or prognostic biomarkers, particularly in sparse datasets. In rare diseases, this risks misaligned inclusion criteria and reduced regulatory confidence.²
3. Clinical Trial Design & Endpoint Selection
LLMs may incorrectly generalize endpoints, comparators, or statistical assumptions across indications. In CNS trials—highly sensitive to placebo effects—this can result in underpowered or non-registrational outcomes.³
4. Safety Surveillance & Signal Detection
Hallucinated adverse-event profiles or misattributed class effects can distort pharmacovigilance and benefit–risk assessments, especially for oncology combinations and orphan drugs with limited exposure.⁴
5. Clinical Decision Support & Guidelines Interpretation
AI systems may inaccurately summarize or blend NCCN, ESMO, or disease-specific recommendations, directly affecting treatment sequencing and patient outcomes.⁵
Root Causes
• Probabilistic language modeling rather than causal reasoning
• Bias toward high-frequency publications
• Static knowledge without real-time validation
• Optimization for fluency over uncertainty
Control Imperatives
• Retrieval-augmented generation (RAG) anchored to peer-reviewed and regulatory sources
• Mandatory human-in-the-loop scientific review
• Verifiable citations and provenance tracking
• Domain-constrained models aligned to biomedical ontologies
Conclusion
In oncology, CNS, and rare diseases—where each trial and patient carries disproportionate weight—AI must function strictly as decision support, not authority. Verification-first architectures are essential to scientific credibility, regulatory success, and patient safety.
References
1. Arrowsmith J, Miller P. Nat Rev Drug Discov. 2013.
2. FDA. Biomarker Qualification Program Guidance.
3. Kola I, Landis J. Nat Rev Drug Discov. 2004.
4. EMA & FDA. Pharmacovigilance and Risk Management Guidance.
5. NCCN and ESMO Clinical Practice Guidelines.
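The first three control imperatives above (RAG anchored to curated sources, human-in-the-loop review as the fallback, and verifiable citations) can be sketched roughly as below. The retriever, corpus, and LLM interfaces are placeholders for illustration, not a specific product's API.

```python
# Sketch of a verification-first answer path: respond only from retrieved,
# curated passages, attach their provenance, and route anything uncited or
# unsupported to human review.

def retrieve(query: str, corpus, k: int = 5):
    """Placeholder: return top-k passages from a curated corpus as (source_id, text)."""
    raise NotImplementedError


def grounded_answer(llm, query: str, corpus):
    passages = retrieve(query, corpus)
    if not passages:
        return {"answer": None, "status": "no curated evidence found; escalate to reviewer"}
    context = "\n\n".join(f"[{sid}] {text}" for sid, text in passages)
    prompt = (
        "Answer strictly from the numbered sources below. Cite the source id "
        "after every claim. If the sources do not answer the question, say so.\n\n"
        + context + "\n\nQuestion: " + query
    )
    draft = llm(prompt)
    cited = [sid for sid, _ in passages if f"[{sid}]" in draft]
    # Provenance check: a draft with no verifiable citations is never released automatically.
    status = "ok" if cited else "uncited draft; route to human-in-the-loop review"
    return {"answer": draft, "citations": cited, "status": status}
```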
-
Everyone’s worried about GenAI hallucinations. Fake facts, wrong doses, phantom studies.
But what if the real danger is 𝘤𝘰𝘨𝘯𝘪𝘵𝘪𝘷𝘦 𝘪𝘯𝘧𝘦𝘤𝘵𝘪𝘰𝘯? What if GenAI subtly reshapes how your doctor thinks, what she assumes, how she defaults, and what she believes about you?
This recent NYT piece showed what happens when models drift in long conversations: delusion, detachment, reality distortion. It made me realize we are NOT talking about this at all in medicine, despite evidence of broad GenAI usage already in clinical care.
I’ve had these failure modes zipping around in my head, and what they might look like if your doctor uses GenAI for everything:
1️⃣ 𝗗𝗲𝗴𝗿𝗮𝗱𝗮𝘁𝗶𝗼𝗻 𝗼𝘃𝗲𝗿 𝘁𝗶𝗺𝗲. Like the NYT example: the longer the chat goes, the weirder it gets. Now picture a hospitalized patient with weeks of notes, each auto-drafted by the same GenAI tool. Does the model start suggesting snake oil or bizarre diagnostic nonsense?
2️⃣ 𝗠𝗼𝗱𝗲𝗹 𝗽𝗼𝗹𝗮𝗿𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗼𝘃𝗲𝗿 𝘁𝗶𝗺𝗲. Like #1, LLMs also appear to polarize over time. Imagine these models shaping how your doctor thinks about your goals of care based on subtle information earlier in the chat history, either recommending everyone be full code or DNR.
3️⃣ 𝗢𝘃𝗲𝗿𝗹𝘆 𝗮𝗴𝗿𝗲𝗲𝗮𝗯𝗹𝗲, 𝘀𝘆𝗰𝗼𝗽𝗵𝗮𝗻𝘁𝗶𝗰 𝗯𝗲𝗵𝗮𝘃𝗶𝗼𝗿. ChatGPT loves telling me I’m brilliant. But medicine requires friction to learn. “Why appendicitis? What else could it be?” LLMs don’t push back unless you prompt them to, and that's literally how trainees learn. So if a resident relies on GenAI, are they getting sharper, or just cheered on in their flawed thinking?
4️⃣ 𝗟𝗲𝘀𝘀 𝗰𝗿𝗲𝗮𝘁𝗶𝘃𝗲 𝗽𝗿𝗼𝗯𝗹𝗲𝗺-𝘀𝗼𝗹𝘃𝗶𝗻𝗴. LLMs don’t imagine. They autocomplete. They give you what’s most 𝙨𝙩𝙖𝙩𝙞𝙨𝙩𝙞𝙘𝙖𝙡𝙡𝙮 𝙡𝙞𝙠𝙚𝙡𝙮, and that’s not always what’s 𝙘𝙡𝙞𝙣𝙞𝙘𝙖𝙡𝙡𝙮 𝙣𝙚𝙚𝙙𝙚𝙙. So much of medicine is taking the unique patient in front of you and figuring it all out, using your knowledge and experience creatively to come up with a plan.
5️⃣ 𝗥𝗲𝘀𝗶𝘀𝘁𝗮𝗻𝗰𝗲 𝘁𝗼 𝗻𝗲𝘄 𝗲𝘃𝗶𝗱𝗲𝗻𝗰𝗲. Say a paper drops tomorrow, like HIGH DOSE ANTIBIOTICS CURE MELANOMA. No guidelines yet. Just one perfect study. How long before your GenAI integrates it? Months? Years? Ever? How does the model learn new medical knowledge when it's spouting off the most popular stuff?
You know I'm not the "ban GenAI in healthcare" guy. I’m saying: build playbooks. Write safety protocols. (My god, healthcare is great at 𝘴𝘢𝘧𝘦𝘵𝘺 𝘱𝘳𝘰𝘵𝘰𝘤𝘰𝘭𝘴.) Pressure test these tools like we would any other clinical intervention.
Because if GenAI is getting implanted into the brainstem of medical practice, it better not hallucinate, drift, or flatter us into snake oil.
Tomorrow, I’ll share some thoughts about how we can actually build guardrails that work. (And not just for trainees; I think attendings are just as vulnerable.)
-
When AI hallucinates in healthcare, it’s not embarrassing. It’s dangerous.
OpenAI admits why it happens: models get rewarded for sounding confident, even when they should say “I don’t know.”
We’re already seeing it:
• Draft notes inventing diagnoses in Epic + Copilot pilots
• Fabricated citations in PubMed/GPT search tools
• Antibiotics for viral infections from consumer symptom checkers
• A JAMA study showing GPT-4 gave false oncology advice nearly 1/3 of the time
These aren’t glitches. They’re a pattern.
AI in healthcare must be rewarded for humility, not punished for it. Because a wrong confident answer doesn’t just break trust. It risks lives.
👉 Have you seen an AI hallucination in your workflow?
#Healthcare #AI #AIinHealthcare
Spencer Dorn, Graham Walker, MD, Peter Bonis, Sheila Bond, M.D.
-
The FDA published a new article in the Journal of Artificial Intelligence in the Life Sciences defining hallucinations as "plausible errors" in AI medical devices that differ fundamentally from conventional imaging artifacts.
The FDA now expects:
• Formal hallucination detection methods (like sFRC analysis)
• Multi-reader studies to establish plausibility thresholds
• Trade-off documentation between image quality and diagnostic accuracy
Traditional validation approaches miss these subtle, plausible errors that circumvent clinical intuition. Read more about how to plan for these here: https://hubs.li/Q03St4bm0
Key regulatory implications: The paper distinguishes hallucinations from artifacts like Gibbs ringing or aliasing that clinicians recognize. AI hallucinations appear diagnostically valid while containing fabricated structures. Example: AI-enhanced CT adding phantom bowel loops that experienced radiologists cannot distinguish from real anatomy.
For manufacturers pursuing 510(k) clearance:
• Demonstrate hallucination detection methods (FDA specifically references sFRC analysis)
• Include multi-reader studies to establish plausibility thresholds
• Document performance trade-offs between image enhancement and diagnostic accuracy
• Address hallucination risks in your risk management file per ISO 14971
The paper warns that data-driven reconstruction methods become increasingly unstable as measurement quality decreases. This has direct implications for low-dose imaging algorithms and accelerated MRI reconstruction.
Practical impact: Jensen et al. (Radiology 2022) found AI-reconstructed images received higher subjective quality scores but had inferior detection performance for metastatic liver lesions. The enhanced appearance masked diagnostic limitations.
PCCPs must now account for hallucination monitoring. Static validation at a single timepoint is insufficient when models can develop new failure modes through retraining.
This fundamentally changes how we approach AI/ML validation. Traditional metrics (MSE, SSIM) miss these clinically relevant errors.
At Innolitics, we've integrated hallucination assessment into our AI/ML development framework from day one. Our approach combines:
✓ Stability testing during development (not just at validation)
✓ Task-specific performance metrics beyond MSE/SSIM
✓ PCCP strategies that account for hallucination drift
The paper explicitly warns: "Every poor-quality system deployed further degrades trust in AI" - a single hallucination event can destroy years of clinician confidence.
Our proven framework addresses hallucination risks while maintaining diagnostic performance. You can read more about it here: https://hubs.li/Q03St4Vl0
#FDARadiology #510k #PCCP #Radiology #AIRegulation #SaMDRegulation #MedicalAISafety #FDACleared #FDASubmission #AIMLRegulation #RSNA
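One generic way to probe the instability the paper warns about is a perturbation test: small changes to the measured data should not produce large, structured changes in the reconstructed image. The sketch below illustrates that idea only; it is not the sFRC analysis the FDA paper references, and the noise scale and trial count are assumptions.

```python
# Sketch: perturbation-based stability check for a learned image reconstruction.
# Large output change per unit of input change suggests the kind of instability
# that can surface as hallucinated structure in low-quality measurements.
import numpy as np


def stability_ratio(reconstruct, measurements: np.ndarray,
                    noise_scale: float = 1e-3, trials: int = 10):
    """reconstruct: callable mapping raw measurements -> reconstructed image array."""
    baseline = reconstruct(measurements)
    ratios = []
    for _ in range(trials):
        delta = noise_scale * np.random.randn(*measurements.shape)
        perturbed = reconstruct(measurements + delta)
        # Ratio of output change to input change, both measured with an L2 norm.
        ratios.append(np.linalg.norm(perturbed - baseline) / np.linalg.norm(delta))
    return float(np.mean(ratios)), float(np.max(ratios))
```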
-
🚨 𝗡𝗼𝘁 𝗮𝗹𝗹 𝗔𝗜 𝗺𝗶𝘀𝘁𝗮𝗸𝗲𝘀 𝗮𝗿𝗲 𝗰𝗿𝗲𝗮𝘁𝗲𝗱 𝗲𝗾𝘂𝗮𝗹! I’ve noticed people lump everything under “hallucination,” but that’s not quite right. Here’s a simple explanation of how AI mistakes differ, with clear examples and why it matters:
𝟭. 𝗘𝘅𝘁𝗿𝗶𝗻𝘀𝗶𝗰 𝗛𝗮𝗹𝗹𝘂𝗰𝗶𝗻𝗮𝘁𝗶𝗼𝗻
This is when the AI just makes something up that isn’t true.
Example:
𝗨𝘀𝗲𝗿: “Where was the latest Summer Olympics?”
𝗔𝗜: “The most recent Summer Olympics took place in Cape Town.” (This never happened.)
The model fabricates details that don’t exist.
𝟮. 𝗜𝗻𝘁𝗿𝗶𝗻𝘀𝗶𝗰 𝗛𝗮𝗹𝗹𝘂𝗰𝗶𝗻𝗮𝘁𝗶𝗼𝗻
Here, the AI contradicts the source it’s supposed to follow.
Example: An uploaded fictional document says: “The 2024 Olympics took place on Mars.”
𝗨𝘀𝗲𝗿: “According to the uploaded document, where were the 2024 Olympics held?”
𝗔𝗜: “Paris, France.” (True in real life, but wrong according to the source.)
The output might be factually correct, but it’s still hallucinating relative to the given context.
𝟯. 𝗙𝗮𝗰𝘁𝘂𝗮𝗹 𝗘𝗿𝗿𝗼𝗿 (𝗞𝗻𝗼𝘄𝗹𝗲𝗱𝗴𝗲 𝗖𝘂𝘁-𝗢𝗳𝗳)
Arguably, this isn’t a hallucination at all. It’s about the model not knowing newer information because it was trained on data only up to a certain date (the “knowledge cut-off”).
Example: The model’s cut-off is September 2023.
𝗨𝘀𝗲𝗿: “When was the latest Summer Olympics?”
𝗔𝗜: “Tokyo, Japan, 2021.” (It doesn’t “know” Paris 2024 happened.)
Knowledge cut-off is the last point in time when the AI’s training data stopped. Anything after that date is a blank spot unless the model is updated or connected to live data.
𝗪𝗵𝘆 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀: In healthcare, for example, these mistakes can have real consequences.
👉 Extrinsic Hallucination (Made-up fiction)
Example: An AI clinical assistant says a treatment was approved by the FDA in 2018 when no such approval exists. A clinician relying on this could recommend an unsafe or unproven therapy.
👉 Intrinsic Hallucination (Conflicts with the source)
Example: A patient’s record clearly states “No penicillin due to allergy,” but the AI, when summarizing, writes “Patient can take penicillin safely.” Even though penicillin is safe for most people, it’s wrong for this patient.
👉 Factual Error (Old or missing knowledge, not fabrication)
Example: A model trained before 2023 suggests a drug dosage guideline that changed in 2024. The recommendation isn’t “hallucinated”; it’s just outdated, which could lead to underdosing or overdosing.
𝗜𝗻 𝗮 𝗡𝘂𝘁𝘀𝗵𝗲𝗹𝗹
Knowing which type of error you’re dealing with helps you respond correctly:
Extrinsic? Flag as false and verify from a trusted source.
Intrinsic? Check against the source document (e.g., the patient record) immediately.
Factual error? Update your AI’s knowledge or pair it with live (and trusted) data.
🙏 I hope this helps!
(Image source and an insightful paper on this topic in the comments.)
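To make the "In a Nutshell" advice concrete, here is a toy Python triage sketch that routes a suspect AI statement depending on whether it conflicts with the supplied source document (intrinsic), is unsupported anywhere (extrinsic), or is merely stale (knowledge cut-off). The checker functions, parameters, and date handling are placeholders for real verification steps, not an established library.

```python
# Toy triage: decide which kind of error you are dealing with and how to respond.

def contradicts_source(statement: str, source_document: str) -> bool:
    """Placeholder: e.g. an NLI model or rule check against the patient record."""
    raise NotImplementedError


def supported_by_trusted_source(statement: str) -> bool:
    """Placeholder: lookup against a curated, current knowledge base."""
    raise NotImplementedError


def triage(statement: str, source_document: str,
           model_cutoff: str, topic_last_updated: str) -> str:
    # model_cutoff and topic_last_updated are ISO date strings, e.g. "2023-09".
    if contradicts_source(statement, source_document):
        return "intrinsic hallucination: recheck the source document before acting"
    if not supported_by_trusted_source(statement):
        return "extrinsic hallucination: flag as false and verify independently"
    if topic_last_updated > model_cutoff:
        return "possible knowledge cut-off issue: confirm against current guidance"
    return "no issue detected by these checks (still subject to clinical review)"
```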
-
Can we trust large language models (LLMs) in clinical medicine? Our latest research suggests caution.
In a new study published in Communications Medicine, we systematically tested the vulnerability of several LLMs (including GPT-4o) to adversarial hallucination attacks—situations where models generate detailed yet entirely false medical information when prompted with a single fabricated piece of information, such as an invented lab test like “Black Blood Cells” or a fictional disease such as “Faulkenstein Syndrome.”
We found that:
• 50–83% of responses contained clinically plausible but completely fabricated details.
• Prompt-based safeguards helped reduce errors by half, but no method eliminated hallucinations entirely.
• Even the strongest model tested (GPT-4o) showed significant vulnerability.
Our findings underline the importance of developing rigorous assurance frameworks and controlled "sandbox" environments to validate LLMs before they enter routine clinical use.
Full paper available here (open access): https://lnkd.in/d3dncgyT
Led by Mahmud Omar and Eyal Klang, with David Reich, Robbie Freeman, Lisa Stump, Nicholas Gavin, Alexander Charney, Vera Sorin, MD, CIIP, Jeremy Collins, and Nicola Luigi Bragazzi.
Curious to hear your thoughts—how can we best balance innovation and safety in AI-driven healthcare?
#AIinMedicine #ClinicalAI #DigitalHealth #PatientSafety #LLMs #GPT4o #AIassurance #SandboxTesting
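For context on what a prompt-based safeguard of the sort evaluated here might look like, a hedged sketch follows. The preamble wording and the llm interface are assumptions for illustration, not the exact mitigation prompts used in the study.

```python
# Sketch: wrap every query in a safeguard preamble that tells the model to
# refuse elaboration on clinical terms it cannot verify, rather than invent
# plausible details about them.

SAFEGUARD_PREAMBLE = (
    "Before answering, check every clinical term in the question against your "
    "knowledge. If a test, finding, or disease name is not one you can verify, "
    "say that you cannot confirm it exists and do not elaborate on it."
)


def guarded_query(llm, user_prompt: str) -> str:
    """llm: callable(system_prompt, user_prompt) -> reply text (placeholder interface)."""
    return llm(SAFEGUARD_PREAMBLE, user_prompt)

# e.g. guarded_query(llm, "What is the normal range for Black Blood Cells?")
# should ideally decline rather than invent a reference interval.
```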