Answering real-world clinical questions using large language model based systems
Overview
- The paper explores whether large language models (LLMs) can address two obstacles to evidence-based care: a shortage of relevant, trustworthy research literature and the difficulty of contextualizing existing research for a specific patient.
- The researchers evaluated the ability of five LLM-based systems to answer 50 clinical questions, with nine physicians reviewing the responses for relevance, reliability, and actionability.
- The results suggest that while general-purpose LLMs perform poorly, purpose-built systems combining retrieval-augmented generation (RAG) and novel evidence generation could significantly improve the availability of pertinent evidence for patient care.
Plain English Explanation
Healthcare providers often struggle to find the right information to guide their decisions for individual patients. This is because the existing research literature may not be directly relevant or trustworthy enough. Large language models (LLMs) could potentially help by either summarizing the published research or generating new studies based on real-world data.
The researchers in this study tested the ability of five different LLM-based systems to answer 50 clinical questions. They had nine experienced doctors review the responses to see how relevant, reliable, and useful they were for making healthcare decisions.
The general-purpose LLMs (ChatGPT-4, Claude 3 Opus, and Gemini Pro 1.5) performed poorly, with only 2-10% of their answers being considered relevant and evidence-based. The purpose-built systems did much better: OpenEvidence, which uses retrieval-augmented generation (RAG), produced relevant and evidence-based answers for 24% of the questions, and ChatRWD, an agentic system that can generate novel evidence, did so for 58%.
Importantly, ChatRWD was the only system that could reliably answer novel questions, which the other systems rarely or never managed (65% versus 0-9%). This suggests that a combination of summarizing existing research and generating new evidence could greatly improve the availability of useful information for healthcare providers.
Technical Explanation
The researchers evaluated the performance of five LLM-based systems in answering 50 clinical questions:
- ChatGPT-4, a general-purpose LLM
- Claude 3 Opus, another general-purpose LLM
- OpenEvidence, a retrieval-augmented generation (RAG) system
- ChatRWD, an agentic LLM that can generate novel evidence
- Gemini Pro 1.5, a general-purpose LLM
Nine independent physicians reviewed the responses from these systems and assessed them for relevance, reliability, and actionability. The results showed that the general-purpose LLMs (ChatGPT-4, Claude 3 Opus, and Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2-10%).
In contrast, the RAG-based and agentic LLM systems performed much better. OpenEvidence produced relevant and evidence-based answers for 24% of the questions, while ChatRWD was able to do so for 58% of the questions.
Importantly, only the ChatRWD system was able to answer novel questions, doing so for 65% of them. The other systems managed only 0-9% in this regard.
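The percentages above are per-system shares of questions whose answers were judged relevant and evidence-based. As a rough illustration of that arithmetic only, here is a minimal Python sketch. It assumes each question already has a single yes/no judgment per system; the summary does not say how the nine physicians' reviews were combined, so that step is left out. The function name, the system names, and the toy data are all hypothetical and are not the study's ratings.

```python
from collections import defaultdict


def share_relevant(ratings):
    """Return, per system, the percentage of questions judged relevant and evidence-based.

    `ratings` is an iterable of (system, question_id, judged_ok) tuples, where
    `judged_ok` is an already-aggregated yes/no judgment (an assumption of this sketch).
    """
    totals = defaultdict(int)  # questions seen per system
    hits = defaultdict(int)    # questions judged relevant and evidence-based
    for system, _question_id, judged_ok in ratings:
        totals[system] += 1
        if judged_ok:
            hits[system] += 1
    return {system: 100.0 * hits[system] / totals[system] for system in totals}


# Made-up toy data for illustration; NOT the ratings collected in the study.
toy_ratings = [
    ("system_a", 1, True), ("system_a", 2, False),
    ("system_a", 3, False), ("system_a", 4, True),
    ("system_b", 1, False), ("system_b", 2, False),
    ("system_b", 3, False), ("system_b", 4, True),
]

result = share_relevant(toy_ratings)
for system, pct in sorted(result.items()):
    print(f"{system}: {pct:.0f}% of questions")
```

The same tally could be restricted to a subset of questions (for example, only the novel ones) by filtering the input before calling the function.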
Critical Analysis
The paper highlights the limitations of general-purpose LLMs in the healthcare domain and the potential benefits of more specialized systems that combine research summarization and novel evidence generation.
However, the study does not provide detailed information on the specific architectures, training data, or other technical details of the LLM-based systems tested. This makes it difficult to fully assess the generalizability of the findings or to understand the trade-offs and design choices that led to the observed performance differences.
Additionally, the study evaluated the systems on a relatively small set of 50 clinical questions, which may not be representative of the full range of challenges healthcare providers face. Comprehensive surveys of LLMs in healthcare and medicine could give a fuller picture of the current state of the field and its likely future directions.
Conclusion
This study suggests that while general-purpose LLMs are not yet ready to be used directly in healthcare decision-making, a combination of research summarization and novel evidence generation could significantly improve the availability of relevant and trustworthy information for patient care.
By developing purpose-built LLM systems that can effectively leverage real-world data and existing literature, healthcare providers may be able to access the evidence they need to make more informed decisions, ultimately leading to better patient outcomes.
This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
Related Papers
Answering real-world clinical questions using large language model based systems
Yen Sia Low (Atropos Health, New York NY, USA), Michael L. Jackson (Atropos Health, New York NY, USA), Rebecca J. Hyde (Atropos Health, New York NY, USA), Robert E. Brown (Atropos Health, New York NY, USA), Neil M. Sanghavi (Atropos Health, New York NY, USA), Julian D. Baldwin (Atropos Health, New York NY, USA), C. William Pike (Atropos Health, New York NY, USA), Jananee Muralidharan (Atropos Health, New York NY, USA), Gavin Hui (Atropos Health, New York NY, USA, Department of Medicine, University of California, Los Angeles CA, USA), Natasha Alexander (Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada), Hadeel Hassan (Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada), Rahul V. Nene (Department of Emergency Medicine, University of California, San Diego CA, USA), Morgan Pike (Department of Emergency Medicine, University of Michigan, Ann Arbor MI, USA), Courtney J. Pokrzywa (Department of Surgery, Columbia University, New York NY, USA), Shivam Vedak (Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA), Adam Paul Yan (Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada), Dong-han Yao (Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA), Amy R. Zipursky (Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada), Christina Dinh (Atropos Health, New York NY, USA), Philip Ballentine (Atropos Health, New York NY, USA), Dan C. Derieg (Atropos Health, New York NY, USA), Vladimir Polony (Atropos Health, New York NY, USA), Rehan N. Chawdry (Atropos Health, New York NY, USA), Jordan Davies (Atropos Health, New York NY, USA), Brigham B. Hyde (Atropos Health, New York NY, USA), Nigam H. Shah (Atropos Health, New York NY, USA, Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA), Saurabh Gombar (Atropos Health, New York NY, USA, Department of Pathology, Stanford University, Stanford CA, USA)
Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2% - 10%). In contrast, retrieval augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions compared to other LLMs (65% vs. 0-9%). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG and one for generating novel evidence working synergistically would improve availability of pertinent evidence for patient care.
7/2/2024
Natural Language Programming in Medicine: Administering Evidence Based Clinical Workflows with Autonomous Agents Powered by Generative Large Language Models
Akhil Vaid, Joshua Lampert, Juhee Lee, Ashwin Sawant, Donald Apakama, Ankit Sakhuja, Ali Soroush, Sarah Bick, Ethan Abbott, Hernando Gomez, Michael Hadley, Denise Lee, Isotta Landi, Son Q Duong, Nicole Bussola, Ismail Nabeel, Silke Muehlstedt, Robert Freeman, Patricia Kovatch, Brendan Carr, Fei Wang, Benjamin Glicksberg, Edgar Argulian, Stamatios Lerakis, Rohan Khera, David L. Reich, Monica Kraft, Alexander Charney, Girish Nadkarni
Generative Large Language Models (LLMs) hold significant promise in healthcare, demonstrating capabilities such as passing medical licensing exams and providing clinical knowledge. However, their current use as information retrieval tools is limited by challenges like data staleness, resource demands, and occasional generation of incorrect information. This study assessed the potential of LLMs to function as autonomous agents in a simulated tertiary care medical center, using real-world clinical cases across multiple specialties. Both proprietary and open-source LLMs were evaluated, with Retrieval Augmented Generation (RAG) enhancing contextual relevance. Proprietary models, particularly GPT-4, generally outperformed open-source models, showing improved guideline adherence and more accurate responses with RAG. The manual evaluation by expert clinicians was crucial in validating models' outputs, underscoring the importance of human oversight in LLM operation. Further, the study emphasizes Natural Language Programming (NLP) as the appropriate paradigm for modifying model behavior, allowing for precise adjustments through tailored prompts and real-world interactions. This approach highlights the potential of LLMs to significantly enhance and supplement clinical decision-making, while also emphasizing the value of continuous expert involvement and the flexibility of NLP to ensure their reliability and effectiveness in healthcare settings.
8/23/2024
Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation
Jack Krolik, Herprit Mahal, Feroz Ahmad, Gaurav Trivedi, Bahador Saket
This paper explores the potential of using Large Language Models (LLMs) to automate the evaluation of responses in medical Question and Answer (Q&A) systems, a crucial form of Natural Language Processing. Traditionally, human evaluation has been indispensable for assessing the quality of these responses. However, manual evaluation by medical professionals is time-consuming and costly. Our study examines whether LLMs can reliably replicate human evaluations by using questions derived from patient data, thereby saving valuable time for medical experts. While the findings suggest promising results, further research is needed to address more specific or complex questions that were beyond the scope of this initial investigation.
9/4/2024
How do you know that? Teaching Generative Language Models to Reference Answers to Biomedical Questions
Bojana Bašaragin, Adela Ljajić, Darija Medvecki, Lorenzo Cassano, Miloš Košprdić, Nikola Milošević
Large language models (LLMs) have recently become the leading source of answers for users' questions online. Despite their ability to offer eloquent answers, their accuracy and reliability can pose a significant challenge. This is especially true for sensitive domains such as biomedicine, where there is a higher need for factually correct answers. This paper introduces a biomedical retrieval-augmented generation (RAG) system designed to enhance the reliability of generated responses. The system is based on a fine-tuned LLM for the referenced question-answering, where retrieved relevant abstracts from PubMed are passed to LLM's context as input through a prompt. Its output is an answer based on PubMed abstracts, where each statement is referenced accordingly, allowing the users to verify the answer. Our retrieval system achieves an absolute improvement of 23% compared to the PubMed search engine. Based on the manual evaluation on a small sample, our fine-tuned LLM component achieves comparable results to GPT-4 Turbo in referencing relevant abstracts. We make the dataset used to fine-tune the models and the fine-tuned models based on Mistral-7B-instruct-v0.1 and v0.2 publicly available.
7/9/2024