OLAPH: Improving Factuality in Biomedical Long-form Question Answering
- URL: http://arxiv.org/abs/2405.12701v1
- Date: Tue, 21 May 2024 11:50:16 GMT
- Title: OLAPH: Improving Factuality in Biomedical Long-form Question Answering
- Authors: Minbyul Jeong, Hyeon Hwang, Chanwoong Yoon, Taewhoo Lee, Jaewoo Kang,
- Abstract summary: We introduce MedLFQA, a benchmark dataset reconstructed using long-form question-answering datasets related to the biomedical domain.
We also propose OLAPH, a simple and novel framework that enables the improvement of factuality through automatic evaluations.
Our findings reveal that a 7B LLM trained with our OLAPH framework can provide long answers comparable to the medical experts' answers in terms of factuality.
- Score: 15.585833125854418
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the medical domain, numerous scenarios necessitate the long-form generation ability of large language models (LLMs). Specifically, when addressing patients' questions, it is essential that the model's response conveys factual claims, highlighting the need for an automated method to evaluate those claims. Thus, we introduce MedLFQA, a benchmark dataset reconstructed using long-form question-answering datasets related to the biomedical domain. We use MedLFQA to facilitate the automatic evaluations of factuality. We also propose OLAPH, a simple and novel framework that enables the improvement of factuality through automatic evaluations. The OLAPH framework iteratively trains LLMs to mitigate hallucinations using sampling predictions and preference optimization. In other words, we iteratively set the highest-scoring response as a preferred response derived from sampling predictions and train LLMs to align with the preferred response that improves factuality. We highlight that, even on evaluation metrics not used during training, LLMs trained with our OLAPH framework demonstrate significant performance improvement in factuality. Our findings reveal that a 7B LLM trained with our OLAPH framework can provide long answers comparable to the medical experts' answers in terms of factuality. We believe that our work could shed light on gauging the long-text generation ability of LLMs in the medical domain. Our code and datasets are available at https://github.com/dmis-lab/OLAPH}{https://github.com/dmis-lab/OLAPH.
Related papers
- The Geometry of Queries: Query-Based Innovations in Retrieval-Augmented Generation [1.2839205715237014]
Large Language Models (LLMs) have the potential to significantly improve personal health management for chronic conditions.
LLMs generate responses based on patterns learned from diverse internet data.
Retrieval Augmented Generation (RAG) can help mitigate hallucinations and inaccuracies in RAG responses.
arXiv Detail & Related papers (2024-07-25T13:47:01Z) - How do you know that? Teaching Generative Language Models to Reference Answers to Biomedical Questions [0.0]
Large language models (LLMs) have recently become the leading source of answers for users' questions online.
Despite their ability to offer eloquent answers, their accuracy and reliability can pose a significant challenge.
This paper introduces a biomedical retrieval-augmented generation (RAG) system designed to enhance the reliability of generated responses.
arXiv Detail & Related papers (2024-07-06T09:10:05Z) - Enhancing Biomedical Knowledge Retrieval-Augmented Generation with Self-Rewarding Tree Search and Proximal Policy Optimization [50.26966969163348]
Large Language Models (LLMs) have shown great potential in the biomedical domain with the advancement of retrieval-augmented generation (RAG)
Existing retrieval-augmented approaches face challenges in addressing diverse queries and documents, particularly for medical knowledge queries.
We propose Self-Rewarding Tree Search (SeRTS) based on Monte Carlo Tree Search (MCTS) and a self-rewarding paradigm.
arXiv Detail & Related papers (2024-06-17T06:48:31Z) - Crafting Interpretable Embeddings by Asking LLMs Questions [89.49960984640363]
Large language models (LLMs) have rapidly improved text embeddings for a growing array of natural-language processing tasks.
We introduce question-answering embeddings (QA-Emb), embeddings where each feature represents an answer to a yes/no question asked to an LLM.
We use QA-Emb to flexibly generate interpretable models for predicting fMRI voxel responses to language stimuli.
arXiv Detail & Related papers (2024-05-26T22:30:29Z) - Tool Calling: Enhancing Medication Consultation via Retrieval-Augmented Large Language Models [10.04914417538886]
Large-scale language models (LLMs) have achieved remarkable success across various language tasks but suffer from hallucinations and temporal misalignment.
We propose a new textitDistill-Retrieve-Read framework instead of the previous textitRetrieve-then-Read.
arXiv Detail & Related papers (2024-04-27T13:11:42Z) - Few shot chain-of-thought driven reasoning to prompt LLMs for open ended
medical question answering [25.163347677278182]
We propose a modified version of the MedQA-USMLE dataset, which is subjective, to mimic real-life clinical scenarios.
We develop better in-contrast learning strategies by modifying the 5-shot-codex-CoT-prompt from arXiv:2207.08143 for the subjective MedQA dataset and developing our incremental-reasoning prompt.
arXiv Detail & Related papers (2024-03-07T20:48:40Z) - Large Language Model Distilling Medication Recommendation Model [61.89754499292561]
We harness the powerful semantic comprehension and input-agnostic characteristics of Large Language Models (LLMs)
Our research aims to transform existing medication recommendation methodologies using LLMs.
To mitigate this, we have developed a feature-level knowledge distillation technique, which transfers the LLM's proficiency to a more compact model.
arXiv Detail & Related papers (2024-02-05T08:25:22Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering [42.528771319248214]
Large Language Models (LLMs) often perform poorly on domain-specific tasks like medical question answering (QA)
We propose a comprehensive retrieval strategy to extract medical facts from an external knowledge base, and then inject them into the query prompt for LLMs.
Our retrieval-augmented Vicuna-7B model exhibited an accuracy improvement from 44.46% to 48.54%.
arXiv Detail & Related papers (2023-09-27T21:26:03Z) - Augmenting Black-box LLMs with Medical Textbooks for Clinical Question
Answering [54.13933019557655]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT)
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We found that medical textbooks as a retrieval corpus is proven to be a more effective knowledge database than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z) - Check Your Facts and Try Again: Improving Large Language Models with
External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes a LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.