Related papers: LLM Robustness Against Misinformation in Biomedical Question Answering

LLM Robustness Against Misinformation in Biomedical Question Answering

URL: http://arxiv.org/abs/2410.21330v1
Date: Sun, 27 Oct 2024 16:23:26 GMT
Title: LLM Robustness Against Misinformation in Biomedical Question Answering
Authors: Alexander Bondarenko, Adrian Viehweger,
Abstract summary: The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering. We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
Score: 50.98256373698759
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering by retrieving and providing additional context coming from external knowledge sources (e.g., by adding the context to the prompt). However, injecting incorrect information can mislead the LLM to generate an incorrect answer. In this paper, we evaluate the effectiveness and robustness of four LLMs against misinformation - Gemma 2, GPT-4o-mini, Llama~3.1, and Mixtral - in answering biomedical questions. We assess the answer accuracy on yes-no and free-form questions in three scenarios: vanilla LLM answers (no context is provided), "perfect" augmented generation (correct context is provided), and prompt-injection attacks (incorrect context is provided). Our results show that Llama 3.1 (70B parameters) achieves the highest accuracy in both vanilla (0.651) and "perfect" RAG (0.802) scenarios. However, the accuracy gap between the models almost disappears with "perfect" RAG, suggesting its potential to mitigate the LLM's size-related effectiveness differences. We further evaluate the ability of the LLMs to generate malicious context on one hand and the LLM's robustness against prompt-injection attacks on the other hand, using metrics such as attack success rate (ASR), accuracy under attack, and accuracy drop. As adversaries, we use the same four LLMs (Gemma 2, GPT-4o-mini, Llama 3.1, and Mixtral) to generate incorrect context that is injected in the target model's prompt. Interestingly, Llama is shown to be the most effective adversary, causing accuracy drops of up to 0.48 for vanilla answers and 0.63 for "perfect" RAG across target models. Our analysis reveals that robustness rankings vary depending on the evaluation measure, highlighting the complexity of assessing LLM resilience to adversarial attacks.

Related papers

Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs [5.161416961439468]
This study proposes an effective uncertainty estimation approach, textbfClusttextbfering-based semtextbfantic contextbfsisttextbfency (textbfCleanse)<n>The effectiveness of Cleanse for detecting hallucination is validated using four off-the-shelf models, LLaMA-7B, LLaMA-13B, LLaMA2-7B and Mistral-7B.
arXiv Detail & Related papers (2025-07-19T14:48:24Z)
Improving Automatic Evaluation of Large Language Models (LLMs) in Biomedical Relation Extraction via LLMs-as-the-Judge [7.064104563689608]
Large Language Models (LLMs) have demonstrated impressive performance in biomedical relation extraction.<n>This paper investigates the use of LLMs-as-the-Judge as an alternative evaluation method for biomedical relation extraction.
arXiv Detail & Related papers (2025-06-01T02:01:52Z)
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and AND''/OR' signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
Verbosity $\neq$ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models [8.846200844870767]
We discover an understudied type of undesirable behavior of Large Language Models (LLMs) We term Verbosity Compensation (VC) as similar to the hesitation behavior of humans under uncertainty. We propose a simple yet effective cascade algorithm that replaces verbose responses with the other model-generated responses.
arXiv Detail & Related papers (2024-11-12T15:15:20Z)
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline. Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
arXiv Detail & Related papers (2024-10-09T01:25:10Z)
Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial. Our research investigates the fragility of uncertainty estimation and explores potential attacks. We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z)
WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs) In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z)
Assessing LLMs for Zero-shot Abstractive Summarization Through the Lens of Relevance Paraphrasing [37.400757839157116]
Large Language Models (LLMs) have achieved state-of-the-art performance at zero-shot generation of abstractive summaries for given articles. We propose relevance paraphrasing, a simple strategy that can be used to measure the robustness of LLMs as summarizers.
arXiv Detail & Related papers (2024-06-06T12:08:43Z)
Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs [9.624124576891075]
Existing alignment methods can lead large language models (LLMs) to be Adaptive Chameleons when external evidence conflicts with their parametric memory. We propose a novel framework: Dialectical Alignment (DA), which utilizes AI feedback to identify optimal strategies for LLMs to navigate inter-context conflicts. Our experiments show that the DA improves poisoned data attack defense by 20 and does not require any additional prompt engineering.
arXiv Detail & Related papers (2024-03-30T22:41:05Z)
Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately [2.1715455600756646]
Large Language Models (LLMs) generate responses to questions. Their effectiveness is often hindered by sub-optimal quality of answers and occasional failures to provide accurate responses to questions. To address these challenges, a fine-tuning process is employed, involving feedback and examples to refine models.
arXiv Detail & Related papers (2024-01-27T00:18:07Z)
Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration [32.15773300068426]
Membership Inference Attacks aim to infer whether a target data record has been utilized for model training. We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA)
arXiv Detail & Related papers (2023-11-10T13:55:05Z)
Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness [16.35655151252159]
We introduce BSDetector, a method for detecting bad and speculative answers from a pretrained Large Language Model. Our uncertainty quantification technique works for any LLM accessible only via a black-box API.
arXiv Detail & Related papers (2023-08-30T17:53:25Z)
LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
Statistical Knowledge Assessment for Large Language Models [79.07989821512128]
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers? We propose KaRR, a statistical approach to assess factual knowledge for LLMs. Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.
arXiv Detail & Related papers (2023-05-17T18:54:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.