Minimizing Factual Inconsistency and Hallucination in Large Language
Models
- URL: http://arxiv.org/abs/2311.13878v1
- Date: Thu, 23 Nov 2023 09:58:39 GMT
- Title: Minimizing Factual Inconsistency and Hallucination in Large Language
Models
- Authors: Muneeswaran I, Shreya Saxena, Siva Prasad, M V Sai Prakash, Advaith
Shankar, Varun V, Vishal Vaddina, Saisubramaniam Gopalakrishnan
- Abstract summary: Large Language Models (LLMs) are widely used in critical fields such as healthcare, education, and finance.
We propose a multi-stage framework that generates the rationale first, verifies and refines incorrect ones, and uses them as supporting references to generate the answer.
Our framework improves traditional Retrieval Augmented Generation (RAG) by enabling OpenAI GPT-3.5-turbo to be 14-25% more faithful and 16-22% more accurate on two datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are widely used in critical fields such as
healthcare, education, and finance due to their remarkable proficiency in
various language-related tasks. However, LLMs are prone to generating factually
incorrect responses or "hallucinations," which can lead to a loss of
credibility and trust among users. To address this issue, we propose a
multi-stage framework that generates the rationale first, verifies and refines
incorrect ones, and uses them as supporting references to generate the answer.
The generated rationale enhances the transparency of the answer, and our
framework shows how the model arrived at that answer through the rationale and
its references to the retrieved context. In this paper, we demonstrate
its effectiveness in improving the quality of responses to drug-related
inquiries in the life sciences industry. Our framework improves traditional
Retrieval Augmented Generation (RAG) by enabling OpenAI GPT-3.5-turbo to be
14-25% more faithful and 16-22% more accurate on two datasets. Furthermore,
fine-tuning samples based on our framework improves the accuracy of smaller
open-access LLMs by 33-42% and competes with RAG on commercial models.
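To make the proposed pipeline concrete, below is a minimal sketch of the generate-verify-refine-answer stages described in the abstract, written against the OpenAI Python client. The prompts and the `call_llm` helper are illustrative assumptions rather than the authors' implementation; only the model name (gpt-3.5-turbo) comes from the abstract.

```python
# Minimal sketch of the multi-stage rationale-verify-answer pipeline.
# Prompts and helpers are hypothetical stand-ins, not the paper's code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def call_llm(prompt: str) -> str:
    """One chat completion against GPT-3.5-turbo, the model named in the abstract."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

def answer_with_rationale(question: str, context: str) -> dict:
    # Stage 1: generate a rationale grounded in the retrieved context.
    rationale = call_llm(
        f"Context:\n{context}\n\nQuestion: {question}\n"
        "Write a step-by-step rationale citing only the context."
    )
    # Stage 2: verify the rationale against the context; refine if unsupported.
    verdict = call_llm(
        f"Context:\n{context}\n\nRationale:\n{rationale}\n"
        "Is every claim supported by the context? Answer YES or NO."
    )
    if verdict.strip().upper().startswith("NO"):
        rationale = call_llm(
            f"Context:\n{context}\n\nRationale:\n{rationale}\n"
            "Rewrite the rationale, keeping only claims supported by the context."
        )
    # Stage 3: answer using the verified rationale as a supporting reference.
    answer = call_llm(
        f"Context:\n{context}\n\nRationale:\n{rationale}\n"
        f"Question: {question}\nAnswer using only the rationale and context."
    )
    return {"rationale": rationale, "answer": answer}
```

A call such as answer_with_rationale("What is the maximum daily dose of drug X?", retrieved_context) would return both the verified rationale and the grounded answer, which is how the framework exposes how the model arrived at its response.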
Related papers
- Rephrase and Contrast: Fine-Tuning Language Models for Enhanced Understanding of Communication and Computer Networks
This paper introduces Rephrase and Contrast (RaC), an efficient fine-tuning framework.
RaC enhances LLMs' comprehension and critical thinking abilities by incorporating question reformulation and contrastive analysis.
To construct the RaC fine-tuning dataset efficiently, we develop a GPT-assisted data mining method for generating high-quality question-answer pairs (a minimal sketch follows this entry).
arXiv Detail & Related papers (2024-09-21T16:04:43Z)
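The data-mining step referenced above might look like the following hedged sketch, which rephrases a seed question and pairs the gold answer with a contrastive negative. It reuses the hypothetical `call_llm` helper from the first sketch and is not the paper's actual method.

```python
def mine_rac_pair(question: str, gold_answer: str) -> dict:
    # Rephrase the seed question without changing its meaning.
    rephrased = call_llm(
        f"Rephrase this question without changing its meaning: {question}"
    )
    # Generate a plausible but wrong answer for contrastive analysis.
    wrong = call_llm(
        f"Question: {question}\nCorrect answer: {gold_answer}\n"
        "Write a plausible but incorrect answer."
    )
    return {
        "question": rephrased,
        "chosen": gold_answer,  # the answer fine-tuning should prefer
        "rejected": wrong,      # the contrastive negative
    }
```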
- Graph Retrieval Augmented Trustworthiness Reasoning
We introduce the Graph Retrieval Augmented Reasoning (GRATR) framework to bolster trustworthiness reasoning in agents.
GRATR constructs a dynamic trustworthiness graph, updating it in real-time with evidential information.
Our results demonstrate GRATR surpasses the baseline methods by over 30% in winning rate, with superior reasoning performance.
arXiv Detail & Related papers (2024-08-22T12:21:22Z)
- Improving Retrieval Augmented Language Model with Self-Reasoning
We propose a novel self-reasoning framework aimed at improving the reliability and traceability of RALMs.
The framework involves constructing self-reason trajectories with three processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process.
We evaluate our framework across four public datasets to demonstrate the superiority of our method (a sketch of the three processes follows this entry).
arXiv Detail & Related papers (2024-07-29T09:05:10Z)
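A hedged sketch of the three self-reasoning processes named above, again reusing the hypothetical `call_llm` helper; the prompts are assumptions, not the paper's trajectories.

```python
def self_reasoning_answer(question: str, documents: list[str]) -> str:
    # 1. Relevance-aware process: keep only documents judged relevant.
    relevant = [
        d for d in documents
        if call_llm(
            f"Question: {question}\nDocument: {d}\nRelevant? Answer YES or NO."
        ).strip().upper().startswith("YES")
    ]
    # 2. Evidence-aware selective process: quote the key sentence of each.
    evidence = [
        call_llm(f"Question: {question}\nDocument: {d}\nQuote the key sentence.")
        for d in relevant
    ]
    # 3. Trajectory analysis: consolidate the reasoning trace into an answer.
    trace = "\n".join(evidence)
    return call_llm(
        f"Evidence:\n{trace}\n\nQuestion: {question}\nAnswer concisely, citing the evidence."
    )
```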
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study
Large language models (LLMs) produce code that is shorter yet more complicated compared with canonical solutions.
We develop a bug taxonomy for incorrect code with three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback (a minimal sketch follows this entry).
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
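The training-free self-critique loop referenced above could be approximated as follows, using Python's `py_compile` as a stand-in for compiler feedback. The prompts and loop structure are assumptions, and `call_llm` is the hypothetical helper from the first sketch.

```python
import os
import subprocess
import tempfile

def critique_and_fix(task: str, max_rounds: int = 3) -> str:
    code = call_llm(f"Write a Python script that {task}. Return only code.")
    for _ in range(max_rounds):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        # py_compile provides syntax-level "compiler" feedback for Python.
        result = subprocess.run(
            ["python", "-m", "py_compile", path], capture_output=True, text=True
        )
        os.unlink(path)
        if result.returncode == 0:
            break  # compiles cleanly; stop iterating
        # Feed the compiler errors back so the model can self-critique.
        code = call_llm(
            f"Task: {task}\nCode:\n{code}\nCompiler feedback:\n{result.stderr}\n"
            "Critique the code and return a corrected version. Return only code."
        )
    return code
```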
- Fine-Grained Self-Endorsement Improves Factuality and Reasoning
This work studies improving large language model (LLM) generations at inference time by mitigating fact-conflicting hallucinations.
We propose a self-endorsement framework that leverages fine-grained fact-level comparisons across multiple sampled responses (a minimal sketch follows this entry).
arXiv Detail & Related papers (2024-02-23T22:24:40Z)
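The fact-level comparison referenced above might be sketched as a majority vote over atomic facts extracted from several sampled responses. This reuses the hypothetical `call_llm` helper and is not the paper's implementation.

```python
from collections import Counter

def self_endorse(question: str, n_samples: int = 5) -> str:
    # In practice, sample with temperature > 0 so the responses differ.
    samples = [call_llm(f"Answer the question: {question}") for _ in range(n_samples)]
    facts = Counter()
    for s in samples:
        extracted = call_llm(f"List the atomic facts in:\n{s}\nOne per line.")
        facts.update(line.strip() for line in extracted.splitlines() if line.strip())
    # Keep only facts endorsed by a majority of the sampled responses.
    endorsed = [f for f, c in facts.items() if c > n_samples // 2]
    return call_llm(
        "Write an answer to the question using only these endorsed facts:\n"
        + "\n".join(endorsed) + f"\nQuestion: {question}"
    )
```

The design intuition is that facts appearing consistently across independent samples are less likely to be hallucinated than facts appearing in only one.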
- Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately
Large Language Models (LLMs) generate responses to questions, but their effectiveness is often hindered by sub-optimal answer quality and occasional failures to respond accurately.
To address these challenges, a fine-tuning process involving feedback and examples is employed to refine the models.
arXiv Detail & Related papers (2024-01-27T00:18:07Z)
- Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-based Retrofitting
This paper proposes Knowledge Graph-based Retrofitting (KGR) to mitigate factual hallucination during the reasoning process.
Experiments show that KGR can significantly improve the performance of LLMs on factual QA benchmarks.
arXiv Detail & Related papers (2023-11-22T11:08:38Z)
- Towards Reliable and Fluent Large Language Models: Incorporating Feedback Learning Loops in QA Systems
We build a dataset to train a critic model capable of evaluating the citation, correctness, and fluency of responses generated by large language models.
We propose an automated feedback mechanism that leverages the critic model to offer real-time feedback on heterogeneous aspects of generated text (a minimal sketch follows this entry).
Experimental results demonstrate the efficacy of our approach, including a 4% precision increase in citation and an approximately 8% enhancement in the MAUVE metric for fluency.
arXiv Detail & Related papers (2023-09-08T09:39:53Z)
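The feedback mechanism referenced above could be sketched as a critic scoring pass followed by a conditional revision. Here the critic is approximated by an LLM call via the hypothetical `call_llm` helper; the paper trains a dedicated critic model, so this is an assumption, not its method.

```python
def criticize(answer: str, context: str) -> dict:
    # Ask for scores on the three aspects the paper's critic evaluates.
    raw = call_llm(
        f"Context:\n{context}\nAnswer:\n{answer}\n"
        "Rate citation, correctness, and fluency from 1-5 as "
        "'citation=N correctness=N fluency=N'."
    )
    # Brittle parsing, acceptable for a sketch.
    return {k: int(v) for k, v in (p.split("=") for p in raw.split())}

def answer_with_feedback(question: str, context: str) -> str:
    answer = call_llm(
        f"Context:\n{context}\nQuestion: {question}\nAnswer with citations."
    )
    scores = criticize(answer, context)
    if min(scores.values()) < 4:  # revise when any aspect scores poorly
        answer = call_llm(
            f"Context:\n{context}\nDraft answer:\n{answer}\n"
            f"Critic scores: {scores}\nRevise to improve the weakest aspects."
        )
    return answer
```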
- Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty
The Open Information Extraction (OIE) task aims to extract structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)
- Benchmarking Large Language Models in Retrieval-Augmented Generation
We systematically investigate the impact of Retrieval-Augmented Generation on large language models.
We analyze the performance of different large language models in 4 fundamental abilities required for RAG.
We establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese.
arXiv Detail & Related papers (2023-09-04T08:28:44Z)
- Prompting GPT-3 To Be Reliable
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.