GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence
- URL: http://arxiv.org/abs/2402.12566v2
- Date: Sat, 16 Mar 2024 21:14:16 GMT
- Title: GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence
- Authors: Kundan Krishna, Sanjana Ramprasad, Prakhar Gupta, Byron C. Wallace, Zachary C. Lipton, Jeffrey P. Bigham
- Abstract summary: We present GenAudit -- a tool intended to assist in fact-checking LLM responses for document-grounded tasks.
We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users.
To ensure that most errors are flagged by the system, we propose a method that increases error recall while minimizing the impact on precision.
- Score: 64.95492752484171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLMs can generate factually incorrect statements even when provided access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit -- a tool intended to assist in fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and also presents evidence from the reference for facts that do appear to have support. We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users. Comprehensive evaluation by human raters shows that GenAudit can detect errors in the outputs of 8 different LLMs when summarizing documents from diverse domains. To ensure that most errors are flagged by the system, we propose a method that increases error recall while minimizing the impact on precision. We release our tool (GenAudit) and fact-checking model for public use.
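The abstract does not spell out the recall-boosting method; as a rough illustration, one common approach is to tune the decision threshold on a fact-checker's per-claim support scores. A minimal sketch, assuming a hypothetical model exposing a `support_probability(claim, reference)` score in [0, 1] (not GenAudit's actual API):

```python
# Minimal sketch: trading precision for error recall via the decision
# threshold on per-claim support scores. The `model.support_probability`
# interface is a hypothetical stand-in, not GenAudit's actual API.

def flag_unsupported_claims(claims, reference, model, threshold=0.5):
    """Flag claims whose estimated support falls below `threshold`.

    Raising `threshold` flags more claims, increasing error recall at
    some cost to precision; lowering it does the opposite.
    """
    flagged = []
    for claim in claims:
        p_supported = model.support_probability(claim, reference)  # in [0, 1]
        if p_supported < threshold:
            flagged.append((claim, p_supported))
    return flagged
```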
Related papers
- RETAIN: Interactive Tool for Regression Testing Guided LLM Migration [8.378294455013284]
RETAIN (REgression Testing guided LLM migrAtIoN) is a tool designed explicitly for regression testing during LLM migrations.
Our automatic evaluation and empirical user studies demonstrate that, compared to manual evaluation, RETAIN enabled participants to identify twice as many errors, experiment with 75% more prompts, and achieve 12% higher metric scores in a given time frame.
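As a generic illustration of the regression-testing idea (not RETAIN's implementation), one can run the same prompts through both model versions and diff the outputs for review; `old_model` and `new_model` are hypothetical callables returning strings:

```python
import difflib

def regression_diff(prompts, old_model, new_model):
    """Compare responses from two model versions on the same prompts and
    surface textual differences for human review.

    Illustrative of LLM regression testing in general, not RETAIN's
    actual pipeline; `old_model` and `new_model` are assumed callables.
    """
    reports = []
    for prompt in prompts:
        old_out = old_model(prompt)
        new_out = new_model(prompt)
        diff = list(difflib.unified_diff(old_out.splitlines(),
                                         new_out.splitlines(),
                                         lineterm=""))
        if diff:  # only report prompts whose behavior changed
            reports.append((prompt, diff))
    return reports
```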
arXiv Detail & Related papers (2024-09-05T22:22:57Z)
- CrAM: Credibility-Aware Attention Modification in LLMs for Combating Misinformation in RAG [50.030526904378256]
Retrieval-Augmented Generation (RAG) can alleviate hallucinations of Large Language Models (LLMs) by referencing external documents.
To address this issue, we explore the task of "credibility-aware RAG".
We introduce a plug-and-play method named Credibility-aware Attention Modification (CrAM).
CrAM identifies influential attention heads and adjusts their attention weights based on the credibility of the documents, thereby reducing the impact of low-credibility documents.
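As a rough illustration of the attention-modification idea (the head-selection step and CrAM's exact scaling rule are omitted here), one can downweight attention paid to tokens from low-credibility documents and renormalize:

```python
import torch

def credibility_scaled_attention(attn_weights: torch.Tensor,
                                 token_credibility: torch.Tensor) -> torch.Tensor:
    """Rescale attention over key tokens by per-token credibility.

    attn_weights: (..., seq_len) softmax-normalized attention weights.
    token_credibility: (seq_len,) scores in [0, 1]; tokens drawn from
    low-credibility documents get values near 0.

    Illustrative sketch only; CrAM additionally restricts the
    modification to influential attention heads it identifies.
    """
    scaled = attn_weights * token_credibility  # suppress dubious tokens
    # Renormalize each attention row to sum to 1 (guard against all-zero rows).
    return scaled / scaled.sum(dim=-1, keepdim=True).clamp_min(1e-9)
```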
arXiv Detail & Related papers (2024-06-17T13:01:12Z)
- FactCheck Editor: Multilingual Text Editor with End-to-End fact-checking [1.985242455423935]
'FactCheck Editor' is an advanced text editor designed to automate fact-checking and correct factual inaccuracies.
It supports over 90 languages and utilizes transformer models to assist humans in the labor-intensive process of fact verification.
arXiv Detail & Related papers (2024-04-30T11:55:20Z)
- Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language model (LLM)-generated responses.
We construct an open-domain document-level factuality benchmark with three levels of granularity: claim, sentence, and document.
Preliminary experiments show that FacTool, FactScore, and Perplexity struggle to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z)
- LLatrieval: LLM-Verified Retrieval for Verifiable Generation [67.93134176912477]
Verifiable generation aims to have the large language model (LLM) generate text accompanied by supporting documents.
We propose LLatrieval (Large Language Model Verified Retrieval), where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question.
Experiments show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results.
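In outline, the verify-and-update retrieval loop might look like the following sketch; `retrieve`, `verify_support`, and `update_query` are hypothetical stand-ins for the paper's components:

```python
def llm_verified_retrieval(question, retrieve, verify_support, update_query,
                           max_rounds=3):
    """Iteratively retrieve until the LLM verifies the documents suffice.

    Hypothetical interfaces, simplified from the paper:
      retrieve(query) -> list of documents
      verify_support(question, docs) -> bool (LLM judgment of sufficiency)
      update_query(question, docs) -> str (refined query when docs fall short)
    """
    query = question
    docs = retrieve(query)
    for _ in range(max_rounds):
        if verify_support(question, docs):
            break  # the LLM judges the evidence sufficient
        query = update_query(question, docs)  # refine and retry
        docs = retrieve(query)
    return docs
```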
arXiv Detail & Related papers (2023-11-14T01:38:02Z)
- Knowledge-Augmented Language Model Verification [68.6099592486075]
Recent Language Models (LMs) have shown impressive capabilities in generating text using the knowledge internalized in their parameters.
We propose to verify both the retrieved knowledge and the output of knowledge-augmented LMs with a separate verifier.
Our results show that the proposed verifier effectively identifies retrieval and generation errors, allowing LMs to provide more factually correct outputs.
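A minimal sketch of the verify-then-correct loop, with a hypothetical verifier interface that distinguishes retrieval errors from generation errors as the abstract describes:

```python
def generate_with_verification(question, retriever, generator, verifier,
                               max_attempts=2):
    """Generate an answer, have a separate verifier check it, retry on errors.

    Hypothetical interface, simplified from the paper:
      verifier(question, docs, answer) ->
          "ok" | "retrieval_error" | "generation_error"
    """
    docs = retriever(question)
    answer = generator(question, docs)
    for _ in range(max_attempts):
        verdict = verifier(question, docs, answer)
        if verdict == "ok":
            return answer
        if verdict == "retrieval_error":
            # A real system would reformulate the query or widen retrieval;
            # this sketch simply retrieves again.
            docs = retriever(question)
        answer = generator(question, docs)  # regenerate in either error case
    return answer
```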
arXiv Detail & Related papers (2023-10-19T15:40:00Z)
- Beyond the Chat: Executable and Verifiable Text-Editing with LLMs [87.84199761550634]
Conversational interfaces powered by Large Language Models (LLMs) have recently become a popular way to obtain feedback during document editing.
We present InkSync, an editing interface that suggests executable edits directly within the document being edited.
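One plausible way to represent an "executable edit" is as a span replacement that can be applied to the document with a single action; this dataclass is purely illustrative, not InkSync's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ExecutableEdit:
    """A span-level suggestion that can be applied with one click.

    Illustrative only; InkSync's actual edit representation may differ.
    """
    start: int        # character offset where the edit begins
    end: int          # character offset where the edit ends (exclusive)
    replacement: str  # text to substitute for document[start:end]
    rationale: str    # why the edit is suggested, shown to the user

    def apply(self, document: str) -> str:
        """Execute the edit against the document text."""
        return document[:self.start] + self.replacement + document[self.end:]
```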
arXiv Detail & Related papers (2023-09-27T00:56:17Z)
- Factual Error Correction of Claims [18.52583883901634]
This paper introduces the task of factual error correction, which provides a mechanism for correcting written texts that contain misinformation.
Corrections also act as an inherent explanation for claims that are already partially supported by evidence.
arXiv Detail & Related papers (2020-12-31T18:11:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.