SIFiD: Reassess Summary Factual Inconsistency Detection with LLM
- URL: http://arxiv.org/abs/2403.07557v1
- Date: Tue, 12 Mar 2024 11:41:51 GMT
- Title: SIFiD: Reassess Summary Factual Inconsistency Detection with LLM
- Authors: Jiuding Yang, Hui Liu, Weidong Guo, Zhuwei Rao, Yu Xu, Di Niu
- Abstract summary: This study reassesses summary inconsistency detection with Large Language Models (LLMs).
We propose SIFiD (Summary Inconsistency Detection with Filtered Document), which identifies key sentences within documents either by employing natural language inference or by measuring semantic similarity between summaries and documents.
- Score: 27.392514180175283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensuring factual consistency between the summary and the original document is
paramount in summarization tasks. Consequently, considerable effort has been
dedicated to detecting inconsistencies. With the advent of Large Language
Models (LLMs), recent studies have begun to leverage their advanced language
understanding capabilities for inconsistency detection. However, early attempts
have shown that LLMs underperform traditional models due to their limited
ability to follow instructions and the absence of an effective detection
methodology. In this study, we reassess summary inconsistency detection with
LLMs, comparing the performances of GPT-3.5 and GPT-4. To advance research in
LLM-based inconsistency detection, we propose SIFiD (Summary Inconsistency
Detection with Filtered Document), which identifies key sentences within documents
by either employing natural language inference or measuring semantic similarity
between summaries and documents.
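As a rough illustration of the filtering idea described in the abstract, the sketch below selects document sentences by semantic similarity to the summary and assembles a consistency-check prompt. The embedding model, the top-k selection rule, and the prompt wording are illustrative assumptions rather than the paper's actual configuration; the NLI variant would instead score document sentences with an entailment model.

```python
# Minimal sketch of SIFiD-style document filtering before LLM-based
# inconsistency detection. The similarity model, top-k rule, and prompt
# wording below are assumptions for illustration; the abstract only states
# that key sentences are selected via NLI or semantic similarity.
from sentence_transformers import SentenceTransformer, util

def filter_document(document_sents, summary_sents, top_k=5,
                    model_name="all-MiniLM-L6-v2"):
    """Keep the document sentences most similar to any summary sentence."""
    model = SentenceTransformer(model_name)
    doc_emb = model.encode(document_sents, convert_to_tensor=True)
    sum_emb = model.encode(summary_sents, convert_to_tensor=True)
    # For each document sentence, take its maximum similarity over all summary sentences.
    sims = util.cos_sim(doc_emb, sum_emb).max(dim=1).values
    k = min(top_k, len(document_sents))
    top_idx = sims.topk(k).indices.tolist()
    # Preserve the original document order among the selected sentences.
    return [document_sents[i] for i in sorted(top_idx)]

def build_prompt(filtered_sents, summary):
    """Assemble a yes/no consistency-check prompt (placeholder wording)."""
    excerpts = "\n".join(filtered_sents)
    return (
        "Document excerpts:\n" + excerpts + "\n\n"
        "Summary:\n" + summary + "\n\n"
        "Is the summary factually consistent with the excerpts? Answer Yes or No."
    )
```

The resulting prompt would then be sent to GPT-3.5 or GPT-4, the two models the paper compares, for a consistency judgment.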
Related papers
- Large Language Models for Anomaly Detection in Computational Workflows: from Supervised Fine-Tuning to In-Context Learning [9.601067780210006]
This paper leverages large language models (LLMs) for workflow anomaly detection by exploiting their ability to learn complex data patterns.
Two approaches are investigated: 1) supervised fine-tuning (SFT), where pre-trained LLMs are fine-tuned on labeled data for sentence classification to identify anomalies, and 2) in-context learning (ICL) where prompts containing task descriptions and examples guide LLMs in few-shot anomaly detection without fine-tuning.
arXiv Detail & Related papers (2024-07-24T16:33:04Z) - Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends [38.86240794422485]
We evaluate the faithfulness of large language models for dialogue summarization.
Our evaluation reveals subtleties as to what constitutes a hallucination.
We introduce two prompt-based approaches for fine-grained error detection that outperform existing metrics.
arXiv Detail & Related papers (2024-06-05T17:49:47Z) - Identifying Factual Inconsistencies in Summaries: Grounding LLM Inference via Task Taxonomy [48.29181662640212]
Factual inconsistencies pose a significant hurdle to faithful summarization by generative models.
We consolidate key error types of inconsistent facts in summaries, and incorporate them to facilitate both the zero-shot and supervised paradigms of LLMs.
arXiv Detail & Related papers (2024-02-20T08:41:23Z) - Improving Factual Consistency of Text Summarization by Adversarially Decoupling Comprehension and Embellishment Abilities of LLMs [67.56087611675606]
Large language models (LLMs) generate summaries that are factually inconsistent with original articles.
These hallucinations are challenging to detect through traditional methods.
We propose DECENT, an adversarial DEcoupling method that disentangles the comprehension and embellishment abilities of LLMs.
arXiv Detail & Related papers (2023-10-30T08:40:16Z) - SeqXGPT: Sentence-Level AI-Generated Text Detection [62.3792779440284]
We introduce a sentence-level detection challenge by synthesizing documents polished with large language models (LLMs).
We then propose Sequence X (Check) GPT (SeqXGPT), a novel method that utilizes log probability lists from white-box LLMs as features for sentence-level AIGT detection.
arXiv Detail & Related papers (2023-10-13T07:18:53Z) - Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
The Open Information Extraction (OIE) task aims to extract structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z) - Semantic Consistency for Assuring Reliability of Large Language Models [9.876355290198639]
Large Language Models (LLMs) exhibit remarkable fluency and competence across various natural language tasks.
We introduce a general measure of semantic consistency, and formulate multiple versions of this metric to evaluate the performance of various LLMs.
We propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance semantic consistency.
arXiv Detail & Related papers (2023-08-17T18:11:33Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z) - Evaluating Factual Consistency of Summaries with Large Language Models [24.416837319515896]
We explore evaluating the factual consistency of summaries by directly prompting large language models (LLMs).
Our experiments demonstrate that prompting LLMs is able to outperform the previous best factuality systems in all settings.
arXiv Detail & Related papers (2023-05-23T13:48:32Z) - Evaluating the Factual Consistency of Large Language Models Through News Summarization [97.04685401448499]
We propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization.
For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent.
For factually inconsistent summaries, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent.
arXiv Detail & Related papers (2022-11-15T18:50:34Z)