A Claim Decomposition Benchmark for Long-form Answer Verification
- URL: http://arxiv.org/abs/2410.12558v1
- Date: Wed, 16 Oct 2024 13:34:51 GMT
- Title: A Claim Decomposition Benchmark for Long-form Answer Verification
- Authors: Zhihao Zhang, Yixing Fan, Ruqing Zhang, Jiafeng Guo,
- Abstract summary: We introduce a new claim decomposition benchmark, which requires building systems that can identify atomic and check-worthy claims in LLM responses.
The CACDD contains 500 human-annotated question-answer pairs with a total of 4,956 atomic claims.
Results show that claim decomposition is highly challenging and requires further exploration.
- Abstract: The advancement of LLMs has significantly boosted performance on complex long-form question answering tasks. However, one prominent issue with LLMs is that they generate "hallucinated" responses that are not factual. Consequently, attributing each claim in a response has become a common way to improve factuality and verifiability. Existing research mainly focuses on how to provide accurate citations for the response, and largely overlooks the importance of identifying the claims or statements within each response. To bridge this gap, we introduce a new claim decomposition benchmark, which requires building systems that can identify atomic and check-worthy claims in LLM responses. Specifically, we present the Chinese Atomic Claim Decomposition Dataset (CACDD), which builds on the WebCPM dataset with additional expert annotations to ensure high data quality. The CACDD comprises 500 human-annotated question-answer pairs with a total of 4,956 atomic claims. We further propose a new pipeline for human annotation and describe the challenges of this task. In addition, we provide experimental results on zero-shot, few-shot, and fine-tuned LLMs as baselines. The results show that claim decomposition is highly challenging and requires further exploration. All code and data are publicly available at \url{https://github.com/FBzzh/CACDD}.
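The abstract frames claim decomposition as splitting a long-form answer into atomic, check-worthy claims, with zero-shot prompting as one of the baselines. Below is a minimal sketch of such a zero-shot baseline, assuming an OpenAI-compatible chat API; the prompt wording, model name, and output parsing are illustrative assumptions, not the paper's actual pipeline (see the CACDD repository for that).

```python
# Minimal zero-shot claim-decomposition sketch (not the authors' pipeline).
# The prompt text, model name, and line-based parsing are assumptions made
# for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Decompose the answer below into atomic, check-worthy claims.\n"
    "Each claim must be a single, self-contained factual statement.\n"
    "Return one claim per line.\n\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Claims:"
)

def decompose(question: str, answer: str, model: str = "gpt-4o-mini") -> list[str]:
    """Ask the model for atomic claims and return them as a list of strings."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(question=question, answer=answer)}],
        temperature=0.0,
    )
    text = response.choices[0].message.content or ""
    # Keep non-empty lines, stripping common list markers such as "1." or "-".
    return [line.lstrip("-*0123456789. ").strip()
            for line in text.splitlines() if line.strip()]

if __name__ == "__main__":
    claims = decompose(
        "Why is the sky blue?",
        "The sky appears blue because shorter wavelengths of sunlight are "
        "scattered more strongly by air molecules, an effect known as "
        "Rayleigh scattering.",
    )
    for claim in claims:
        print(claim)
```

The claims returned by such a sketch would then be the units passed to downstream verification or attribution, which is the setting the benchmark targets.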
Related papers
- Atomic Fact Decomposition Helps Attributed Question Answering [30.75332718824254]
Attributed Question Answering (AQA) aims to provide both a trustworthy answer and a reliable attribution report for a question.
This paper proposes an Atomic fact decomposition-based Retrieval and Editing framework.
It decomposes generated long-form answers into molecular clauses and atomic facts using instruction-tuned LLMs.
arXiv Detail & Related papers (2024-10-22T05:25:54Z) - Localizing and Mitigating Errors in Long-form Question Answering [79.63372684264921]
Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension.
This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers.
arXiv Detail & Related papers (2024-07-16T17:23:16Z) - Attribute or Abstain: Large Language Models as Long Document Assistants [58.32043134560244]
LLMs can help humans working with long documents, but are known to hallucinate.
Existing approaches to attribution have only been evaluated in RAG settings, where the initial retrieval confounds LLM performance.
This is crucially different from the long document setting, where retrieval is not needed, but could help.
We present LAB, a benchmark of 6 diverse long document tasks with attribution, and experiments with different approaches to attribution on 5 LLMs of different sizes.
arXiv Detail & Related papers (2024-07-10T16:16:02Z) - SuRe: Summarizing Retrievals using Answer Candidates for Open-domain QA of LLMs [85.54906813106683]
We propose SuRe, a simple yet effective framework to enhance open-domain question answering (ODQA) with large language models (LLMs).
SuRe helps LLMs predict more accurate answers for a given question, well supported by the summarized retrievals.
Experimental results on diverse ODQA benchmarks demonstrate the superiority of SuRe, with improvements of up to 4.6% in exact match (EM) and 4.0% in F1 score over standard prompting approaches.
arXiv Detail & Related papers (2024-04-17T01:15:54Z) - CuriousLLM: Elevating Multi-Document Question Answering with LLM-Enhanced Knowledge Graph Reasoning [0.9295048974480845]
We propose CuriousLLM, an enhancement that integrates a curiosity-driven reasoning mechanism into an LLM agent.
This mechanism enables the agent to generate relevant follow-up questions, thereby guiding the information retrieval process more efficiently.
Our experiments show that CuriousLLM significantly boosts LLM performance in multi-document question answering (MD-QA).
arXiv Detail & Related papers (2024-04-13T20:43:46Z) - From Chaos to Clarity: Claim Normalization to Empower Fact-Checking [57.024192702939736]
Claim Normalization (aka ClaimNorm) aims to decompose complex and noisy social media posts into more straightforward and understandable forms.
We propose CACN, a pioneering approach that leverages chain-of-thought and claim check-worthiness estimation.
Our experiments demonstrate that CACN outperforms several baselines across various evaluation measures.
arXiv Detail & Related papers (2023-10-22T16:07:06Z) - WiCE: Real-World Entailment for Claims in Wikipedia [63.234352061821625]
We propose WiCE, a new fine-grained textual entailment dataset built on natural claim and evidence pairs extracted from Wikipedia.
In addition to standard claim-level entailment, WiCE provides entailment judgments over sub-sentence units of the claim.
We show that real claims in our dataset involve challenging verification and retrieval problems that existing models fail to address.
arXiv Detail & Related papers (2023-03-02T17:45:32Z)