Related papers: Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs

Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs

URL: http://arxiv.org/abs/2401.14640v1
Date: Fri, 26 Jan 2024 04:11:07 GMT
Title: Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs
Authors: Nan Hu, Jiaoyan Chen, Yike Wu, Guilin Qi, Sheng Bi, Tongtong Wu and Jeff Z. Pan
Abstract summary: We introduce a set of fine-grained categories for measuring the attribution, and develop a Complex Attributed Question Answering (CAQA) benchmark. Our analysis reveals that existing evaluators perform poorly under fine-grained attribution settings and exhibit weaknesses in complex citation-statement reasoning.
Score: 35.089203283068635
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The attribution of question answering is to provide citations for supporting generated statements, and has attracted wide research attention. The current methods for automatically evaluating the attribution, which are often based on Large Language Models (LLMs), are still inadequate, particularly in recognizing subtle differences between attributions, and complex relationships between citations and statements. To compare these attribution evaluation methods and develop new ones, we introduce a set of fine-grained categories (i.e., supportive, insufficient, contradictory and irrelevant) for measuring the attribution, and develop a Complex Attributed Question Answering (CAQA) benchmark by leveraging knowledge graphs (KGs) for automatically generating attributions of different categories to question-answer pairs. Our analysis reveals that existing evaluators perform poorly under fine-grained attribution settings and exhibit weaknesses in complex citation-statement reasoning. Our CAQA benchmark, validated with human annotations, emerges as a promising tool for selecting and developing LLM attribution evaluators.

Related papers

Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking [56.27361644734853]
Knowledge Graph Question Answering systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning.<n>Despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues.<n>We introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls.<n>Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
arXiv Detail & Related papers (2025-05-29T14:44:52Z)
LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to diverse potential answers and no objective criterion. Large Language Models (LLMs) have been resorted to for NFQA evaluation due to their compelling performance on various NLP tasks. We propose a novel listwise NFQA evaluation approach, that utilizes LLMs to rank candidate answers in a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z)
Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making [1.3812010983144802]
This study evaluates large language models (LLMs) across diverse domains, including cybersecurity, medicine, and finance. The results indicate that model size and types of prompts used for inference significantly influenced response length and quality.
arXiv Detail & Related papers (2024-06-25T20:52:31Z)
DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs [3.24692739098077]
Open-domain complex Question Answering (QA) is a difficult task with challenges in evidence retrieval and reasoning. We evaluate state-of-the-art pre-trained dense and sparse retrieval models in an open-domain setting. We observe that late interaction models and surprisingly lexical models like BM25 perform well compared to other pre-trained dense retrieval models.
arXiv Detail & Related papers (2024-06-24T22:09:50Z)
Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation [8.975024781390077]
We present MIRAGE --Model Internals-based RAG Explanations -- a plug-and-play approach using model internals for faithful answer attribution in question answering applications. We evaluate our proposed approach on a multilingual QA dataset, finding high agreement with human answer attribution.
arXiv Detail & Related papers (2024-06-19T16:10:26Z)
Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence [10.367359022491181]
We provide rubrics and datasets for evaluating machine QA adopted from the Trivia community. We also propose an efficient, and interpretable QA evaluation that is more stable than an exact match and neural methods(BERTScore)
arXiv Detail & Related papers (2024-02-17T01:56:19Z)
HGOT: Hierarchical Graph of Thoughts for Retrieval-Augmented In-Context Learning in Factuality Evaluation [20.178644251662316]
We introduce the hierarchical graph of thoughts (HGOT) to enhance the retrieval of pertinent passages during in-context learning. The framework employs the divide-and-conquer strategy to break down complex queries into manageable sub-queries. It refines self-consistency majority voting for answer selection, which incorporates the recently proposed citation recall and precision metrics.
arXiv Detail & Related papers (2024-02-14T18:41:19Z)
PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing longtext generation. It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers. It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering [14.366087533102656]
Question answering (QA) can only make progress if we know if an answer is correct. Current evaluation metrics to determine answer equivalence (AE) often do not align with human judgments.
arXiv Detail & Related papers (2024-01-24T01:30:25Z)
SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation) We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
Automatic Evaluation of Attribution by Large Language Models [24.443271739599194]
We investigate the automatic evaluation of attribution given by large language models (LLMs) We begin by defining different types of attribution errors, and then explore two approaches for automatic evaluation. We manually curate a set of test examples covering 12 domains from a generative search engine, New Bing.
arXiv Detail & Related papers (2023-05-10T16:58:33Z)
Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts. Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA) First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named as RefQA) Second, we take advantage of the QA model to extract more appropriate answers, which iteratively refines data over RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.