Related papers: Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation

Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation

URL: http://arxiv.org/abs/2510.07926v1
Date: Thu, 09 Oct 2025 08:22:24 GMT
Title: Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation
Authors: Adam Dejl, James Barry, Alessandra Pascale, Javier Carnerero Cano,
Abstract summary: Large language models (LLMs) have been found to produce outputs that are incomplete or selectively omit key information.<n>In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies.
Score: 46.697788643450785
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite demonstrating remarkable performance across a wide range of tasks, large language models (LLMs) have also been found to frequently produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies, including hallucinations. In this study, we address the challenge of evaluating the comprehensiveness of LLM-generated texts, focusing on the detection of missing information or underrepresented viewpoints. We investigate three automated evaluation strategies: (1) an NLI-based method that decomposes texts into atomic statements and uses natural language inference (NLI) to identify missing links, (2) a Q&A-based approach that extracts question-answer pairs and compares responses across sources, and (3) an end-to-end method that directly identifies missing content using LLMs. Our experiments demonstrate the surprising effectiveness of the simple end-to-end approach compared to more complex methods, though at the cost of reduced robustness, interpretability and result granularity. We further assess the comprehensiveness of responses from several popular open-weight LLMs when answering user queries based on multiple sources.

Related papers

ConCISE: A Reference-Free Conciseness Evaluation Metric for LLM-Generated Answers [0.3431096786139341]
We introduce a novel reference-free metric for evaluating the conciseness of responses generated by large language models.<n>Our method quantifies non-essential content without relying on gold standard references.
arXiv Detail & Related papers (2025-11-20T23:03:23Z)
What Works for 'Lost-in-the-Middle' in LLMs? A Study on GM-Extract and Mitigations [1.2879523047871226]
GM-Extract is a novel benchmark dataset meticulously designed to evaluate LLM performance on retrieval of control variables.<n>We conduct a systematic evaluation of 7-8B parameter models on two multi-document tasks (key-value extraction and question-answering)<n>While a distinct U-shaped curve was not consistently observed, our analysis reveals a clear pattern of performance across models.
arXiv Detail & Related papers (2025-11-17T20:50:50Z)
ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering [3.131352561462676]
Large Language Models (LLMs) have contributed to the development of AI systems capable of factual question-answering.<n>No known study tests the LLMs' robustness when presented with obfuscated versions of questions.<n>We introduce ObfusQA, a framework with multi-tiered obfuscation levels to examine LLM capabilities.
arXiv Detail & Related papers (2025-08-10T12:27:52Z)
Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
Multiple-Choice Question Answering (MCQA) is widely used to evaluate Large Language Models (LLMs)<n>We show that multiple factors can significantly impact the reported performance of LLMs.<n>We analyze whether existing answer extraction methods are aligned with human judgment.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text [12.879551933541345]
Large Language Models (LLMs) are capable of generating human-like conversations. Conventional metrics like BLEU and ROUGE are inadequate for capturing the subtle semantics and contextual richness of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs-as-judges.
arXiv Detail & Related papers (2024-08-17T16:01:45Z)
Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach [0.0]
Large Language Models (LLMs) produce inaccurate outputs, also known as hallucinations. This paper introduces a supervised learning approach employing only four numerical features derived from tokens and vocabulary probabilities obtained from other evaluators. The method yields promising results, surpassing state-of-the-art outcomes in multiple tasks across three different benchmarks.
arXiv Detail & Related papers (2024-05-30T03:00:47Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields. LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations. We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z)
DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge. Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases. We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets. Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text. Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.