An Extensive Evaluation of Factual Consistency in Large Language Models for Data-to-Text Generation
- URL: http://arxiv.org/abs/2411.19203v1
- Date: Thu, 28 Nov 2024 15:23:12 GMT
- Title: An Extensive Evaluation of Factual Consistency in Large Language Models for Data-to-Text Generation
- Authors: Joy Mahapatra, Utpal Garain
- Abstract summary: Large Language Models (LLMs) have shown exceptional performance across various Data-to-Text Generation (DTG) tasks.
However, generating factually consistent text in DTG remains challenging for LLMs.
This paper provides an extensive evaluation of factual consistency in LLMs for DTG.
- Score: 1.8876415010297893
- Abstract: Large Language Models (LLMs) have shown exceptional performance across various Data-to-Text Generation (DTG) tasks. However, generating factually consistent text in DTG remains challenging for LLMs. Despite this, in-depth evaluations of LLM factual consistency for DTG remain missing in the current literature. This paper addresses this gap by providing an extensive evaluation of factual consistency in LLMs for DTG. Our evaluation covers five widely used DTG datasets (E2E, ViGGo, WikiTableText, DART, and WebNLG) and five prominent LLM families (T5, BART, OPT, BLOOM, and Llama 2). To ensure a thorough evaluation of factual consistency, we use four state-of-the-art automatic metrics and include essential human assessments. Our extensive evaluation reveals three key findings regarding factual consistency in LLMs for DTG. First, Llama 2 often excels in generating factually consistent text, although smaller models like T5 and BART can achieve strong factual consistency on larger, lexically less diverse datasets. Second, the average rate of change (AROC) indicates that increasing model size (the number of trainable parameters) generally enhances the factual consistency of LLMs in DTG. Third, we observe that source-reference divergence (i.e., when the reference text diverges semantically from the source) typically reduces the factual consistency of LLMs in DTG.
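The second finding rests on the average rate of change of a consistency score with respect to model size. The paper does not spell out the computation here, but a minimal sketch, assuming AROC is the mean of the pairwise slopes between consecutive (size, score) points, looks like this; all function names and data values below are illustrative, not taken from the paper:

```python
def aroc(sizes, scores):
    """Average rate of change of a factual-consistency score per unit
    of model size (sizes in billions of trainable parameters).

    Computed as the mean slope between consecutive (size, score) points.
    """
    slopes = [
        (scores[i + 1] - scores[i]) / (sizes[i + 1] - sizes[i])
        for i in range(len(sizes) - 1)
    ]
    return sum(slopes) / len(slopes)

# Hypothetical consistency scores for one model family at four sizes
sizes = [0.5, 3.0, 7.0, 13.0]   # billions of parameters
scores = [0.71, 0.78, 0.82, 0.85]
print(aroc(sizes, scores))  # positive value: consistency improves with size
```

A positive AROC across a model family is what supports the claim that scaling up trainable parameters tends to improve factual consistency.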
Related papers
- Factual Inconsistency in Data-to-Text Generation Scales Exponentially with LLM Size: A Statistical Validation [1.6795461001108096]
This paper explores the impact of large language model (LLM) size on factual inconsistency in data-to-text generation (D2T).
We employ a statistical validation framework consisting of three key stages: predictive performance estimation, goodness-of-fit assessment, and comparative analysis.
For a comprehensive empirical study, we analyze three popular LLM families across five D2T datasets, measuring factual inconsistency inversely using four state-of-the-art consistency metrics.
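The goodness-of-fit stage amounts to testing whether inconsistency follows an exponential trend in model size, i.e. whether log(inconsistency) is linear in size. A minimal sketch of that fit, using a closed-form least-squares regression; the data points are hypothetical and the function name is mine, not the paper's:

```python
import math

def fit_exponential(sizes, inconsistency):
    """Least-squares fit of inconsistency ~ exp(a + b * size).

    Fits a straight line to (size, log(inconsistency)); the slope b
    characterizes the exponential trend (negative b = exponential decay).
    """
    ys = [math.log(y) for y in inconsistency]
    n = len(sizes)
    mx = sum(sizes) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(sizes, ys)) \
        / sum((x - mx) ** 2 for x in sizes)
    a = my - b * mx
    return a, b

sizes = [1.0, 3.0, 7.0, 13.0]             # billions of parameters
inconsistency = [0.30, 0.21, 0.12, 0.06]  # hypothetical inconsistency rates
a, b = fit_exponential(sizes, inconsistency)
print(b)  # negative slope: inconsistency shrinks roughly exponentially with size
```

Comparing the fit quality of this exponential form against alternatives (e.g. linear or power-law) is the kind of comparative analysis the three-stage framework describes.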
arXiv Detail & Related papers (2025-02-17T23:24:00Z) - Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs).
We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy.
We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - Impact of Model Size on Fine-tuned LLM Performance in Data-to-Text Generation: A State-of-the-Art Investigation [1.8876415010297893]
Data-to-text (D2T) generation aims to generate human-readable text from semi-structured data, such as tables and graphs.
No research has been conducted to illustrate the impact of model size on the performance of fine-tuned LLMs for D2T tasks.
We aim to elucidate both the advantages and limitations of scaling model sizes across five widely used D2T datasets.
arXiv Detail & Related papers (2024-07-19T07:54:30Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z) - TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We harness the power of large language models (LLMs) to solve this task.
We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z) - Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation [0.0]
We analyze the behaviors of open large language models (LLMs) on the task of data-to-text (D2T) generation.
We find that open LLMs can generate fluent and coherent texts in zero-shot settings from data in common formats collected with Quintd.
arXiv Detail & Related papers (2024-01-18T18:15:46Z) - Large Language Models Can Learn Temporal Reasoning [11.599570446840547]
We propose TG-LLM, a novel framework towards language-based temporal reasoning.
Instead of reasoning over the original context, we adopt a latent representation, the temporal graph (TG).
We construct a synthetic dataset (TGQA) that is fully controllable and requires minimal supervision.
arXiv Detail & Related papers (2024-01-12T19:00:26Z) - The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z) - BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs' Generation [60.77990074569754]
We present a computation-efficient framework that steers a frozen Pre-Trained Language Model towards more commonsensical generation.
Specifically, we first construct a reference-free evaluator that assigns a sentence with a commonsensical score.
We then use the scorer as the oracle for commonsense knowledge, and extend the controllable generation method called NADO to train an auxiliary head.
arXiv Detail & Related papers (2023-10-25T23:32:12Z) - Investigating Table-to-Text Generation Capabilities of LLMs in Real-World Information Seeking Scenarios [32.84523661055774]
Tabular data is prevalent across various industries, necessitating significant time and effort for users to understand and manipulate for their information-seeking purposes.
The adoption of large language models (LLMs) in real-world applications for table information seeking remains underexplored.
This paper investigates the table-to-text capabilities of different LLMs using four datasets within two real-world information seeking scenarios.
arXiv Detail & Related papers (2023-05-24T10:22:30Z)