Related papers: How well do LLMs reason over tabular data, really?

How well do LLMs reason over tabular data, really?

URL: http://arxiv.org/abs/2505.07453v2
Date: Mon, 02 Jun 2025 15:39:29 GMT
Title: How well do LLMs reason over tabular data, really?
Authors: Cornelius Wolff, Madelon Hulsebos,
Abstract summary: Large Language Models (LLMs) excel in natural language tasks, but less is known about their reasoning capabilities over tabular data.<n>We show that an LLM-as-a-judge procedure yields more reliable performance insights.<n>We then extend the tabular inputs reflecting three common characteristics in practice: missing values, duplicate entities, and structural variations.
Score: 2.5015086558362247
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) excel in natural language tasks, but less is known about their reasoning capabilities over tabular data. Prior analyses devise evaluation strategies that poorly reflect an LLM's realistic performance on tabular queries. Moreover, we have a limited understanding of the robustness of LLMs towards realistic variations in tabular inputs. Therefore, we ask: Can general-purpose LLMs reason over tabular data, really?, and focus on two questions 1) are tabular reasoning capabilities of general-purpose LLMs robust to real-world characteristics of tabular inputs, and 2) how can we realistically evaluate an LLM's performance on analytical tabular queries? Building on a recent tabular reasoning benchmark, we first surface shortcomings of its multiple-choice prompt evaluation strategy, as well as commonly used free-form text metrics such as SacreBleu and BERT-score. We show that an LLM-as-a-judge procedure yields more reliable performance insights and unveil a significant deficit in tabular reasoning performance of LLMs. We then extend the tabular inputs reflecting three common characteristics in practice: 1) missing values, 2) duplicate entities, and 3) structural variations. Experiments show that the tabular reasoning capabilities of general-purpose LLMs suffer from these variations, stressing the importance of improving their robustness for realistic tabular inputs.

Related papers

Large Language Models for Spreadsheets: Benchmarking Progress and Evaluating Performance with FLARE [0.0]
Large Language Models (LLMs) have demonstrated some significant capabilities across various domains.<n>This study introduces a benchmark framework to evaluate the performance of leading LLMs in executing spreadsheet functions.
arXiv Detail & Related papers (2025-06-19T03:47:38Z)
LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations.<n>This structured approach allows Large Language Models (LLMs) to generate and executesql queries, enhancing generalization and mitigating biases.
arXiv Detail & Related papers (2025-06-06T05:14:04Z)
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios [58.90106984375913]
RuleArena is a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning.<n> Covering three practical domains -- airline baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs' proficiency in handling intricate natural language instructions.
arXiv Detail & Related papers (2024-12-12T06:08:46Z)
Accurate and Regret-aware Numerical Problem Solver for Tabular Question Answering [29.384514074911955]
We propose a model named TabLaP that uses Large Language Models as a planner rather than an answer generator.<n>We show that TabLaP is substantially more accurate than the state-of-the-art models, improving the answer accuracy by 5.7% and 5.8% on the two datasets.
arXiv Detail & Related papers (2024-10-10T05:34:00Z)
Enhancing Temporal Understanding in LLMs for Semi-structured Tables [50.59009084277447]
We conduct a comprehensive analysis of temporal datasets to pinpoint the specific limitations of large language models (LLMs) Our investigation leads to enhancements in TempTabQA, a dataset specifically designed for temporal temporal question answering. We introduce a novel approach, C.L.E.A.R. to strengthen LLM capabilities in this domain.
arXiv Detail & Related papers (2024-07-22T20:13:10Z)
Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models [21.10890310571397]
Large Language Models (LLMs) can be applied to a diverse set of tasks, but the critical issues of data contamination and memorization are often glossed over.<n>This work introduces a variety of different techniques to assess whether a language model has seen a dataset during training.<n>We then compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training.
arXiv Detail & Related papers (2024-04-09T10:58:21Z)
Small Models are LLM Knowledge Triggers on Medical Tabular Prediction [39.78560996984352]
We propose SERSAL, a general self-prompting method by synergy learning with small models.<n>We show that SERSAL attains substantial improvement compared to linguistic prompting methods.
arXiv Detail & Related papers (2024-03-03T17:35:52Z)
A Survey of Table Reasoning with Large Language Models [55.2326738851157]
Using Large Language Models (LLMs) has become the mainstream method for table reasoning. We analyze the mainstream techniques used to improve table reasoning performance in the LLM era. We provide research directions from both the improvement of existing methods and the expansion of practical applications.
arXiv Detail & Related papers (2024-02-13T07:17:52Z)
Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks. How do we evaluate the capabilities of LLMs to consistently produce factually correct answers? We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs) As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z)
Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study [44.39031420687302]
Large language models (LLMs) are becoming attractive as few-shot reasoners to solve Natural Language (NL)-related tasks. We try to understand this by designing a benchmark to evaluate the structural understanding capabilities of LLMs. We propose $textitself-augmentation$ for effective structural prompting, such as critical value / range identification.
arXiv Detail & Related papers (2023-05-22T14:23:46Z)
Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis [103.89753784762445]
Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT) This paper systematically investigates the advantages and challenges of LLMs for MMT. We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4.
arXiv Detail & Related papers (2023-04-10T15:51:30Z)
Large Language Models are few(1)-shot Table Reasoners [31.036914270008978]
Large language models (LLMs) are generally excellent few-shot reasoners to solve text reasoning tasks. In this paper, we aim at understanding how well LLMs can perform on table tasks with few-shot in-context learning.
arXiv Detail & Related papers (2022-10-13T04:08:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.