Related papers: Benchmarking Table Comprehension In The Wild

Benchmarking Table Comprehension In The Wild

URL: http://arxiv.org/abs/2412.09884v1
Date: Fri, 13 Dec 2024 05:52:37 GMT
Title: Benchmarking Table Comprehension In The Wild
Authors: Yikang Pan, Yi Zhu, Rand Xie, Yizhi Liu,
Abstract summary: TableQuest is a new benchmark designed to evaluate the holistic table comprehension capabilities of Large Language Models (LLMs)<n>We experiment with 7 state-of-the-art models, and find that despite reasonable accuracy in locating facts, they often falter when required to execute more sophisticated reasoning or multi-step calculations.
Score: 9.224698222634789
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs), while being increasingly dominant on a myriad of knowledge-intensive activities, have only had limited success understanding lengthy table-text mixtures, such as academic papers and financial reports. Recent advances of long-context LLMs have opened up new possibilities for this field. Nonetheless, we identify two roadblocks: (1) Prior benchmarks of table question answering (TableQA) have focused on isolated tables without context, making it hard to evaluate models in real-world scenarios. (2) Prior benchmarks have focused on some narrow skill sets of table comprehension such as table recognition, data manipulation/calculation, table summarization etc., while a skilled human employs those skills collectively. In this work, we introduce TableQuest, a new benchmark designed to evaluate the holistic table comprehension capabilities of LLMs in the natural table-rich context of financial reports. We employ a rigorous data processing and filtering procedure to ensure that the question-answer pairs are logical, reasonable, and diverse. We experiment with 7 state-of-the-art models, and find that despite reasonable accuracy in locating facts, they often falter when required to execute more sophisticated reasoning or multi-step calculations. We conclude with a qualitative study of the failure modes and discuss the challenges of constructing a challenging benchmark. We make the evaluation data, judging procedure and results of this study publicly available to facilitate research in this field.

Related papers

AraTable: Benchmarking LLMs' Reasoning and Understanding of Arabic Tabular Data [2.9631016562930546]
We present AraTable, a benchmark designed to evaluate the reasoning and understanding capabilities of large language models when applied to Arabic data.<n>AraTable consists of various evaluation tasks, such as direct question answering, fact verification, and complex reasoning.<n>We propose a fully automated evaluation framework that uses a self-deliberation mechanism and achieves performance nearly identical to that of human judges.
arXiv Detail & Related papers (2025-07-24T14:26:41Z)
TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models [30.26407735827857]
Reasoning with table-structured data poses significant challenges for large language models (LLMs)<n>We present a comprehensive table reasoning evolution benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities.<n>We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes, TCoT, PoT and ICoT.
arXiv Detail & Related papers (2025-06-23T09:02:04Z)
MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark [70.47478110973042]
We introduce MMTU, a large-scale benchmark with over 30K questions across 25 real-world table tasks.<n> MMTU is designed to comprehensively evaluate models ability to understand, reason, and manipulate real tables at the expert-level.<n>We show that MMTU require a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models.
arXiv Detail & Related papers (2025-06-05T21:05:03Z)
TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering [18.173773939709733]
Existing TableQA benchmarks are often limited by their focus on simple flat tables and suffer from data leakage.<n>We introduce TableEval, a new benchmark designed to evaluate LLMs on realistic TableQA tasks.<n>To minimize the risk of data leakage, we collect all data from recent real-world documents.
arXiv Detail & Related papers (2025-06-04T13:39:01Z)
Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z)
NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables [32.9031799179503]
textscNeedleInATable (NIAT) treats each table cell as a needle'' and requires models to extract the target cell based on cell locations or lookup questions.<n>Our data, code and models will be released to facilitate future research.
arXiv Detail & Related papers (2025-04-09T03:46:56Z)
Interpretable LLM-based Table Question Answering [5.484058026469263]
Plan-of-s (POS) is an interpretable Table QA approach designed to improve users' understanding of model decision-making. We show that POS is the highest-quality explanation method, helps human users understand model behaviors, and facilitates model prediction verification. We observe high agreement (up to 90%) between LLMs and human users when making decisions based on the same explanations.
arXiv Detail & Related papers (2024-12-16T22:44:31Z)
TableRAG: Million-Token Table Understanding with Language Models [53.039560091592215]
TableRAG is a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. Our results demonstrate that TableRAG achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.
arXiv Detail & Related papers (2024-10-07T04:15:02Z)
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering [33.64465594140019]
This paper investigates the application of Large Language Models (LLMs) in industrial scenarios. We propose a comprehensive and complex benchmark TableBench, including 18 fields within four major categories of table question answering (TableQA) capabilities. Massive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands.
arXiv Detail & Related papers (2024-08-17T11:40:10Z)
On the Robustness of Language Models for Tabular Question Answering [7.486549276995143]
Large Language Models (LLMs) have been shown to tackle table comprehension tasks without specific training. We evaluate the robustness of LLMs on Wikipedia-based textbfWTQ, financial report-based textbfTAT-QA, and scientific claims-based textbfSCITAB, TQA datasets.
arXiv Detail & Related papers (2024-06-18T15:41:15Z)
Uncovering Limitations of Large Language Models in Information Seeking from Tables [28.19697259795014]
This paper introduces a more reliable benchmark for Table Information Seeking (TabIS) To avoid the unreliable evaluation caused by text similarity-based metrics, TabIS adopts a single-choice question format (with two options per question) instead of a text generation format.
arXiv Detail & Related papers (2024-06-06T14:30:59Z)
Wiki-TabNER:Advancing Table Interpretation Through Named Entity Recognition [19.423556742293762]
We analyse a widely used benchmark dataset for evaluation of TI tasks. To overcome this drawback, we construct and annotate a new more challenging dataset. We propose a prompting framework for evaluating the newly developed large language models.
arXiv Detail & Related papers (2024-03-07T15:22:07Z)
Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering [53.56653281752486]
This study explores Large Language Models' mathematical reasoning on four financial question-answering datasets. We focus on sensitivity to table complexity and performance variations with an increasing number of arithmetic reasoning steps. We introduce a novel prompting technique tailored to semi-structured documents, matching or outperforming other baselines in performance.
arXiv Detail & Related papers (2024-02-17T05:10:18Z)
A Survey of Table Reasoning with Large Language Models [55.2326738851157]
Using Large Language Models (LLMs) has become the mainstream method for table reasoning. We analyze the mainstream techniques used to improve table reasoning performance in the LLM era. We provide research directions from both the improvement of existing methods and the expansion of practical applications.
arXiv Detail & Related papers (2024-02-13T07:17:52Z)
Large Language Model for Table Processing: A Survey [18.32332372134988]
This survey provides a comprehensive overview of table-related tasks. It covers traditional tasks like table question answering as well as emerging fields such as spreadsheet manipulation and table data analysis.
arXiv Detail & Related papers (2024-02-04T00:47:53Z)
TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning [55.33939289989238]
We propose TAP4LLM as a versatile pre-processor suite for leveraging large language models (LLMs) in table-based tasks effectively. It covers several distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding.
arXiv Detail & Related papers (2023-12-14T15:37:04Z)
OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering [106.73213656603453]
We develop a simple table-based QA model with minimal annotation effort. We propose an omnivorous pretraining approach that consumes both natural and synthetic data.
arXiv Detail & Related papers (2022-07-08T01:23:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.