TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering
- URL: http://arxiv.org/abs/2506.03949v2
- Date: Wed, 11 Jun 2025 15:37:07 GMT
- Title: TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering
- Authors: Junnan Zhu, Jingyi Wang, Bohan Yu, Xiaoyu Wu, Junbo Li, Lei Wang, Nan Xu,
- Abstract summary: Existing TableQA benchmarks are often limited by their focus on simple flat tables and suffer from data leakage. We introduce TableEval, a new benchmark designed to evaluate LLMs on realistic TableQA tasks. To minimize the risk of data leakage, we collect all data from recent real-world documents.
- Score: 18.173773939709733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLMs have shown impressive progress in natural language processing. However, they still face significant challenges in TableQA, where real-world complexities such as diverse table structures, multilingual data, and domain-specific reasoning are crucial. Existing TableQA benchmarks are often limited by their focus on simple flat tables and suffer from data leakage. Furthermore, most benchmarks are monolingual and fail to capture the cross-lingual and cross-domain variability in practical applications. To address these limitations, we introduce TableEval, a new benchmark designed to evaluate LLMs on realistic TableQA tasks. Specifically, TableEval includes tables with various structures (such as concise, hierarchical, and nested tables) collected from four domains (including government, finance, academia, and industry reports). Besides, TableEval features cross-lingual scenarios with tables in Simplified Chinese, Traditional Chinese, and English. To minimize the risk of data leakage, we collect all data from recent real-world documents. Considering that existing TableQA metrics fail to capture semantic accuracy, we further propose SEAT, a new evaluation framework that assesses the alignment between model responses and reference answers at the sub-question level. Experimental results have shown that SEAT achieves high agreement with human judgment. Extensive experiments on TableEval reveal critical gaps in the ability of state-of-the-art LLMs to handle these complex, real-world TableQA tasks, offering insights for future improvements. We make our dataset available here: https://github.com/wenge-research/TableEval.
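The abstract describes SEAT only at a high level: it aligns model responses with reference answers at the sub-question level. Below is a minimal sketch of what such scoring could look like; the `SubQuestion` fields, the `judge` callable, and the simple averaging are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SubQuestion:
    """One sub-question of a TableQA item, with its reference answer."""
    question: str
    reference: str

def seat_style_score(
    response: str,
    sub_questions: List[SubQuestion],
    judge: Callable[[str, str, str], bool],
) -> float:
    """Score a model response by checking, for each sub-question, whether the
    response agrees with the reference answer.

    `judge(sub_question, reference, response)` is assumed to be an LLM- or
    rule-based verdict; the paper's actual prompt and aggregation may differ.
    """
    if not sub_questions:
        return 0.0
    hits = sum(judge(sq.question, sq.reference, response) for sq in sub_questions)
    return hits / len(sub_questions)

# Toy usage with a trivial string-containment "judge".
example = [
    SubQuestion("What was 2023 revenue?", "12.4 billion"),
    SubQuestion("How much did it grow year over year?", "8%"),
]
naive_judge = lambda q, ref, resp: ref in resp
print(seat_style_score("Revenue reached 12.4 billion, up 8%.", example, naive_judge))  # 1.0
```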
Related papers
- TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models [30.26407735827857]
Reasoning with table-structured data poses significant challenges for large language models (LLMs). We present a comprehensive table reasoning evolution benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities. We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes: TCoT, PoT, and ICoT.
arXiv Detail & Related papers (2025-06-23T09:02:04Z)
- Benchmarking Table Comprehension In The Wild [9.224698222634789]
TableQuest is a new benchmark designed to evaluate the holistic table comprehension capabilities of Large Language Models (LLMs). We experiment with 7 state-of-the-art models, and find that despite reasonable accuracy in locating facts, they often falter when required to execute more sophisticated reasoning or multi-step calculations.
arXiv Detail & Related papers (2024-12-13T05:52:37Z)
- TableRAG: Million-Token Table Understanding with Language Models [53.039560091592215]
TableRAG is a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. Our results demonstrate that TableRAG achieves the highest retrieval quality, leading to new state-of-the-art performance on large-scale table understanding.
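As a rough illustration of the retrieval step described above, the sketch below scores schema entries and individual cells against expanded query terms and keeps only the top matches; the function names, cosine scoring, and top-k cutoff are assumptions for illustration, not the TableRAG implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def retrieve_table_context(
    query_terms: Sequence[str],               # e.g. produced by query expansion
    schema: List[str],                        # column names
    cells: Dict[Tuple[int, str], str],        # (row index, column) -> cell value
    embed: Callable[[str], List[float]],      # any text-embedding function
    top_k: int = 5,
) -> Dict[str, list]:
    """Return the schema entries and cells most similar to the expanded query,
    so only a small, relevant slice of a large table is passed to the LM."""
    def cos(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb + 1e-9)

    q_vecs = [embed(t) for t in query_terms]

    def score(text: str) -> float:
        v = embed(text)
        return max(cos(v, q) for q in q_vecs)

    top_cols = sorted(schema, key=score, reverse=True)[:top_k]
    top_cells = sorted(cells.items(), key=lambda kv: score(kv[1]), reverse=True)[:top_k]
    return {"columns": top_cols, "cells": top_cells}
```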
arXiv Detail & Related papers (2024-10-07T04:15:02Z)
- Table Question Answering for Low-resourced Indic Languages [71.57359949962678]
TableQA is the task of answering questions over tables of structured information, returning individual cells or tables as output.
We introduce a fully automatic large-scale tableQA data generation process for low-resource languages with a limited budget.
We apply our data generation method to two Indic languages, Bengali and Hindi, which have no tableQA datasets or models.
arXiv Detail & Related papers (2024-10-04T16:26:12Z)
- HiddenTables & PyQTax: A Cooperative Game and Dataset For TableQA to Ensure Scale and Data Privacy Across a Myriad of Taxonomies [9.09415727445941]
We propose a cooperative game dubbed "HiddenTables" as a potential resolution to this challenge.
"HiddenTables" is played between the code-generating "r" and the "Oracle windows" which evaluates the ability of the agents to solve Table QA tasks.
We provide evidential experiments on a diverse set of tables that demonstrate an LLM's collective inability to generalize and perform on complex queries.
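A minimal sketch of how such a Solver/Oracle round could be structured, assuming the Solver only ever sees a schema description and natural-language feedback while the Oracle holds the table; the function signatures and turn limit are illustrative, not the paper's protocol.

```python
from typing import Callable

def hidden_tables_round(
    question: str,
    schema_description: str,
    solver: Callable[[str, str], str],       # generates code from question + schema only
    oracle_execute: Callable[[str], str],    # runs the code on the hidden table
    check_answer: Callable[[str], bool],     # Oracle's correctness check
    max_turns: int = 3,
) -> bool:
    """One cooperative round: the Solver never sees the table contents, only the
    schema; the Oracle executes the generated code and reports success or failure."""
    feedback = ""
    for _ in range(max_turns):
        code = solver(question + feedback, schema_description)
        result = oracle_execute(code)
        if check_answer(result):
            return True
        feedback = f"\nPrevious attempt returned: {result}. Please revise."
    return False
```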
arXiv Detail & Related papers (2024-06-16T04:53:29Z)
- Multimodal Table Understanding [26.652797853893233]
Directly understanding tables from intuitive visual information is a crucial and urgent challenge for developing more practical applications.
We propose a new problem, multimodal table understanding, where the model needs to generate correct responses to various table-related requests.
We develop Table-LLaVA, a generalist multimodal large language model (MLLM), which significantly outperforms recent open-source MLLM baselines on 23 benchmarks.
arXiv Detail & Related papers (2024-06-12T11:27:03Z)
- Large Language Model for Table Processing: A Survey [18.32332372134988]
This survey provides a comprehensive overview of table-related tasks.
It covers traditional tasks like table question answering as well as emerging fields such as spreadsheet manipulation and table data analysis.
arXiv Detail & Related papers (2024-02-04T00:47:53Z)
- TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning [55.33939289989238]
We propose TAP4LLM as a versatile pre-processor suite for leveraging large language models (LLMs) in table-based tasks effectively.
It covers several distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding.
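A compact sketch of that three-stage idea: sample a query-relevant sub-table, augment it with extra context, then pack and serialize it into a prompt. The keyword-based sampling heuristic and the metadata augmentation here are stand-ins for illustration, not TAP4LLM's actual components.

```python
import pandas as pd

def prepare_table_prompt(table: pd.DataFrame, query: str, max_rows: int = 20) -> str:
    """Illustrative sample -> augment -> pack pipeline for table-based prompting."""
    # (1) Table sampling: keep rows that mention a query term, as a crude
    #     stand-in for query-semantics-based decomposition of large tables.
    terms = [t.lower() for t in query.split()]
    mask = table.astype(str).apply(
        lambda row: any(t in " ".join(row).lower() for t in terms), axis=1
    )
    sub_table = table[mask].head(max_rows) if mask.any() else table.head(max_rows)

    # (2) Table augmentation: attach simple metadata (a real system might add
    #     external knowledge or model-generated column descriptions instead).
    augmentation = f"Columns: {', '.join(map(str, table.columns))}. Rows kept: {len(sub_table)}."

    # (3) Packing & serialization: emit a markdown table the LLM can read
    #     (pandas' to_markdown requires the optional `tabulate` dependency).
    serialized = sub_table.to_markdown(index=False)
    return f"{augmentation}\n\n{serialized}\n\nQuestion: {query}"
```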
arXiv Detail & Related papers (2023-12-14T15:37:04Z)
- TableQAKit: A Comprehensive and Practical Toolkit for Table-based Question Answering [23.412691101965414]
TableQAKit is the first comprehensive toolkit designed specifically for TableQA.
TableQAKit is open-source, with an interactive interface that includes visual operations and comprehensive data for ease of use.
arXiv Detail & Related papers (2023-10-23T16:33:23Z)
- MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering [61.48881995121938]
Real-world queries are complex in nature, often over multiple tables in a relational database or web page.
Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers.
arXiv Detail & Related papers (2023-05-22T08:25:15Z)
- OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering [106.73213656603453]
We develop a simple table-based QA model with minimal annotation effort.
We propose an omnivorous pretraining approach that consumes both natural and synthetic data.
arXiv Detail & Related papers (2022-07-08T01:23:45Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
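A toy illustration of generating aligned question-SQL pairs from a synchronous grammar, where each production rewrites a non-terminal into a paired (question fragment, SQL fragment); the grammar, the table name `t`, and the templates are invented for illustration and are far simpler than GraPPa's induced SCFG.

```python
import random
from typing import Tuple

# Toy synchronous context-free grammar: each production yields an aligned
# (question template, SQL template) pair sharing the same placeholders.
SCFG = {
    "ROOT": [("show the {COL} of rows where {COND}", "SELECT {COL} FROM t WHERE {COND}")],
    "COL": [("name", "name"), ("population", "population")],
    "COND": [("the year is {VAL}", "year = {VAL}")],
    "VAL": [("2019", "2019"), ("2020", "2020")],
}

def expand(symbol: str) -> Tuple[str, str]:
    """Recursively expand a non-terminal into an aligned question/SQL pair."""
    question_tpl, sql_tpl = random.choice(SCFG[symbol])
    for nt in SCFG:
        placeholder = "{" + nt + "}"
        if placeholder in question_tpl:
            q_sub, s_sub = expand(nt)
            question_tpl = question_tpl.replace(placeholder, q_sub)
            sql_tpl = sql_tpl.replace(placeholder, s_sub)
    return question_tpl, sql_tpl

print(expand("ROOT"))
# e.g. ('show the name of rows where the year is 2020', 'SELECT name FROM t WHERE year = 2020')
```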
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.