TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models
- URL: http://arxiv.org/abs/2506.18421v2
- Date: Mon, 14 Jul 2025 06:09:12 GMT
- Title: TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models
- Authors: Ce Li, Xiaofan Liu, Zhiyan Song, Ce Chi, Chen Zhao, Jingjing Yang, Zhendong Wang, Kexin Yang, Boshen Shi, Xing Wang, Chao Deng, Junlan Feng
- Abstract summary: Reasoning with table-structured data poses significant challenges for large language models (LLMs). We present a comprehensive table reasoning evaluation benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities. We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes: TCoT, PoT and ICoT.
- Score: 30.26407735827857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The majority of data in businesses and industries is stored in tables, databases, and data warehouses. Reasoning with table-structured data poses significant challenges for large language models (LLMs) due to its hidden semantics, inherent complexity, and structured nature. One of these challenges is the lack of an effective evaluation benchmark that fairly reflects the performance of LLMs on broad table reasoning abilities. In this paper, we fill this gap, presenting a comprehensive table reasoning evaluation benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities across a total of 26 sub-tasks. We construct a high-quality dataset through an iterative data processing procedure. We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes: TCoT, PoT and ICoT. Further, we benchmark over 20 state-of-the-art LLMs using this framework and demonstrate its effectiveness. Experimental results reveal that existing LLMs still have significant room for improvement in addressing complex, real-world table-related tasks. Both the dataset and evaluation framework are publicly available, with the dataset hosted on huggingface.co/datasets/JT-LM/JIUTIAN-TReB and the framework on github.com/JT-LM/jiutian-treb.
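To make the three inference modes concrete, here is a minimal sketch of how an evaluation harness might dispatch them, reading TCoT as textual chain-of-thought, PoT as program-of-thought, and ICoT as interleaved reasoning and code execution. The `call_llm` and `run_python` helpers and the prompts are hypothetical stand-ins, not the released framework's API.

```python
# Sketch of dispatching TReB's three inference modes; illustrative only.
import contextlib
import io


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; wire up a real client here."""
    raise NotImplementedError


def run_python(code: str) -> str:
    """Execute model-generated code and capture stdout (sandbox in practice)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # unsafe outside a sandbox; fine for a sketch
    return buf.getvalue().strip()


def answer(table: str, question: str, mode: str) -> str:
    if mode == "TCoT":  # textual chain-of-thought: free-form reasoning, then answer
        return call_llm(f"{table}\n{question}\nThink step by step, then answer.")
    if mode == "PoT":  # program-of-thought: generate code, return its printed output
        code = call_llm(f"{table}\n{question}\nWrite Python that prints the answer.")
        return run_python(code)
    if mode == "ICoT":  # interleaved: alternate reasoning and code over a few turns
        transcript = (f"{table}\n{question}\nReason step by step; when you need "
                      "computation, reply with a line 'CODE:' followed by Python.")
        for _ in range(3):  # fixed turn budget for the sketch
            step = call_llm(transcript)
            if "CODE:" in step:
                body = step.split("CODE:", 1)[1]
                transcript += f"\n{step}\nObservation: {run_python(body)}"
            else:
                return step  # the model produced a final answer
        return transcript
    raise ValueError(f"unknown mode: {mode}")
```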
Related papers
- RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis [16.572608600078922]
RealHiTBench is a benchmark designed to evaluate the performance of Large Language Models (LLMs) across a variety of input formats. Our experimental results, using 25 state-of-the-art LLMs, demonstrate that RealHiTBench is indeed a challenging benchmark. We also develop TreeThinker, a tree-based pipeline that organizes hierarchical headers into a tree structure.
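The summary only states that TreeThinker turns hierarchical headers into a tree; a minimal sketch of one plausible construction (our guess, not the authors' implementation) is:

```python
# Nest multi-row column headers into a tree: each column's path of labels
# becomes a branch. A guess at TreeThinker's data structure, not its code.
def header_tree(header_rows: list[list[str]]) -> dict:
    tree: dict = {}
    for col in range(len(header_rows[0])):
        node = tree
        for row in header_rows:
            node = node.setdefault(row[col], {})
    return tree


# A two-level header | Sales | Sales | Costs |
#                    | Q1    | Q2    | Q1    |
# becomes {'Sales': {'Q1': {}, 'Q2': {}}, 'Costs': {'Q1': {}}}.
print(header_tree([["Sales", "Sales", "Costs"], ["Q1", "Q2", "Q1"]]))
```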
arXiv Detail & Related papers (2025-06-16T12:19:08Z)
- MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark [70.47478110973042]
We introduce MMTU, a large-scale benchmark with over 30K questions across 25 real-world table tasks. MMTU is designed to comprehensively evaluate models' ability to understand, reason over, and manipulate real tables at the expert level. We show that MMTU requires a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models.
arXiv Detail & Related papers (2025-06-05T21:05:03Z)
- NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables [32.9031799179503]
NeedleInATable (NIAT) treats each table cell as a "needle" and requires models to extract the target cell based on cell locations or lookup questions. Our data, code and models will be released to facilitate future research.
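A location-based NIAT probe can be sketched as below; the prompt wording and the `make_location_probe` helper are our illustration, and the paper's lookup-question variant is omitted:

```python
# Build a needle-in-a-table probe: pick a random cell as the "needle" and
# ask for it by coordinates. Illustrative, not the paper's released code.
import random


def make_location_probe(table: list[list[str]], header: list[str]):
    """Return (prompt, gold answer) for one randomly chosen cell."""
    r = random.randrange(len(table))
    c = random.randrange(len(header))
    lines = [" | ".join(header)] + [" | ".join(row) for row in table]
    prompt = (f"In the table below, what is the value in data row {r + 1}, "
              f"column '{header[c]}'?\n" + "\n".join(lines))
    return prompt, table[r][c]


prompt, gold = make_location_probe([["Alice", "41"], ["Bob", "37"]], ["name", "age"])
print(prompt)
print("Gold:", gold)
```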
arXiv Detail & Related papers (2025-04-09T03:46:56Z)
- Better Think with Tables: Tabular Structures Enhance LLM Comprehension for Data-Analytics Requests [33.471112091886894]
Large Language Models (LLMs) often struggle with data-analytics requests related to information retrieval and data manipulation. We introduce Thinking with Tables, where we inject tabular structures into LLMs for data-analytics requests. We show that providing tables yields a 40.29 percent average performance gain along with better manipulation and token efficiency.
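A minimal sketch of the injection step, assuming a plain markdown serialization (one common choice, not necessarily the authors' exact format):

```python
# Restructure flat records into a markdown table before prompting the LLM.
def to_markdown_table(records: list[dict]) -> str:
    cols = list(records[0])
    lines = ["| " + " | ".join(cols) + " |",
             "| " + " | ".join("---" for _ in cols) + " |"]
    lines += ["| " + " | ".join(str(r[c]) for c in cols) + " |" for r in records]
    return "\n".join(lines)


records = [{"city": "Oslo", "pop_k": 709}, {"city": "Bergen", "pop_k": 286}]
print(to_markdown_table(records) + "\n\nWhich city has the larger population?")
```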
arXiv Detail & Related papers (2024-12-22T23:31:03Z)
- Benchmarking Table Comprehension In The Wild [9.224698222634789]
TableQuest is a new benchmark designed to evaluate the holistic table comprehension capabilities of Large Language Models (LLMs). We experiment with 7 state-of-the-art models and find that, despite reasonable accuracy in locating facts, they often falter when required to execute more sophisticated reasoning or multi-step calculations.
arXiv Detail & Related papers (2024-12-13T05:52:37Z)
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)
- TableRAG: Million-Token Table Understanding with Language Models [53.039560091592215]
TableRAG is a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. Our results demonstrate that TableRAG achieves the highest retrieval quality, leading to new state-of-the-art performance on large-scale table understanding.
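A toy illustration of the two retrieval channels, with simple keyword matching standing in for the paper's query expansion and learned retrieval:

```python
# Fetch matching column names (schema retrieval) and matching cell values
# (cell retrieval), then prompt with only that slice of a huge table.
def retrieve(table: dict[str, list[str]], query: str, k: int = 3):
    terms = [t.strip("?.,") for t in query.lower().split()]
    schema_hits = [c for c in table if any(t in c.lower() for t in terms)]
    cell_hits = [(c, v) for c, vals in table.items() for v in vals
                 if any(t in v.lower() for t in terms)]
    return schema_hits[:k], cell_hits[:k]


table = {"country": ["Norway", "Chile"], "capital": ["Oslo", "Santiago"]}
print(retrieve(table, "What is the capital of Norway?"))
# -> (['capital'], [('country', 'Norway')])
```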
arXiv Detail & Related papers (2024-10-07T04:15:02Z)
- TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning [61.14586098005874]
Current Large Language Models (LLMs) exhibit limited ability to understand table structures and to apply precise numerical reasoning. We introduce our Tool-Augmented Reasoning framework for Tables (TART), which integrates LLMs with specialized tools. TART contains three key components: a table formatter to ensure accurate data representation, a tool maker to develop specific computational tools, and an explanation generator to maintain explainability.
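A skeleton of the three components as the abstract lists them; the bodies below are placeholders, since the real framework has an LLM synthesize the tools:

```python
# TART-style pipeline sketch: format the table, apply an exact computational
# tool, then generate an explanation. Illustrative placeholders throughout.
def format_table(raw: list[list[str]]) -> list[dict]:
    """Table formatter: turn a header row plus data rows into records."""
    header, *rows = raw
    return [dict(zip(header, r)) for r in rows]


def tool_mean(rows: list[dict], column: str) -> float:
    """A stand-in for an LLM-made tool: exact arithmetic instead of guessing."""
    return sum(float(r[column]) for r in rows) / len(rows)


def explain(column: str, value: float) -> str:
    """Explanation generator: say what was computed and how."""
    return f"The mean of '{column}' was computed with an exact tool: {value:.2f}"


rows = format_table([["name", "age"], ["Alice", "41"], ["Bob", "37"]])
print(explain("age", tool_mean(rows, "age")))  # -> ... 39.00
```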
arXiv Detail & Related papers (2024-09-18T06:19:59Z)
- TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
TACT contains challenging instructions that demand stitching information scattered across one or more texts.
We construct this dataset by leveraging an existing dataset of texts and their associated tables.
We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
arXiv Detail & Related papers (2024-06-05T20:32:56Z)
- TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning [55.33939289989238]
We propose TAP4LLM as a versatile pre-processor suite for leveraging large language models (LLMs) in table-based tasks effectively.
It covers three distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding.
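A rough sketch of components (1) and (3), with an illustrative keyword-overlap heuristic and key=value serialization; the paper explores several variants of each, and the augmentation step is omitted:

```python
# Sample query-relevant rows from a large table, then serialize the
# sub-table for the prompt. Heuristic and format are ours, not TAP4LLM's.
def sample_rows(rows: list[dict], query: str, budget: int = 2) -> list[dict]:
    terms = set(query.lower().split())
    scored = sorted(rows, key=lambda r: -sum(
        t in str(v).lower() for v in r.values() for t in terms))
    return scored[:budget]


def serialize(rows: list[dict]) -> str:
    return "\n".join(", ".join(f"{k}={v}" for k, v in r.items()) for r in rows)


rows = [{"city": "Oslo", "year": 2020}, {"city": "Oslo", "year": 2021},
        {"city": "Rome", "year": 2020}]
print(serialize(sample_rows(rows, "Oslo in 2021")))
# city=Oslo, year=2021
# city=Oslo, year=2020
```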
arXiv Detail & Related papers (2023-12-14T15:37:04Z)