MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark
- URL: http://arxiv.org/abs/2506.05587v1
- Date: Thu, 05 Jun 2025 21:05:03 GMT
- Title: MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark
- Authors: Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish
- Abstract summary: We introduce MMTU, a large-scale benchmark with over 30K questions across 25 real-world table tasks. MMTU is designed to comprehensively evaluate models' ability to understand, reason over, and manipulate real tables at the expert level. We show that MMTU requires a combination of skills, including table understanding, reasoning, and coding, that remain challenging for today's frontier models.
- Score: 70.47478110973042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce and focus narrowly on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding of, and model progress in, this important area. In this work, we introduce MMTU, a large-scale benchmark with over 30K questions across 25 real-world table tasks, designed to comprehensively evaluate models' ability to understand, reason over, and manipulate real tables at the expert level. These tasks are drawn from decades' worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills, including table understanding, reasoning, and coding, that remain challenging for today's frontier models: even frontier reasoning models like OpenAI o4-mini and DeepSeek R1 score only around 60%, suggesting significant room for improvement. We highlight key findings from our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.
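Since the data is published on Hugging Face, a minimal Python sketch for loading and inspecting the benchmark could look like the following. The repository id is taken from the URL above; the split and field names in the commented loop are assumptions, as the abstract does not document the dataset schema.

    # Minimal sketch: load the MMTU benchmark from Hugging Face.
    # The repo id comes from the abstract; the split and field names in
    # the commented loop are assumptions and may not match the real schema.
    from datasets import load_dataset

    mmtu = load_dataset("MMTU-benchmark/MMTU")
    print(mmtu)  # inspect the available splits and their columns

    # Hypothetical iteration over examples (field names assumed):
    # for example in mmtu["test"]:
    #     question, answer = example["question"], example["answer"]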
Related papers
- TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models [30.26407735827857]
Reasoning with table-structured data poses significant challenges for large language models (LLMs). We present a comprehensive table reasoning evolution benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities. We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes: TCoT, PoT, and ICoT.
arXiv Detail & Related papers (2025-06-23T09:02:04Z)
- RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis [16.572608600078922]
RealHiTBench is a benchmark designed to evaluate the performance of Large Language Models (LLMs) across a variety of input formats. Our experimental results, using 25 state-of-the-art LLMs, demonstrate that RealHiTBench is indeed a challenging benchmark. We also develop TreeThinker, a tree-based pipeline that organizes hierarchical headers into a tree structure.
arXiv Detail & Related papers (2025-06-16T12:19:08Z)
- NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables [32.9031799179503]
NeedleInATable (NIAT) treats each table cell as a "needle" and requires models to extract the target cell based on cell locations or lookup questions. Our data, code and models will be released to facilitate future research.
arXiv Detail & Related papers (2025-04-09T03:46:56Z)
- Benchmarking Table Comprehension In The Wild [9.224698222634789]
TableQuest is a new benchmark designed to evaluate the holistic table comprehension capabilities of Large Language Models (LLMs). We experiment with 7 state-of-the-art models, and find that despite reasonable accuracy in locating facts, they often falter when required to execute more sophisticated reasoning or multi-step calculations.
arXiv Detail & Related papers (2024-12-13T05:52:37Z)
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)
- MATATA: Weakly Supervised End-to-End MAthematical Tool-Augmented Reasoning for Tabular Applications [0.9831489366502302]
This work introduces MATATA, a novel weakly supervised end-to-end approach to train multi-step reasoning language agents. MATATA presents an annotation-free paradigm for each agent to enhance 3.8B/8B SLMs. Experiments demonstrate that MATATA achieves state-of-the-art results on FinQA, and on TAT-QA among reasoning methods based on open-source SLMs.
arXiv Detail & Related papers (2024-11-28T05:12:17Z)
- TableRAG: Million-Token Table Understanding with Language Models [53.039560091592215]
TableRAG is a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. Our results demonstrate that TableRAG achieves the highest retrieval quality, leading to new state-of-the-art performance on large-scale table understanding.
arXiv Detail & Related papers (2024-10-07T04:15:02Z)
- TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning [61.14586098005874]
Current Large Language Models (LLMs) exhibit limited ability to understand table structures and to apply precise numerical reasoning.
We introduce our Tool-Augmented Reasoning framework for Tables (TART), which integrates LLMs with specialized tools.
TART contains three key components: a table formatter to ensure accurate data representation, a tool maker to develop specific computational tools, and an explanation generator to maintain explainability.
arXiv Detail & Related papers (2024-09-18T06:19:59Z)
- RelBench: A Benchmark for Deep Learning on Relational Databases [78.52438155603781]
We present RelBench, a public benchmark for solving tasks over databases with graph neural networks.
We use RelBench to conduct the first comprehensive study of Deep Learning infrastructure.
Relational deep learning (RDL) learns better whilst reducing the human work needed by more than an order of magnitude.
arXiv Detail & Related papers (2024-07-29T14:46:13Z)
- TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy [81.76462101465354]
We present a novel large vision-language model, TabPedia, equipped with a concept synergy mechanism.
This unified framework allows TabPedia to seamlessly integrate VTU tasks, such as table detection, table structure recognition, table querying, and table question answering.
To better evaluate the VTU task in real-world scenarios, we establish a new and comprehensive table VQA benchmark, ComTQA.
arXiv Detail & Related papers (2024-06-03T13:54:05Z)
- Large Language Model for Table Processing: A Survey [18.32332372134988]
This survey provides a comprehensive overview of table-related tasks.
It covers traditional tasks like table question answering as well as emerging fields such as spreadsheet manipulation and table data analysis.
arXiv Detail & Related papers (2024-02-04T00:47:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.