The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
- URL: http://arxiv.org/abs/2502.19412v2
- Date: Sun, 02 Mar 2025 16:16:39 GMT
- Title: The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
- Authors: Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz, Asaf Yehudai, Elron Bandel, Leshem Choshen, Eyal Shnarch, Percy Liang, Michal Shmueli-Scheuer
- Abstract summary: ToRR is a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR.
- Score: 45.420943398134845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. Although no specific table format leads to consistently better performance, we show that testing over multiple formats is crucial for reliably estimating model capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that table understanding and reasoning tasks remain a significant challenge.
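As an illustration of the multi-format protocol described above, here is a minimal sketch (not ToRR's actual harness; the toy table, the serializers, and the `predict` stub are invented for illustration): the same table is serialized in several common formats, the same questions are posed under each, and the spread of scores across formats is reported next to the mean.

```python
import json
import statistics

# Toy table and QA pairs, invented for illustration (not drawn from ToRR).
TABLE = [
    {"city": "Paris", "population_m": 2.1},
    {"city": "Rome", "population_m": 2.8},
]
QA = [("Which city has the larger population?", "Rome")]

def to_markdown(rows):
    header = "| " + " | ".join(rows[0]) + " |"
    sep = "| " + " | ".join("---" for _ in rows[0]) + " |"
    body = ["| " + " | ".join(str(v) for v in r.values()) + " |" for r in rows]
    return "\n".join([header, sep] + body)

def to_csv(rows):
    lines = [",".join(rows[0])]
    lines += [",".join(str(v) for v in r.values()) for r in rows]
    return "\n".join(lines)

def to_json(rows):
    return json.dumps(rows, indent=2)

def predict(table_text: str, question: str) -> str:
    """Stand-in for a real model call (e.g., an LLM queried with the serialized table)."""
    return "Rome"  # placeholder answer

def accuracy(serializer):
    table_text = serializer(TABLE)
    return sum(predict(table_text, q) == gold for q, gold in QA) / len(QA)

scores = {fn.__name__: accuracy(fn) for fn in (to_markdown, to_csv, to_json)}
print("per-format accuracy:", scores)
print("mean:", statistics.mean(scores.values()),
      "spread (max - min):", max(scores.values()) - min(scores.values()))
```

Reporting the spread alongside the mean is what surfaces the brittleness described above: two models with the same average score can differ sharply in how much that score moves when only the table serialization changes.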
Related papers
- Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations.
They generate only a limited range of perturbations for a single Information Extraction (IE) task.
Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench.
We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z) - FLEXTAF: Enhancing Table Reasoning with Flexible Tabular Formats [48.47559543509975]
We propose FLEXTAF-Single and FLEXTAF-Vote to enhance table reasoning performance by employing flexible formats.
Our experiments on WikiTableQuestions and TabFact reveal significant improvements, with average gains of 2.3% and 4.8%, respectively.
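The voting variant can be sketched as follows (a simplified reading of FLEXTAF-Vote, not the authors' implementation; the per-format answers are made up): the same question is answered once per tabular format, and the majority answer wins.

```python
from collections import Counter

# Hypothetical answers the same model gives to one question when the table is
# serialized in different formats (invented values, not from the paper).
answers_by_format = {
    "markdown": "Rome",
    "csv": "Rome",
    "json": "Paris",
    "html": "Rome",
}

# FLEXTAF-Vote-style aggregation, simplified: majority vote across formats,
# with ties broken by first occurrence.
final_answer = Counter(answers_by_format.values()).most_common(1)[0][0]
print(final_answer)  # -> "Rome"
```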
arXiv Detail & Related papers (2024-08-16T17:00:11Z) - LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that a finetuned LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z) - Bridging Textual and Tabular Worlds for Fact Verification: A Lightweight, Attention-Based Model [34.1224836768324]
FEVEROUS is a benchmark and research initiative focused on fact extraction and verification tasks.
This paper introduces a simple yet powerful model that nullifies the need for modality conversion.
Our approach efficiently exploits latent connections between different data types, thereby yielding comprehensive and reliable verdict predictions.
arXiv Detail & Related papers (2024-03-26T03:54:25Z) - Observatory: Characterizing Embeddings of Relational Tables [15.808819332614712]
Researchers and practitioners are keen to leverage language and table embedding models in many new application contexts.
There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage.
We propose Observatory, a formal framework to systematically analyze embedding representations of relational tables.
arXiv Detail & Related papers (2023-10-05T00:58:45Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. Under this elaborated robustness metric, a model is judged to be robust if its performance is consistently accurate across entire cliques.
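The clique-level criterion can be sketched as follows (illustrative only; the cliques and correctness judgments are made up): per-example accuracy averages over all variants, while robustness requires every knowledge-invariant variant inside a clique to be handled correctly.

```python
# Each clique groups surface variants of the same underlying fact; each value
# records whether the model's extraction was judged correct (hypothetical data).
cliques = {
    "fact_1": [True, True, True],   # robust: correct on every variant
    "fact_2": [True, False, True],  # not robust: one variant fails
    "fact_3": [True, True],
}

example_accuracy = sum(sum(v) for v in cliques.values()) / sum(len(v) for v in cliques.values())
clique_robustness = sum(all(v) for v in cliques.values()) / len(cliques)

print(f"per-example accuracy:    {example_accuracy:.2f}")   # 0.88
print(f"clique-level robustness: {clique_robustness:.2f}")  # 0.67
```

The gap between the two numbers is the point of the stricter metric: aggregate accuracy can stay high while the clique-level view exposes inconsistency across variants of the same fact.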
arXiv Detail & Related papers (2023-05-23T12:05:09Z) - Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models [3.9052860539161918]
We propose a simple method for measuring the scale of models' reliance on any identified spurious feature.
We assess robustness to a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA).
We find that while existing debiasing methods can mitigate reliance on a chosen spurious feature, the OOD performance gains of these methods cannot be explained by mitigated reliance on biased features.
arXiv Detail & Related papers (2023-05-11T14:35:00Z) - Making Table Understanding Work in Practice [9.352813774921655]
We discuss three challenges of deploying table understanding models and propose a framework to address them.
We present SigmaTyper, which encapsulates a hybrid model trained on GitTables and integrates a lightweight human-in-the-loop approach to customize the model.
arXiv Detail & Related papers (2021-09-11T03:38:24Z) - Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
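One way to make the idea concrete is a point-biserial proxy for item discrimination (a rough stand-in for the paper's IRT fit, computed on a synthetic response matrix): items whose correctness tracks overall model strength are the ones that separate strong from weak models.

```python
import numpy as np

rng = np.random.default_rng(1)
n_models, n_items = 18, 50  # sizes loosely echo the setup; all responses are synthetic

# responses[m, i] = 1 if model m answers item i correctly, sampled from a
# toy ability-minus-difficulty model (invented data, not the paper's).
ability = np.linspace(-1.5, 1.5, n_models)[:, None]
difficulty = rng.normal(size=(1, n_items))
p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
responses = (rng.random((n_models, n_items)) < p_correct).astype(float)

totals = responses.sum(axis=1)
# Point-biserial discrimination: correlation between getting an item right and
# a model's overall score; constant items carry no signal and are skipped.
disc = [np.corrcoef(responses[:, i], totals)[0, 1]
        for i in range(n_items) if responses[:, i].std() > 0]
print("mean item discrimination:", float(np.mean(disc)))
```

A full IRT fit, as used in the paper, estimates discrimination jointly with example difficulty and model ability rather than using the total score as an ability proxy.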
arXiv Detail & Related papers (2021-06-01T22:33:53Z) - Characterizing Fairness Over the Set of Good Models Under Selective Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
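The prototype update described in the last entry can be sketched as follows (a simplified, fixed-weight version on made-up embeddings; the paper meta-learns the confidence function rather than using a fixed exp(-distance) weighting):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim = 3, 8

# Initial class prototypes from the labeled support set and unlabeled query
# embeddings for the same episode (random stand-ins for real encoder outputs).
prototypes = rng.normal(size=(n_classes, dim))
queries = rng.normal(size=(10, dim))

# Confidence of each query for each class: softmax over negative squared distances.
dists = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)  # (10, 3)
conf = np.exp(-dists)
conf /= conf.sum(axis=1, keepdims=True)

# Transductive refinement: each prototype becomes a weighted average of itself
# (weight 1) and the queries, weighted by their confidence for that class.
refined = (prototypes + conf.T @ queries) / (1.0 + conf.sum(axis=0))[:, None]
print(refined.shape)  # (3, 8)
```

Meta-learning replaces the fixed distance-based weighting above with a learned confidence function, which is, roughly, how the paper assigns better weights to unlabeled queries.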