TabReX : Tabular Referenceless eXplainable Evaluation
- URL: http://arxiv.org/abs/2512.15907v1
- Date: Wed, 17 Dec 2025 19:20:20 GMT
- Title: TabReX : Tabular Referenceless eXplainable Evaluation
- Authors: Tejas Anvekar, Juhna Park, Aparna Garimella, Vivek Gupta
- Abstract summary: TabReX is a reference-less, property-driven framework for evaluating tables generated by large language models. It computes interpretable, rubric-aware scores that quantify structural and factual fidelity. To assess robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types.
- Score: 15.411207072791806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.
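The abstract's pipeline (convert source and table into graph triples, align them, and score structural and factual fidelity with cell-level error traces) can be illustrated with a minimal sketch. This is not the paper's implementation: the triple representation, exact-match alignment (the paper uses LLM-guided matching), and the precision/recall-style scores below are simplifying assumptions for illustration only.

```python
# Minimal sketch of a referenceless, graph-style table check. All names and
# the exact-match alignment are illustrative assumptions, not TabReX's API.

def table_to_triples(header, rows):
    """Flatten a table into (row_key, column, value) triples, using the
    first column as the row key."""
    key_col, *attr_cols = header
    triples = set()
    for row in rows:
        key = row[0]
        for col, val in zip(attr_cols, row[1:]):
            triples.add((key, col, val))
    return triples

def fidelity_scores(source_triples, table_triples):
    """Precision ~ factual fidelity of the table; recall ~ coverage of the
    source. Unmatched table triples act as a cell-level error trace."""
    matched = source_triples & table_triples
    precision = len(matched) / len(table_triples) if table_triples else 0.0
    recall = len(matched) / len(source_triples) if source_triples else 0.0
    errors = table_triples - source_triples  # cells unsupported by the source
    return precision, recall, errors

# Toy example: facts extracted from source text vs. a generated table.
source = {("Alice", "role", "engineer"), ("Bob", "role", "designer")}
table = table_to_triples(
    ["name", "role"],
    [["Alice", "engineer"], ["Bob", "manager"]],  # "manager" is unsupported
)
p, r, errs = fidelity_scores(source, table)
# p == 0.5, r == 0.5, errs == {("Bob", "role", "manager")}
```

In the real framework, the alignment step is LLM-guided and tolerant of paraphrase and restructuring, which is what allows it to trade off sensitivity against specificity rather than relying on exact string matches as above.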
Related papers
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses. Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z) - SCORE: A Semantic Evaluation Framework for Generative Document Parsing [2.5101597298392098]
Multi-modal generative document parsing systems produce semantically correct yet structurally divergent outputs. Conventional metrics (CER, WER, IoU, or TEDS) misclassify such diversity as error, penalizing valid interpretations and obscuring system behavior. We introduce SCORE, an interpretation-agnostic framework that integrates (i) adjusted edit distance for robust content fidelity, (ii) token-level diagnostics to distinguish hallucinations from omissions, (iii) table evaluation with spatial tolerance and semantic alignment, and (iv) hierarchy-aware consistency checks.
arXiv Detail & Related papers (2025-09-16T16:06:19Z) - TabStruct: Measuring Structural Fidelity of Tabular Data [28.606994119562163]
We introduce a new evaluation metric, global utility, which enables the assessment of structural fidelity even in the absence of ground-truth causal structures. We also present the TabStruct benchmark suite, including all datasets, evaluation pipelines, and raw results.
arXiv Detail & Related papers (2025-09-15T14:08:20Z) - LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence [61.46575527504109]
LimiX-16M and LimiX-2M treat structured data as a joint distribution over variables and missingness. We evaluate LimiX models across 11 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratios.
arXiv Detail & Related papers (2025-09-03T17:39:08Z) - Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets [0.2578242050187029]
We present a comprehensive evaluation of structured decoding for text-to-table generation with large language models (LLMs). We compare structured decoding to standard one-shot prompting across three benchmarks: E2E, Rotowire, and Livesum. Results demonstrate that structured decoding significantly enhances the validity and alignment of generated tables, but may degrade performance in contexts involving densely packed textual information.
arXiv Detail & Related papers (2025-08-21T18:11:16Z) - LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations. This structured approach allows Large Language Models (LLMs) to generate and execute SQL queries, enhancing generalization and mitigating biases.
arXiv Detail & Related papers (2025-06-06T05:14:04Z) - Multimodal Tabular Reasoning with Privileged Structured Information [67.40011423365712]
We introduce TabUlar Reasoning with Bridged infOrmation (Turbo). Turbo benefits from a structure-aware reasoning trace generator based on DeepSeek-R1. Turbo achieves state-of-the-art performance (+7.2% vs. previous SOTA) across multiple datasets.
arXiv Detail & Related papers (2025-06-04T15:46:30Z) - TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation [10.212570261759204]
We propose a rubric-based evaluation framework that integrates multi-level structural descriptors with fine-grained contextual signals. We introduce TabXEval, an eXhaustive and eXplainable two-phase evaluation framework. We evaluate TabXEval on TabXBench, a diverse, multi-domain benchmark featuring realistic table perturbations and human annotations.
arXiv Detail & Related papers (2025-05-28T09:50:29Z) - How Well Does Your Tabular Generator Learn the Structure of Tabular Data? [10.974400005358193]
In this paper, we introduce TabStruct, a novel evaluation benchmark that positions structural fidelity as a core evaluation dimension. We show that structural fidelity offers a task-independent, domain-agnostic evaluation dimension.
arXiv Detail & Related papers (2025-03-12T14:54:58Z) - StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z) - TRUST: An Accurate and End-to-End Table Structure Recognizer Using Splitting-based Transformers [56.56591337457137]
We propose an accurate and end-to-end transformer-based table structure recognition method, referred to as TRUST.
Transformers are suitable for table structure recognition because of their global computations, perfect memory, and parallel computation.
We conduct experiments on several popular benchmarks, including PubTabNet and SynthTable, and our method achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-08-31T08:33:36Z) - Towards Faithful Neural Table-to-Text Generation with Content-Matching
Constraints [63.84063384518667]
We propose a novel Transformer-based generation framework to achieve the goal.
Core techniques in our method to enforce faithfulness include a new table-text optimal-transport matching loss.
To evaluate faithfulness, we propose a new automatic metric specialized to the table-to-text generation problem.
arXiv Detail & Related papers (2020-05-03T02:54:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.