Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets
- URL: http://arxiv.org/abs/2508.15910v1
- Date: Thu, 21 Aug 2025 18:11:16 GMT
- Title: Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets
- Authors: Julian Oestreich, Lydia Müller
- Abstract summary: We present a comprehensive evaluation of structured decoding for text-to-table generation with large language models (LLMs). We compare structured decoding to standard one-shot prompting across three benchmarks - E2E, Rotowire, and Livesum. Results demonstrate that structured decoding significantly enhances the validity and alignment of generated tables, but may degrade performance in contexts involving densely packed textual information.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a comprehensive evaluation of structured decoding for text-to-table generation with large language models (LLMs). While previous work has primarily focused on unconstrained generation of tables, the impact of enforcing structural constraints during generation remains underexplored. We systematically compare schema-guided (structured) decoding to standard one-shot prompting across three diverse benchmarks - E2E, Rotowire, and Livesum - using open-source LLMs of up to 32B parameters, assessing the performance of table generation approaches in resource-constrained settings. Our experiments cover a wide range of evaluation metrics at cell, row, and table levels. Results demonstrate that structured decoding significantly enhances the validity and alignment of generated tables, particularly in scenarios demanding precise numerical alignment (Rotowire), but may degrade performance in contexts involving densely packed textual information (E2E) or extensive aggregation over lengthy texts (Livesum). We further analyze the suitability of different evaluation metrics and discuss the influence of model size.
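To make the schema-guided idea concrete, here is a minimal toy sketch (not the paper's implementation): it enforces a hypothetical box-score schema on a model's JSON output by discarding rows whose column names or value types do not match. True structured decoding applies such constraints *during* token generation; this sketch only approximates the constraint post hoc.

```python
import json

# Hypothetical Rotowire-style schema mapping column names to Python types.
# (Illustrative only; the benchmark's actual schemas differ.)
SCHEMA = {"player": str, "points": int, "assists": int}

def validate_table(raw: str, schema: dict) -> list[dict]:
    """Parse JSON model output, keeping only rows that match the schema.

    Structured decoding would enforce this shape while generating;
    here we filter invalid rows after the fact."""
    rows = json.loads(raw)
    return [
        row for row in rows
        if set(row) == set(schema)
        and all(isinstance(row[col], t) for col, t in schema.items())
    ]

output = ('[{"player": "A. Smith", "points": 31, "assists": 5},'
          ' {"player": "B. Jones", "points": "n/a", "assists": 7}]')
print(validate_table(output, SCHEMA))
# the second row is dropped: "n/a" fails the int check on "points"
```

In practice, libraries for constrained generation compile such a schema into a mask over the token vocabulary, so an invalid row like the second one can never be emitted in the first place.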
Related papers
- Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation [11.450834626205676]
Table-BiEval is a novel approach based on a human-free, self-supervised evaluation framework. It calculates Content Semantic Accuracy and Normalized Tree Edit Distance to decouple structure from content. Results reveal substantial variability, highlighting that mid-sized models can surprisingly outperform larger counterparts in structural efficiency.
arXiv Detail & Related papers (2026-01-09T07:38:27Z) - TabReX: Tabular Referenceless eXplainable Evaluation [15.411207072791806]
TabReX is a reference-less, property-driven framework for evaluating tables generated by large language models. It computes interpretable, rubric-aware scores that quantify structural and factual fidelity. To assess robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types.
arXiv Detail & Related papers (2025-12-17T19:20:20Z) - Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings [16.728984584960738]
This paper introduces and systematically evaluates a new paradigm for generating structure-aware text embeddings. We investigate two primary in-process methods: sequential concatenation and parallel caching. Our analysis reveals critical trade-offs: sequential concatenation excels with noisy, moderate-length contexts, while parallel caching scales more effectively to long, high-signal contexts but is more susceptible to distractors.
arXiv Detail & Related papers (2025-10-09T19:45:54Z) - Multi-Dimensional Summarization Agents with Context-Aware Reasoning over Enterprise Tables [0.0]
We propose a novel framework for summarizing structured enterprise data across multiple dimensions using large language model (LLM)-based agents. Our method introduces a multi-agent pipeline that extracts, analyzes, and summarizes multi-dimensional data using agents for slicing, variance detection, context construction, and LLM-based generation. We evaluate the framework on Kaggle datasets and demonstrate significant improvements in faithfulness, relevance, and insight quality over baseline table summarization approaches.
arXiv Detail & Related papers (2025-08-10T05:27:42Z) - Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [80.88654868264645]
The Arranged and Organized Extraction (AOE) benchmark is designed to evaluate the ability of large language models to comprehend fragmented documents. AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries. Results show that even the most advanced models struggled significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z) - Map&Make: Schema Guided Text to Table Generation [41.52038779169547]
Text-to-Table generation is an essential task for information retrieval. We introduce a versatile approach, Map&Make, which "dissects" text into propositional atomic statements. Our approach is tested against two challenging datasets, Rotowire and Livesum.
arXiv Detail & Related papers (2025-05-29T07:12:46Z) - Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation [56.82274763974443]
ICAT is an evaluation framework for measuring coverage of diverse factual information in long-form text generation. It computes the alignment between the atomic factual claims and various aspects expected to be presented in the output. Our framework offers interpretable and fine-grained analysis of diversity and coverage.
arXiv Detail & Related papers (2025-01-07T05:43:23Z) - Towards More Effective Table-to-Text Generation: Assessing In-Context Learning and Self-Evaluation with Open-Source Models [0.0]
This study explores the effectiveness of various in-context learning strategies in language models (LMs) across benchmark datasets.
We employ a large language model (LLM) self-evaluation approach using chain-of-thought reasoning and assess its correlation with human-aligned metrics like BERTScore.
Our findings highlight the significant impact of examples in improving table-to-text generation and suggest that, while LLM self-evaluation has potential, its current alignment with human judgment could be enhanced.
arXiv Detail & Related papers (2024-10-15T09:19:42Z) - Multi-Dimensional Evaluation of Text Summarization with In-Context Learning [79.02280189976562]
In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning.
Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization.
We then analyze the effects of factors such as the selection and number of in-context examples on performance.
arXiv Detail & Related papers (2023-06-01T23:27:49Z) - Attend, Memorize and Generate: Towards Faithful Table-to-Text Generation in Few Shots [58.404516361586325]
Few-shot table-to-text generation is a task of composing fluent and faithful sentences to convey table content using limited data.
This paper proposes a novel approach, Attend, Memorize and Generate (AMG), inspired by the text generation process of humans.
arXiv Detail & Related papers (2022-03-01T20:37:20Z) - Few-Shot Table-to-Text Generation with Prototype Memory [14.69889589370148]
We propose a new framework: Prototype-to-Generate (P2G), for table-to-text generation under the few-shot scenario.
The proposed framework utilizes the retrieved prototypes, which are jointly selected by an IR system and a novel prototype selector.
Experimental results on three benchmark datasets with three state-of-the-art models demonstrate that the proposed framework significantly improves the model performance.
arXiv Detail & Related papers (2021-08-27T22:16:30Z) - Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), which jointly learns representations of natural language utterances and table schemas by leveraging generation models to produce pre-training data.
Experimental results show that neural semantic parsers leveraging GAP obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z) - Towards Faithful Neural Table-to-Text Generation with Content-Matching Constraints [63.84063384518667]
We propose a novel Transformer-based generation framework to achieve faithful table-to-text generation.
Core techniques in our method to enforce faithfulness include a new table-text optimal-transport matching loss.
To evaluate faithfulness, we propose a new automatic metric specialized to the table-to-text generation problem.
arXiv Detail & Related papers (2020-05-03T02:54:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.