Texts or Images? A Fine-grained Analysis on the Effectiveness of Input Representations and Models for Table Question Answering
- URL: http://arxiv.org/abs/2505.14131v1
- Date: Tue, 20 May 2025 09:36:17 GMT
- Title: Texts or Images? A Fine-grained Analysis on the Effectiveness of Input Representations and Models for Table Question Answering
- Authors: Wei Zhou, Mohsen Mesgar, Heike Adel, Annemarie Friedrich
- Abstract summary: We conduct the first controlled study on the effectiveness of several combinations of table representations and models from two perspectives. We find that the best combination of table representation and model varies across setups. We propose FRES, a method selecting table representations dynamically, and observe a 10% average performance improvement.
- Score: 16.790216473975146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In table question answering (TQA), tables are encoded as either texts or images. Prior work suggests that passing images of tables to multi-modal large language models (MLLMs) performs comparably to or even better than using textual input with large language models (LLMs). However, the lack of controlled setups limits fine-grained distinctions between these approaches. In this paper, we conduct the first controlled study on the effectiveness of several combinations of table representations and models from two perspectives: question complexity and table size. We build a new benchmark based on existing TQA datasets. In a systematic analysis of seven pairs of MLLMs and LLMs, we find that the best combination of table representation and model varies across setups. We propose FRES, a method selecting table representations dynamically, and observe a 10% average performance improvement compared to using both representations indiscriminately.
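The abstract leaves FRES unspecified beyond "selecting table representations dynamically." Purely as an illustration of what such routing could look like, the Python sketch below dispatches a table either to a text-serializing LLM pipeline or to an image-based MLLM pipeline; the thresholds, keyword heuristics, and the `llm`/`mllm` wrappers are assumptions, not the paper's method.

```python
# Hypothetical sketch of dynamic table-representation selection in the spirit of FRES.
# The routing heuristics and model wrappers are assumptions, not the paper's method.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Table:
    header: List[str]
    rows: List[List[str]]

def serialize_as_text(table: Table) -> str:
    """Flatten the table into a pipe-separated textual representation."""
    lines = [" | ".join(table.header)]
    lines += [" | ".join(map(str, row)) for row in table.rows]
    return "\n".join(lines)

def choose_representation(table: Table, question: str,
                          max_cells_for_image: int = 500) -> str:
    """Rough routing rule: large tables or arithmetic-style questions go to text,
    small tables with simple lookups go to the image pipeline."""
    n_cells = len(table.rows) * len(table.header)
    looks_complex = any(kw in question.lower()
                        for kw in ("average", "difference", "how many", "total"))
    if n_cells > max_cells_for_image or looks_complex:
        return "text"
    return "image"

def answer(table: Table, question: str,
           llm: Callable[[str], str], mllm: Callable[[Table, str], str]) -> str:
    """Dispatch to a text LLM or a multi-modal LLM based on the chosen representation."""
    if choose_representation(table, question) == "text":
        prompt = f"Table:\n{serialize_as_text(table)}\n\nQuestion: {question}\nAnswer:"
        return llm(prompt)
    return mllm(table, question)  # the MLLM wrapper is expected to render the table itself
```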
Related papers
- HIPPO: Enhancing the Table Understanding Capability of Large Language Models through Hybrid-Modal Preference Optimization [48.240146108630704]
This paper introduces the HybrId-modal Preference oPtimizatiOn (HIPPO) model, which represents tables using both text and image representations. Experimental results on table question answering and table fact verification tasks demonstrate the effectiveness of HIPPO.
arXiv Detail & Related papers (2025-02-24T16:50:55Z) - SynTQA: Synergistic Table-based Question Answering via Mixture of Text-to-SQL and E2E TQA [25.09488366689108]
Text-to-SQL parsing and end-to-end question answering (E2E TQA) are two main approaches to the table-based question answering task.
Despite success on multiple benchmarks, they have yet to be compared and their synergy remains unexplored.
We identify different strengths and weaknesses through evaluating state-of-the-art models on benchmark datasets.
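As a rough illustration of how the two routes can be combined, the sketch below prefers an executable Text-to-SQL parse and falls back to an end-to-end TQA model. Note that SynTQA itself trains an answer selector; the simple fallback rule and the `text_to_sql`/`e2e_tqa` callables here are assumptions.

```python
# Illustrative combination of a Text-to-SQL parser with an end-to-end TQA model.
# The fallback rule is a simplification; SynTQA trains a learned answer selector.
import sqlite3
from typing import Callable, Optional

def run_sql(db_path: str, query: str) -> Optional[str]:
    """Execute a generated SQL query; return None on any failure or empty result."""
    try:
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(query).fetchall()
        return str(rows[0][0]) if rows else None
    except sqlite3.Error:
        return None

def synergistic_answer(question: str, db_path: str,
                       text_to_sql: Callable[[str], str],
                       e2e_tqa: Callable[[str], str]) -> str:
    """Prefer the executable SQL route; fall back to the end-to-end model otherwise."""
    sql = text_to_sql(question)
    result = run_sql(db_path, sql)
    return result if result is not None else e2e_tqa(question)
```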
arXiv Detail & Related papers (2024-09-25T07:18:45Z) - FLEXTAF: Enhancing Table Reasoning with Flexible Tabular Formats [48.47559543509975]
We propose FLEXTAF-Single and FLEXTAF-Vote to enhance table reasoning performance by employing flexible formats.
Our experiments on WikiTableQuestions and TabFact reveal significant improvements, with average gains of 2.3% and 4.8%.
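A hedged sketch of the voting idea: serialize the same table in several formats, query the model once per format, and take the majority answer. The format names, prompt wording, and the `llm` callable are assumptions, not the FLEXTAF implementation.

```python
# Minimal sketch of voting across flexible table formats (assumed formats and prompts).
from collections import Counter
from typing import Callable, Dict, List

def vote_over_formats(question: str,
                      serializations: Dict[str, str],
                      llm: Callable[[str], str]) -> str:
    """Query the model once per table format (e.g. markdown, CSV, JSON)
    and return the majority answer."""
    answers: List[str] = []
    for fmt, table_str in serializations.items():
        prompt = f"Table ({fmt}):\n{table_str}\n\nQuestion: {question}\nAnswer briefly:"
        answers.append(llm(prompt).strip().lower())
    return Counter(answers).most_common(1)[0][0]
```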
arXiv Detail & Related papers (2024-08-16T17:00:11Z) - Multimodal Table Understanding [26.652797853893233]
How to directly understand tables using intuitive visual information is a crucial and urgent challenge for developing more practical applications.
We propose a new problem, multimodal table understanding, where the model needs to generate correct responses to various table-related requests.
We develop Table-LLaVA, a generalist multimodal large language model (MLLM), which significantly outperforms recent open-source MLLM baselines on 23 benchmarks.
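Feeding tables to an MLLM presupposes rendering them as images. The snippet below is a generic matplotlib-based rasterizer included only to make that input format concrete; it is not Table-LLaVA's own rendering pipeline.

```python
# Generic sketch of rendering a table as an image for a multi-modal model
# (not Table-LLaVA's pipeline).
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def render_table_image(header, rows, out_path="table.png"):
    """Rasterize a small table so it can be passed to an MLLM as image input."""
    fig, ax = plt.subplots(figsize=(max(4, len(header) * 1.5), max(2, 0.4 * len(rows))))
    ax.axis("off")
    tbl = ax.table(cellText=rows, colLabels=header, loc="center", cellLoc="center")
    tbl.auto_set_font_size(False)
    tbl.set_fontsize(9)
    fig.savefig(out_path, dpi=200, bbox_inches="tight")
    plt.close(fig)
    return out_path

render_table_image(["Country", "Gold"], [["USA", "39"], ["China", "38"]])
```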
arXiv Detail & Related papers (2024-06-12T11:27:03Z) - TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains [4.828743805126944]
This paper establishes a benchmark for table visual question answering, referred to as the TableVQA-Bench.
It is important to note that existing datasets have not incorporated images or QA pairs, which are two crucial components of TableVQA.
arXiv Detail & Related papers (2024-04-30T02:05:18Z) - TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning [55.33939289989238]
We propose TAP4LLM as a versatile pre-processor suite for leveraging large language models (LLMs) in table-based tasks effectively.
It covers several distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding.
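The three components could be approximated by a small pipeline like the sketch below; every function body is a simplified assumption rather than the released TAP4LLM toolkit.

```python
# Illustrative pre-processing pipeline mirroring the three TAP4LLM components;
# all function bodies are simplified assumptions.
from typing import Dict, List

def sample_rows(header: List[str], rows: List[List[str]],
                question: str, k: int = 20) -> List[List[str]]:
    """(1) Table sampling: keep rows that share tokens with the question, else the first k."""
    q_tokens = set(question.lower().split())
    hits = [r for r in rows if q_tokens & set(" ".join(map(str, r)).lower().split())]
    return (hits or rows)[:k]

def augment(header: List[str], external_notes: Dict[str, str]) -> List[str]:
    """(2) Table augmentation: attach external knowledge, e.g. column descriptions."""
    return [f"{col}: {external_notes[col]}" for col in header if col in external_notes]

def pack_markdown(header: List[str], rows: List[List[str]], notes: List[str]) -> str:
    """(3) Packing & serialization: emit a markdown table plus the augmented notes."""
    md = ["| " + " | ".join(header) + " |",
          "| " + " | ".join("---" for _ in header) + " |"]
    md += ["| " + " | ".join(map(str, r)) + " |" for r in rows]
    return "\n".join(md + ([""] + notes if notes else []))
```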
arXiv Detail & Related papers (2023-12-14T15:37:04Z) - HeLM: Highlighted Evidence augmented Language Model for Enhanced Table-to-Text Generation [7.69801337810352]
We conduct parameter-efficient fine-tuning on the LLaMA2 model.
Our approach involves injecting reasoning information into the input by emphasizing table-specific row data.
On both the FetaQA and QTSumm datasets, our approach achieved state-of-the-art results.
arXiv Detail & Related papers (2023-11-15T12:02:52Z) - Doc2SoarGraph: Discrete Reasoning over Visually-Rich Table-Text Documents via Semantic-Oriented Hierarchical Graphs [79.0426838808629]
We explore a more realistic problem setting, TAT-DQA, i.e., answering questions over visually-rich table-text documents.
Specifically, we propose a novel Doc2SoarGraph framework with enhanced discrete reasoning capability.
We conduct extensive experiments on TAT-DQA dataset, and the results show that our proposed framework outperforms the best baseline model by 17.73% and 16.91% in terms of Exact Match (EM) and F1 score respectively on the test set.
arXiv Detail & Related papers (2023-05-03T07:30:32Z) - Table Retrieval May Not Necessitate Table-specific Model Design [83.27735758203089]
We focus on the task of table retrieval, and ask: "is table-specific model design necessary for table retrieval?"
Based on an analysis of a table-based portion of the Natural Questions dataset (NQ-table), we find that structure plays a negligible role in more than 70% of the cases.
We then experiment with three modules to explicitly encode table structures, namely auxiliary row/column embeddings, hard attention masks, and soft relation-based attention biases.
None of these yielded significant improvements, suggesting that table-specific model design may not be necessary for table retrieval.
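To make the first of the three probed modules concrete, here is a minimal PyTorch-style sketch of auxiliary row/column embeddings added to cell token embeddings; the dimensions and the surrounding encoder are placeholders, not the paper's exact configuration.

```python
# Sketch of the "auxiliary row/column embeddings" idea: add learned row and column
# position embeddings to each cell token embedding before the encoder.
import torch
import torch.nn as nn

class RowColEmbedding(nn.Module):
    def __init__(self, hidden_size: int = 768, max_rows: int = 256, max_cols: int = 64):
        super().__init__()
        self.row_emb = nn.Embedding(max_rows, hidden_size)
        self.col_emb = nn.Embedding(max_cols, hidden_size)

    def forward(self, token_embeddings: torch.Tensor,
                row_ids: torch.Tensor, col_ids: torch.Tensor) -> torch.Tensor:
        """token_embeddings: (batch, seq, hidden); row_ids/col_ids: (batch, seq)."""
        return token_embeddings + self.row_emb(row_ids) + self.col_emb(col_ids)
```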
arXiv Detail & Related papers (2022-05-19T20:35:23Z) - GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar (SCFG).
To maintain the model's ability to represent real-world data, we also include masked language modeling.
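A toy version of grammar-based synthesis: fill both sides of an aligned (question, SQL) template over a table schema. The templates below are assumptions and are far simpler than GraPPa's SCFG, which is induced from existing text-to-SQL datasets.

```python
# Toy template expansion in the spirit of a synchronous grammar:
# each template yields an aligned (question, SQL) pair over a given table schema.
# The templates are illustrative assumptions only.
import random
from typing import List, Tuple

TEMPLATES = [
    ("what is the {col} of {val}?",
     "SELECT {col} FROM {table} WHERE {key} = '{val}'"),
    ("how many rows have {col} equal to {val}?",
     "SELECT COUNT(*) FROM {table} WHERE {col} = '{val}'"),
]

def generate_pairs(table: str, key: str, columns: List[str],
                   values: List[str], n: int = 10) -> List[Tuple[str, str]]:
    """Sample aligned question-SQL pairs by filling both sides of a template."""
    pairs = []
    for _ in range(n):
        q_tmpl, sql_tmpl = random.choice(TEMPLATES)
        slots = dict(table=table, key=key,
                     col=random.choice(columns), val=random.choice(values))
        pairs.append((q_tmpl.format(**slots), sql_tmpl.format(**slots)))
    return pairs
```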
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.