Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance
- URL: http://arxiv.org/abs/2602.03491v1
- Date: Tue, 03 Feb 2026 13:08:31 GMT
- Title: Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance
- Authors: Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Xiaoqiang Zhou, Min Zhang,
- Abstract summary: Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information.<n>Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability.<n>This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools?
- Score: 43.49944599088126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to tables structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLM's table understanding and reasoning capabilities, particularly generalizing to unseen table structures.
Related papers
- Multimodal Tabular Reasoning with Privileged Structured Information [67.40011423365712]
We introduce TabUlar Reasoning with Bridged infOrmation (sc Turbo)<n>sc Turbo benefits from a structure-aware reasoning trace generator based on DeepSeek-R1.<n>sc Turbo achieves state-of-the-art performance ($+7.2%$ vs. previous SOTA) across multiple datasets.
arXiv Detail & Related papers (2025-06-04T15:46:30Z) - Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports [4.2134954427867]
We propose a method to enhance LVLM-based table understanding by incorporating in-table textual content and layout features.<n> Experimental results demonstrate that these auxiliary modalities significantly improve performance.
arXiv Detail & Related papers (2025-05-23T08:36:22Z) - NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables [47.937322344856945]
textscNeedleInATable (NIAT) treats each table cell as a needle'' and requires models to extract the target cell based on cell locations or lookup questions.
arXiv Detail & Related papers (2025-04-09T03:46:56Z) - Theme-Explanation Structure for Table Summarization using Large Language Models: A Case Study on Korean Tabular Data [1.0621665950143144]
Current table summarization methods often neglect the crucial aspect of human-friendly output.<n>This paper introduces the Theme-Explanation Structure-based Table Summarization (Tabular-TX) pipeline.
arXiv Detail & Related papers (2025-01-17T08:42:49Z) - Tree-of-Table: Unleashing the Power of LLMs for Enhanced Large-Scale Table Understanding [42.841205217768106]
"Tree-of-Table" is a novel approach designed to enhance LLMs' reasoning capabilities over large and complex tables.
We show that Tree-of-Table sets a new benchmark with superior performance, showcasing remarkable efficiency and generalization capabilities in large-scale table reasoning.
arXiv Detail & Related papers (2024-11-13T11:02:04Z) - Structure Guided Prompt: Instructing Large Language Model in Multi-Step
Reasoning by Exploring Graph Structure of the Text [44.81698187939784]
This paper introduces Structure Guided Prompt, a framework designed to improve the multi-step reasoning capabilities of Large Language Models (LLMs)
Our experiments show that this framework significantly enhances the reasoning capabilities of LLMs, enabling them to excel in a broader spectrum of natural language scenarios.
arXiv Detail & Related papers (2024-02-20T22:56:23Z) - TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning [55.33939289989238]
We propose TAP4LLM as a versatile pre-processor suite for leveraging large language models (LLMs) in table-based tasks effectively.
It covers several distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding.
arXiv Detail & Related papers (2023-12-14T15:37:04Z) - Parrot Mind: Towards Explaining the Complex Task Reasoning of Pretrained Large Language Models with Template-Content Structure [66.33623392497599]
We show that a structure called template-content structure (T-C structure) can reduce the possible space from exponential level to linear level.
We demonstrate that models can achieve task composition, further reducing the space needed to learn from linear to logarithmic.
arXiv Detail & Related papers (2023-10-09T06:57:45Z) - Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal
Structured Representations [70.41385310930846]
We present an end-to-end framework Structure-CLIP to enhance multi-modal structured representations.
We use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations.
A Knowledge-Enhance (KEE) is proposed to leverage SGK as input to further enhance structured representations.
arXiv Detail & Related papers (2023-05-06T03:57:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.