STAR: Semantic Table Representation with Header-Aware Clustering and Adaptive Weighted Fusion
- URL: http://arxiv.org/abs/2601.15860v1
- Date: Thu, 22 Jan 2026 11:08:46 GMT
- Title: STAR: Semantic Table Representation with Header-Aware Clustering and Adaptive Weighted Fusion
- Authors: Shui-Hsiang Hsu, Tsung-Hsiang Chou, Chen-Jui Yu, Yao-Chung Fan,
- Abstract summary: STAR (Semantic Table Representation) is a lightweight framework that improves semantic table representation through semantic clustering and weighted fusion.<n>We show that STAR achieves consistently higher Recall than QGpT on all datasets.
- Score: 1.483000637348699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Table retrieval is the task of retrieving the most relevant tables from large-scale corpora given natural language queries. However, structural and semantic discrepancies between unstructured text and structured tables make embedding alignment particularly challenging. Recent methods such as QGpT attempt to enrich table semantics by generating synthetic queries, yet they still rely on coarse partial-table sampling and simple fusion strategies, which limit semantic diversity and hinder effective query-table alignment. We propose STAR (Semantic Table Representation), a lightweight framework that improves semantic table representation through semantic clustering and weighted fusion. STAR first applies header-aware K-means clustering to group semantically similar rows and selects representative centroid instances to construct a diverse partial table. It then generates cluster-specific synthetic queries to comprehensively cover the table's semantic space. Finally, STAR employs weighted fusion strategies to integrate table and query embeddings, enabling fine-grained semantic alignment. This design enables STAR to capture complementary information from structured and textual sources, improving the expressiveness of table representations. Experiments on five benchmarks show that STAR achieves consistently higher Recall than QGpT on all datasets, demonstrating the effectiveness of semantic clustering and adaptive weighted fusion for robust table representation. Our code is available at https://github.com/adsl135789/STAR.
Related papers
- HELIOS: Harmonizing Early Fusion, Late Fusion, and LLM Reasoning for Multi-Granular Table-Text Retrieval [23.01689764247965]
Table-text retrieval aims to retrieve relevant tables and text to support open-domain question answering.<n>Existing studies use either early or late fusion, but face limitations.<n>We propose HELIOS, which combines the strengths of both approaches.
arXiv Detail & Related papers (2026-02-25T15:42:24Z) - CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval [1.483000637348699]
We introduce CGPT, a training framework that enhances table retrieval through LLM-generated supervision.<n>CGPT consistently outperforms retrieval baselines, including QGpT, with an average R@1 improvement of 16.54 percent.<n>Results indicate that semantically guided partial-table construction, combined with contrastive training from LLM-generated supervision, provides an effective and scalable paradigm for large-scale table retrieval.
arXiv Detail & Related papers (2026-01-22T10:58:56Z) - TableDART: Dynamic Adaptive Multi-Modal Routing for Table Understanding [52.59372043981724]
TableDART is a training-efficient framework that integrates multimodal views by reusing pretrained single-modality models.<n>In addition, we propose a novel agent to cross-modal knowledge integration by analyzing outputs from text- and image-based models.
arXiv Detail & Related papers (2025-09-18T07:00:13Z) - Improving Table Understanding with LLMs and Entity-Oriented Search [24.3302301035859]
We introduce an entity-oriented search method to improve table understanding with large language models (LLMs)<n>This approach effectively leverages the semantic similarities between questions and table data, as well as the implicit relationships between table cells.<n>It focuses on table entities, ensuring that table cells are semantically tightly bound, thereby enhancing contextual clarity.
arXiv Detail & Related papers (2025-08-23T14:02:45Z) - Improving Table Retrieval with Question Generation from Partial Tables [2.2169618382995764]
We propose QGpT, a simple yet effective method that uses an LLM to generate synthetic questions based on small portions of a table.<n>The generated questions are then jointly embedded with the partial table segments used for generation, enhancing semantic alignment with user queries.
arXiv Detail & Related papers (2025-08-08T09:35:56Z) - Bridging Queries and Tables through Entities in Table Retrieval [70.13748256886288]
Entities are well-studied in the context of text retrieval, but there is a noticeable lack of research on their applications in table retrieval.<n>We propose an entity-enhanced training framework and design an interaction paradigm based on entity representations.<n>Our proposed framework is plug-and-play and flexible, making it easy to integrate into existing table retriever training processes.
arXiv Detail & Related papers (2025-04-09T03:16:33Z) - Tailoring Table Retrieval from a Field-aware Hybrid Matching Perspective [70.13748256886288]
Table retrieval is less explored compared to text retrieval.<n>Different table fields have varying matching preferences.<n>We introduce a Table-tailored HYbrid Matching rEtriever (THYME)
arXiv Detail & Related papers (2025-03-04T03:57:10Z) - Chain-of-Table: Evolving Tables in the Reasoning Chain for Table
Understanding [79.9461269253121]
We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts.
Chain-of-Table achieves new state-of-the-art performance on WikiTQ, FeTaQA, and TabFact benchmarks.
arXiv Detail & Related papers (2024-01-09T07:46:26Z) - SEMv2: Table Separation Line Detection Based on Instance Segmentation [96.36188168694781]
We propose an accurate table structure recognizer, termed SEMv2 (SEM: Split, Embed and Merge)
We address the table separation line instance-level discrimination problem and introduce a table separation line detection strategy based on conditional convolution.
To comprehensively evaluate the SEMv2, we also present a more challenging dataset for table structure recognition, dubbed iFLYTAB.
arXiv Detail & Related papers (2023-03-08T05:15:01Z) - Retrieving Complex Tables with Multi-Granular Graph Representation
Learning [20.72341939868327]
The task of natural language table retrieval seeks to retrieve semantically relevant tables based on natural language queries.
Existing learning systems treat tables as plain text based on the assumption that tables are structured as dataframes.
We propose Graph-based Table Retrieval (GTR), a generalizable NLTR framework with multi-granular graph representation learning.
arXiv Detail & Related papers (2021-05-04T20:19:03Z) - GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-pairs over high-free tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.