CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval
- URL: http://arxiv.org/abs/2601.15849v1
- Date: Thu, 22 Jan 2026 10:58:56 GMT
- Title: CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval
- Authors: Tsung-Hsiang Chou, Chen-Jui Yu, Shui-Hsiang Hsu, Yao-Chung Fan
- Abstract summary: We introduce CGPT, a training framework that enhances table retrieval through LLM-generated supervision. CGPT consistently outperforms retrieval baselines, including QGpT, with an average R@1 improvement of 16.54 percent. Results indicate that semantically guided partial-table construction, combined with contrastive training from LLM-generated supervision, provides an effective and scalable paradigm for large-scale table retrieval.
- Score: 1.483000637348699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General-purpose embedding models have demonstrated strong performance in text retrieval but remain suboptimal for table retrieval, where highly structured content leads to semantic compression and query-table mismatch. Recent LLM-based retrieval augmentation methods mitigate this issue by generating synthetic queries, yet they often rely on heuristic partial-table selection and seldom leverage these synthetic queries as supervision to improve the embedding model. We introduce CGPT, a training framework that enhances table retrieval through LLM-generated supervision. CGPT constructs semantically diverse partial tables by clustering table instances using K-means and sampling across clusters to broaden semantic coverage. An LLM then generates synthetic queries for these partial tables, which are used in hard-negative contrastive fine-tuning to refine the embedding model. Experiments across four public benchmarks (MimoTable, OTTQA, FetaQA, and E2E-WTQ) show that CGPT consistently outperforms retrieval baselines, including QGpT, with an average R@1 improvement of 16.54 percent. In a unified multi-domain corpus setting, CGPT further demonstrates strong cross-domain generalization and remains effective even when using smaller LLMs for synthetic query generation. These results indicate that semantically guided partial-table construction, combined with contrastive training from LLM-generated supervision, provides an effective and scalable paradigm for large-scale table retrieval. Our code is available at https://github.com/yumeow0122/CGPT.
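The cluster-guided partial-table step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released code: the row-embedding function, the cluster count `k`, and the centroid-nearest sampling rule are all assumptions made for the sketch.

```python
# Sketch: build a semantically diverse partial table by K-means clustering
# table rows and keeping one representative row per cluster.
import numpy as np
from sklearn.cluster import KMeans

def build_partial_table(rows, embed, k=4, seed=0):
    """Select one representative row per K-means cluster.

    rows  : list of table rows (e.g., lists of cell values)
    embed : callable mapping a row to a 1-D embedding vector (assumed)
    k     : number of clusters = rows kept in the partial table
    """
    X = np.stack([embed(r) for r in rows])
    k = min(k, len(rows))
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    partial = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        # keep the row closest to the cluster centroid as its representative
        dist = np.linalg.norm(X[idx] - km.cluster_centers_[c], axis=1)
        partial.append(rows[int(idx[np.argmin(dist)])])
    return partial
```

In the paper's pipeline, each such partial table would then be passed to an LLM to generate synthetic queries, which serve as positives (with hard negatives mined from other tables) for contrastive fine-tuning of the embedding model.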
Related papers
- STAR: Semantic Table Representation with Header-Aware Clustering and Adaptive Weighted Fusion [1.483000637348699]
STAR (Semantic Table Representation) is a lightweight framework that improves semantic table representation through semantic clustering and weighted fusion. We show that STAR achieves consistently higher Recall than QGpT on all datasets.
arXiv Detail & Related papers (2026-01-22T11:08:46Z) - CORE-T: COherent REtrieval of Tables for Text-to-SQL [91.76918495375384]
CORE-T is a scalable, training-free framework that enriches tables with purpose metadata and pre-computes a lightweight table-compatibility cache. Across Bird, Spider, and MMQA, CORE-T improves table-selection F1 by up to 22.7 points while retrieving up to 42% fewer tables.
arXiv Detail & Related papers (2026-01-19T14:51:23Z) - Hint-Augmented Re-ranking: Efficient Product Search using LLM-Based Query Decomposition [20.966359103135762]
We show that LLMs can uncover latent intent behind superlatives in e-commerce queries. Our approach decomposes queries into attribute-value hints generated concurrently with retrieval. Our method improves search performance by 10.9 points in MAP and ranking by 5.9 points in MRR over baselines.
arXiv Detail & Related papers (2025-11-17T23:53:25Z) - A Hybrid Search for Complex Table Question Answering in Securities Report [0.9430947207126281]
We propose a cell extraction method for Table Question Answering (TQA) without manual identification. Our approach estimates table headers by computing similarities between a given question and individual cells. We then select as the answer the cells at the intersection of the most relevant row and column.
arXiv Detail & Related papers (2025-11-12T10:19:27Z) - REaR: Retrieve, Expand and Refine for Effective Multitable Retrieval [46.38349148493421]
REAR (Retrieve, Expand and Refine) is a three-stage framework for efficient, high-fidelity multi-table retrieval. REAR retrieves query-aligned tables, expands these with structurally joinable tables, and refines them by pruning noisy or weakly related candidates. REAR is retriever-agnostic and consistently improves dense/sparse retrievers on complex table QA datasets.
arXiv Detail & Related papers (2025-11-02T05:01:04Z) - LLM-guided Hierarchical Retrieval [54.73080745446999]
LATTICE is a hierarchical retrieval framework that enables an LLM to reason over and navigate large corpora with logarithmic search complexity. A central challenge in such LLM-guided search is that the model's relevance judgments are noisy, context-dependent, and unaware of the hierarchy. Our framework achieves state-of-the-art zero-shot performance on the reasoning-intensive BRIGHT benchmark.
arXiv Detail & Related papers (2025-10-15T07:05:17Z) - HyST: LLM-Powered Hybrid Retrieval over Semi-Structured Tabular Data [0.4779196219827507]
HyST (Hybrid retrieval over Semi-structured Tabular data) is a hybrid retrieval framework that combines structured filtering with semantic embedding search. We show that HyST consistently outperforms traditional baselines on a semi-structured benchmark.
arXiv Detail & Related papers (2025-08-25T14:06:27Z) - Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models [52.94091440130039]
Table reasoning (TR) requires structured reasoning over semi-structured data. Small language models (SLMs) have limited capacity compared to large language models (LLMs, e.g., GPT-4o). We propose program-based TR (P-TR), which circumvents key limitations of text-based TR (T-TR) by generating executable programs. Experiments on four TR benchmarks demonstrate that Table-r1 outperforms all SLM-based methods.
arXiv Detail & Related papers (2025-06-06T14:52:19Z) - LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations. This structured approach allows Large Language Models (LLMs) to generate and execute SQL queries, enhancing generalization and mitigating biases.
arXiv Detail & Related papers (2025-06-06T05:14:04Z) - Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity [59.57065228857247]
Retrieval-augmented Large Language Models (LLMs) have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA).
We propose a novel adaptive QA framework that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs based on query complexity.
We validate our model on a set of open-domain QA datasets covering multiple query complexities, and show that it enhances the overall efficiency and accuracy of QA systems.
arXiv Detail & Related papers (2024-03-21T13:52:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.