RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking
- URL: http://arxiv.org/abs/2504.01346v4
- Date: Sun, 05 Oct 2025 07:24:41 GMT
- Title: RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking
- Authors: Jiaru Zou, Dongqi Fu, Sirui Chen, Xinrui He, Zihao Li, Yada Zhu, Jiawei Han, Jingrui He
- Abstract summary: In real-world scenarios, beyond pure text, a substantial amount of knowledge is stored in tables. We first propose a table-corpora-aware RAG framework, named T-RAG, which consists of the hierarchical memory index, multi-stage retrieval, and graph-aware prompting.
- Score: 63.253294691180635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating them with an external knowledge base to improve answer relevance and accuracy. In real-world scenarios, beyond pure text, a substantial amount of knowledge is stored in tables, and user questions often require retrieving answers that are distributed across multiple tables. Retrieving knowledge from table corpora (i.e., collections of individual tables) for a question remains nascent, at least with respect to (i) how to understand intra- and inter-table knowledge effectively, (ii) how to filter out unnecessary tables and retrieve the most relevant tables efficiently, (iii) how to prompt LLMs to infer over the retrieved tables, and (iv) how to evaluate the corresponding performance in a realistic setting. To address these challenges, we first propose a table-corpora-aware RAG framework, named T-RAG, which consists of a hierarchical memory index, multi-stage retrieval, and graph-aware prompting for effective and efficient table knowledge retrieval and inference. Further, we first develop a multi-table question answering benchmark named MultiTableQA, which spans 3 different task types, 57,193 tables, and 23,758 questions in total, with all sources drawn from real-world scenarios. Based on MultiTableQA, we conduct a holistic comparison of table retrieval methods, RAG methods, and table-to-graph representation learning methods, in which T-RAG shows leading accuracy, recall, and running time performance. We also evaluate how much T-RAG improves the inference ability of different LLMs. Code and data are available at https://github.com/jiaruzouu/T-RAG
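The abstract names multi-stage retrieval over a table corpus without giving its details. As a rough illustration of the general coarse-to-fine idea only, and not the T-RAG algorithm itself, the sketch below first filters tables by schema text (title and column names) and then reranks the survivors using their cell contents. The bag-of-words cosine scoring, the sample tables, and every name here (`retrieve`, `TABLES`, `coarse_k`, `final_k`) are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def table_text(table, with_cells):
    """Serialize a table: schema only, or schema plus all cell values."""
    parts = [table["title"]] + list(table["columns"])
    if with_cells:
        parts += [str(cell) for row in table["rows"] for cell in row]
    return " ".join(parts)


def retrieve(question, tables, coarse_k=2, final_k=2):
    """Hypothetical two-stage retrieval (an assumption, not T-RAG's method)."""
    query = Counter(tokenize(question))

    def score(table, with_cells):
        return cosine(query, Counter(tokenize(table_text(table, with_cells))))

    # Stage 1 (coarse): cheap filter using only titles and column names.
    coarse = sorted(tables, key=lambda t: score(t, False), reverse=True)[:coarse_k]
    # Stage 2 (fine): rerank the surviving tables using their cell contents.
    fine = sorted(coarse, key=lambda t: score(t, True), reverse=True)
    return [t["name"] for t in fine[:final_k]]


# Made-up toy corpus: the question needs both the players and clubs tables.
TABLES = [
    {"name": "players", "title": "Football players",
     "columns": ["player", "club", "goals"],
     "rows": [["Messi", "Inter Miami", 20]]},
    {"name": "clubs", "title": "Football clubs",
     "columns": ["club", "city", "stadium"],
     "rows": [["Inter Miami", "Fort Lauderdale", "Chase Stadium"]]},
    {"name": "weather", "title": "Daily weather",
     "columns": ["date", "temperature"],
     "rows": [["2024-01-01", 5]]},
]

QUESTION = "Which city hosts the club that Messi plays for?"
print(retrieve(QUESTION, TABLES))
```

In this toy setup the unrelated weather table is dropped in the first stage, so the more expensive cell-level comparison only runs on the two tables the question actually spans.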
Related papers
- Efficient Table Retrieval and Understanding with Multimodal Large Language Models [22.49099892041409]
Tabular data is frequently captured in image form across a wide range of real-world scenarios such as financial reports, handwritten records, and document scans. These visual representations pose unique challenges for machine understanding, as they combine both structural and visual complexities. We propose TabRAG, a framework that enables MLLMs to answer queries over large collections of table images.
arXiv Detail & Related papers (2026-02-07T17:50:33Z)
- CORE-T: COherent REtrieval of Tables for Text-to-SQL [91.76918495375384]
CORE-T is a scalable, training-free framework that enriches tables with purpose metadata and pre-computes a lightweight table-compatibility cache. Across Bird, Spider, and MMQA, CORE-T improves table-selection F1 by up to 22.7 points while retrieving up to 42% fewer tables.
arXiv Detail & Related papers (2026-01-19T14:51:23Z)
- REaR: Retrieve, Expand and Refine for Effective Multitable Retrieval [46.38349148493421]
REaR (Retrieve, Expand and Refine) is a three-stage framework for efficient, high-fidelity multi-table retrieval. REaR retrieves query-aligned tables, expands these with structurally joinable tables, and refines them by pruning noisy or weakly related candidates. REaR is retriever-agnostic and consistently improves dense/sparse retrievers on complex table QA datasets.
arXiv Detail & Related papers (2025-11-02T05:01:04Z)
- Improving Table Retrieval with Question Generation from Partial Tables [2.2169618382995764]
We propose QGpT, a simple yet effective method that uses an LLM to generate synthetic questions based on small portions of a table. The generated questions are then jointly embedded with the partial table segments used for generation, enhancing semantic alignment with user queries.
arXiv Detail & Related papers (2025-08-08T09:35:56Z)
- TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning [7.706148486477738]
Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. Existing RAG approaches exhibit critical limitations when applied to heterogeneous documents.
arXiv Detail & Related papers (2025-06-12T06:16:49Z)
- Plugging Schema Graph into Multi-Table QA: A Human-Guided Framework for Reducing LLM Reliance [8.304761523814564]
We propose a graph-based framework that leverages human-curated relational knowledge to explicitly encode schema links and join paths. Given a natural language query, our method searches this graph to construct interpretable reasoning chains, aided by pruning and sub-path merging strategies. Experiments on both standard benchmarks and a realistic, large-scale dataset demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2025-06-04T20:21:52Z)
- Bridging Queries and Tables through Entities in Table Retrieval [70.13748256886288]
Entities are well-studied in the context of text retrieval, but there is a noticeable lack of research on their applications in table retrieval. We propose an entity-enhanced training framework and design an interaction paradigm based on entity representations. Our proposed framework is plug-and-play and flexible, making it easy to integrate into existing table retriever training processes.
arXiv Detail & Related papers (2025-04-09T03:16:33Z)
- Reasoning-Aware Query-Focused Summarization over Multi-Table Data [1.325953054381901]
We propose QueryTableSummarizer++, an end-to-end generative framework leveraging large language models (LLMs).
Our method eliminates the need for intermediate serialization steps and directly generates query-relevant summaries.
Experiments on a benchmark dataset demonstrate that QueryTableSummarizer++ significantly outperforms state-of-the-art baselines in terms of BLEU, ROUGE, and F1-score.
arXiv Detail & Related papers (2024-12-12T06:04:31Z)
- GraphOTTER: Evolving LLM-based Graph Reasoning for Complex Table Question Answering [19.59852014700167]
Complex Table Question Answering involves providing accurate answers to specific questions based on intricate tables that exhibit complex layouts and flexible header locations. We propose GraphOTTER that explicitly establishes the reasoning process to pinpoint the correct answers. It then conducts step-by-step reasoning on the graph, with each step guided by a set of pre-defined intermediate reasoning actions.
arXiv Detail & Related papers (2024-12-02T07:49:23Z)
- TableRAG: Million-Token Table Understanding with Language Models [53.039560091592215]
TableRAG is a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. Our results demonstrate that TableRAG achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.
arXiv Detail & Related papers (2024-10-07T04:15:02Z)
- QFMTS: Generating Query-Focused Summaries over Multi-Table Inputs [63.98556480088152]
Table summarization is a crucial task aimed at condensing information into concise and comprehensible textual summaries.
We propose a novel method to address these limitations by introducing query-focused multi-table summarization.
Our approach, which comprises a table serialization module, a summarization controller, and a large language model, generates query-dependent table summaries tailored to users' information needs.
arXiv Detail & Related papers (2024-05-08T15:05:55Z)
- TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning [55.33939289989238]
We propose TAP4LLM as a versatile pre-processor suite for leveraging large language models (LLMs) in table-based tasks effectively.
It covers several distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding.
arXiv Detail & Related papers (2023-12-14T15:37:04Z)
- MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering [61.48881995121938]
Real-world queries are complex in nature, often over multiple tables in a relational database or web page.
Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers.
arXiv Detail & Related papers (2023-05-22T08:25:15Z)
- Neural Graph Reasoning: Complex Logical Query Answering Meets Graph Databases [63.96793270418793]
Complex logical query answering (CLQA) is a recently emerged task of graph machine learning.
We introduce the concept of Neural Graph Databases (NGDBs).
NGDB consists of a Neural Graph Storage and a Neural Graph Engine.
arXiv Detail & Related papers (2023-03-26T04:03:37Z)
- End-to-End Table Question Answering via Retrieval-Augmented Generation [19.89730342792824]
We introduce T-RAG, an end-to-end Table QA model, where a non-parametric dense vector index is fine-tuned jointly with BART, a parametric sequence-to-sequence model to generate answer tokens.
Given any natural language question, T-RAG utilizes a unified pipeline to automatically search through a table corpus to directly locate the correct answer from the table cells.
arXiv Detail & Related papers (2022-03-30T23:30:16Z)
- TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition [76.06530816349763]
We propose an end-to-end trainable table graph reconstruction network (TGRNet) for table structure recognition.
Specifically, the proposed method has two main branches, a cell detection branch and a cell logical location branch, to jointly predict the spatial location and the logical location of different cells.
arXiv Detail & Related papers (2021-06-20T01:57:05Z)
- Retrieving Complex Tables with Multi-Granular Graph Representation Learning [20.72341939868327]
The task of natural language table retrieval seeks to retrieve semantically relevant tables based on natural language queries.
Existing learning systems treat tables as plain text based on the assumption that tables are structured as dataframes.
We propose Graph-based Table Retrieval (GTR), a generalizable NLTR framework with multi-granular graph representation learning.
arXiv Detail & Related papers (2021-05-04T20:19:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.