HD-RAG: Retrieval-Augmented Generation for Hybrid Documents Containing Text and Hierarchical Tables
- URL: http://arxiv.org/abs/2504.09554v1
- Date: Sun, 13 Apr 2025 13:02:33 GMT
- Title: HD-RAG: Retrieval-Augmented Generation for Hybrid Documents Containing Text and Hierarchical Tables
- Authors: Chi Zhang, Qiyang Chen,
- Abstract summary: We introduce HD-RAG, a novel framework that incorporates a row-and-column level table representation. We conduct comprehensive experiments with DocRAGLib, showing that HD-RAG outperforms existing baselines in both retrieval accuracy and QA performance.
- Score: 2.915799083273604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement of large language models (LLMs), Retrieval-Augmented Generation (RAG) effectively combines LLMs' generative capabilities with external retrieval-based information. The Hybrid Document RAG task aims to integrate textual and hierarchical tabular data for more comprehensive retrieval and generation in complex scenarios. However, there is no existing dataset specifically designed for this task that includes both text and tabular data. Additionally, existing methods struggle to retrieve relevant tabular data and integrate it with text. Semantic similarity-based retrieval lacks accuracy, while table-specific methods fail to handle complex hierarchical structures effectively. Furthermore, the QA task requires complex reasoning and calculations, further complicating the challenge. In this paper, we propose a new large-scale dataset, DocRAGLib, specifically designed for the question answering (QA) task scenario under Hybrid Document RAG. To tackle these challenges, we introduce HD-RAG, a novel framework that incorporates a row-and-column level (RCL) table representation, employs a two-stage process combining ensemble and LLM-based retrieval, and integrates RECAP, which is designed for multi-step reasoning and complex calculations in Document-QA tasks. We conduct comprehensive experiments with DocRAGLib, showing that HD-RAG outperforms existing baselines in both retrieval accuracy and QA performance, demonstrating its effectiveness.
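The abstract names a row-and-column level (RCL) table representation but does not spell out its format here. As a rough illustration only, the sketch below shows one way a table with hierarchical column headers could be turned into one text string per row and one per column, so that each can be indexed separately for retrieval. The toy table, the `header_path`, `row_level`, and `column_level` helpers, and the string layout are all assumptions for illustration, not the authors' implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical toy table (not from the paper): each column header is a
# (parent, child) path, and each row maps a row header to its cell values.
COLUMNS: List[Tuple[str, str]] = [
    ("Revenue", "2023"),
    ("Revenue", "2024"),
    ("Cost", "2023"),
    ("Cost", "2024"),
]
ROWS: Dict[str, List[float]] = {
    "North": [10.0, 12.0, 4.0, 5.0],
    "South": [8.0, 9.0, 3.0, 3.5],
}


def header_path(column: Tuple[str, ...]) -> str:
    """Join a hierarchical header into a single path, e.g. 'Revenue > 2023'."""
    return " > ".join(column)


def row_level(columns: List[Tuple[str, str]], rows: Dict[str, List[float]]) -> List[str]:
    """Emit one string per row, pairing every cell with its full header path."""
    out = []
    for name, values in rows.items():
        cells = "; ".join(
            f"{header_path(col)} = {val}" for col, val in zip(columns, values)
        )
        out.append(f"row {name}: {cells}")
    return out


def column_level(columns: List[Tuple[str, str]], rows: Dict[str, List[float]]) -> List[str]:
    """Emit one string per column, pairing every cell with its row header."""
    out = []
    for i, col in enumerate(columns):
        cells = "; ".join(f"{name} = {values[i]}" for name, values in rows.items())
        out.append(f"column {header_path(col)}: {cells}")
    return out


if __name__ == "__main__":
    for line in row_level(COLUMNS, ROWS) + column_level(COLUMNS, ROWS):
        print(line)
```

Keeping the full header path in every string means a row or column retrieved on its own still carries the hierarchy it came from, which is the property a flat cell dump would lose.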
Related papers
- Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers.
We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers.
Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z) - Generative Retrieval for Book search [106.67655212825025]
We propose an effective Generative retrieval framework for Book Search. It features two main components: data augmentation and outline-oriented book encoding. Experiments on a proprietary Baidu dataset demonstrate that GBS outperforms strong baselines.
arXiv Detail & Related papers (2025-01-19T12:57:13Z) - KG-Retriever: Efficient Knowledge Indexing for Retrieval-Augmented Large Language Models [38.93603907879804]
We introduce a novel Knowledge Graph-based RAG framework with a hierarchical knowledge retriever, termed KG-Retriever.
The associative nature of graph structures is fully utilized to strengthen intra-document and inter-document connectivity.
With the coarse-grained collaborative information from neighboring documents and concise information from the knowledge graph, KG-Retriever achieves marked improvements on five public QA datasets.
arXiv Detail & Related papers (2024-12-07T05:49:14Z) - ConTReGen: Context-driven Tree-structured Retrieval for Open-domain Long-form Text Generation [26.4086456393314]
Long-form text generation requires coherent, comprehensive responses that address complex queries with both breadth and depth.
Existing iterative retrieval-augmented generation approaches often struggle to delve deeply into each facet of complex queries.
This paper introduces ConTReGen, a novel framework that employs a context-driven, tree-structured retrieval approach.
arXiv Detail & Related papers (2024-10-20T21:17:05Z) - DR-RAG: Applying Dynamic Document Relevance to Retrieval-Augmented Generation for Question-Answering [4.364937306005719]
RAG has recently demonstrated the performance of Large Language Models (LLMs) in knowledge-intensive tasks such as Question-Answering (QA).
We have found that even though there is low relevance between some critical documents and the query, it is possible to retrieve the remaining documents by combining parts of the documents with the query.
A two-stage retrieval framework called Dynamic-Relevant Retrieval-Augmented Generation (DR-RAG) is proposed to improve document retrieval recall and the accuracy of answers.
arXiv Detail & Related papers (2024-06-11T15:15:33Z) - Multi-Head RAG: Solving Multi-Aspect Problems with LLMs [13.638439488923671]
Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs).
Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents.
This paper introduces Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a simple yet powerful idea.
arXiv Detail & Related papers (2024-06-07T16:59:38Z) - TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
TACT contains challenging instructions that demand stitching information scattered across one or more texts.
We construct this dataset by leveraging an existing dataset of texts and their associated tables.
We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
arXiv Detail & Related papers (2024-06-05T20:32:56Z) - Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity [59.57065228857247]
Retrieval-augmented Large Language Models (LLMs) have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA).
We propose a novel adaptive QA framework that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs based on the query complexity.
We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems.
arXiv Detail & Related papers (2024-03-21T13:52:30Z) - Beyond Extraction: Contextualising Tabular Data for Efficient Summarisation by Language Models [0.0]
The conventional use of the Retrieval-Augmented Generation architecture has proven effective for retrieving information from diverse documents.
This research introduces an innovative approach to enhance the accuracy of complex table queries in RAG-based systems.
arXiv Detail & Related papers (2024-01-04T16:16:14Z) - Decomposing Complex Queries for Tip-of-the-tongue Retrieval [72.07449449115167]
Complex queries describe content elements (e.g., book characters or events), information beyond the document text.
This retrieval setting, called tip of the tongue (TOT), is especially challenging for models reliant on lexical and semantic overlap between query and document text.
We introduce a simple yet effective framework for handling such complex queries by decomposing the query into individual clues, routing those as sub-queries to specialized retrievers, and ensembling the results.
arXiv Detail & Related papers (2023-05-24T11:43:40Z) - Mixed-modality Representation Learning and Pre-training for Joint Table-and-Text Retrieval in OpenQA [85.17249272519626]
An optimized OpenQA Table-Text Retriever (OTTeR) is proposed.
We conduct retrieval-centric mixed-modality synthetic pre-training.
OTTeR substantially improves the performance of table-and-text retrieval on the OTT-QA dataset.
arXiv Detail & Related papers (2022-10-11T07:04:39Z)