BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression
- URL: http://arxiv.org/abs/2410.15277v2
- Date: Sat, 15 Feb 2025 21:41:43 GMT
- Title: BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression
- Authors: Yuankai Li, Jia-Chen Gu, Di Wu, Kai-Wei Chang, Nanyun Peng,
- Abstract summary: Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge.
This paper presents BRIEF, a lightweight approach that performs query-aware multi-hop reasoning.
Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries.
- Score: 91.23933111083389
- License:
- Abstract: Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. However, as the number of retrieved documents increases, the input length to LLMs grows linearly, causing a dramatic increase in latency and a degradation in long-context understanding. This is particularly serious for multi-hop questions that require a chain of reasoning across documents. To accelerate inference, reduce costs, and minimize distractions, this paper presents BRIEF (Bridging Retrieval and Inference through Evidence Fusion), a lightweight approach that performs query-aware multi-hop reasoning by compressing retrieved documents into highly dense textual summaries to integrate into in-context RAG. To enable learning compression for multi-hop reasoning, we curate synthetic data by extracting atomic propositions that encapsulate distinct factoids from the source documents to compose synthetic summaries. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries and enables a range of LLMs to achieve exceptional open-domain question answering (QA) performance. For example, on HotpotQA, BRIEF improves the compression rate by 2 times compared to the state-of-the-art baseline, while outperforming it by 3.00% EM and 4.16% F1 with Flan-UL2 as the reader model. It also generates more concise summaries than proprietary GPT-3.5, while demonstrating nearly identical QA performance.
Related papers
- Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework.
This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings.
Our method has achieved state-of-the-art performance on two common datasets.
arXiv Detail & Related papers (2024-12-24T16:38:04Z) - RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation [21.764973680014368]
RetroLLM is a unified framework that integrates retrieval and generation into a single, cohesive process.
To mitigate false pruning in the process of constrained evidence generation, we introduce hierarchical FM-Index constraints.
Experiments on five open-domain QA datasets demonstrate RetroLLM's superior performance across both in-domain and out-of-domain tasks.
arXiv Detail & Related papers (2024-12-16T16:03:25Z) - Two are better than one: Context window extension with multi-grained self-injection [111.1376461868317]
SharedLLM is a novel approach grounded in the design philosophy of multi-grained context compression and query-aware information retrieval.
We introduce a specialized tree-style data structure to efficiently encode, store and retrieve multi-grained contextual information for text chunks.
arXiv Detail & Related papers (2024-10-25T06:08:59Z) - Efficient Document Ranking with Learnable Late Interactions [73.41976017860006]
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval.
To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings.
Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer.
arXiv Detail & Related papers (2024-06-25T22:50:48Z) - Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection [28.15184715270483]
Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility.
We propose a novel paradigm named Sparse RAG, which seeks to cut costs through sparsity.
Sparse RAG encodes retrieved documents in parallel, which eliminates latency introduced by long-range attention of retrieved documents.
arXiv Detail & Related papers (2024-05-25T11:10:04Z) - Context-aware Decoding Reduces Hallucination in Query-focused
Summarization [2.8554857235549753]
We conduct a large-scale study on one recently proposed decoding method -- Context-aware Decoding (CAD)
Experiments with eight different language models show that performance-wise, CAD improves QFS quality by reducing factuality errors/hallucinations.
The code implementation based on Huggingface Library is made available.
arXiv Detail & Related papers (2023-12-21T23:42:13Z) - Extending Context Window of Large Language Models via Semantic
Compression [21.35020344956721]
Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses.
We propose a novel semantic compression method that enables generalization to texts 6-8 times longer, without incurring significant computational costs or requiring fine-tuning.
arXiv Detail & Related papers (2023-12-15T07:04:33Z) - Self-prompted Chain-of-Thought on Large Language Models for Open-domain
Multi-hop Reasoning [70.74928578278957]
In open-domain question-answering (ODQA), most existing questions require single-hop reasoning on commonsense.
Large language models (LLMs) have found significant utility in facilitating ODQA without external corpus.
We propose Self-prompted Chain-of-Thought (SP-CoT), an automated framework to mass-produce high quality CoTs.
arXiv Detail & Related papers (2023-10-20T14:51:10Z) - RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective
Augmentation [61.53695868960846]
We propose compressing retrieved documents into textual summaries prior to in-context integration.
This not only reduces the computational costs but also relieves the burden of LMs to identify relevant information in long retrieved documents.
We show that our compressors trained for one LM can transfer to other LMs on the language modeling task and provide summaries largely faithful to the retrieved documents.
arXiv Detail & Related papers (2023-10-06T17:55:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.