GeAR: Generation Augmented Retrieval
- URL: http://arxiv.org/abs/2501.02772v2
- Date: Fri, 30 May 2025 15:15:19 GMT
- Title: GeAR: Generation Augmented Retrieval
- Authors: Haoyu Liu, Shaohan Huang, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Furu Wei, Qi Zhang
- Abstract summary: This paper introduces a novel method, $\textbf{Ge}$neration $\textbf{A}$ugmented $\textbf{R}$etrieval ($\textbf{GeAR}$). It not only improves the global document-query similarity through contrastive learning, but also integrates well-designed fusion and decoding modules. When used as a retriever, GeAR does not incur any additional computational cost over bi-encoders.
- Score: 82.20696567697016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document retrieval techniques are essential for developing large-scale information systems. The common approach involves using a bi-encoder to compute the semantic similarity between a query and documents. However, the scalar similarity often fails to reflect enough information, hindering the interpretation of retrieval results. In addition, this process primarily focuses on global semantics, overlooking the finer-grained semantic relationships between the query and the document's content. In this paper, we introduce a novel method, $\textbf{Ge}$neration $\textbf{A}$ugmented $\textbf{R}$etrieval ($\textbf{GeAR}$), which not only improves the global document-query similarity through contrastive learning, but also integrates well-designed fusion and decoding modules. This enables GeAR to generate relevant context within the documents based on a given query, facilitating learning to retrieve local fine-grained information. Furthermore, when used as a retriever, GeAR does not incur any additional computational cost over bi-encoders. GeAR exhibits competitive retrieval performance across diverse scenarios and tasks. Moreover, qualitative analysis and the results generated by GeAR provide novel insights into the interpretation of retrieval results. The code, data, and models will be released at \href{https://github.com/microsoft/LMOps}{https://github.com/microsoft/LMOps}.
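To make the retrieval side of the abstract concrete, here is a minimal sketch of bi-encoder scoring with an in-batch contrastive (InfoNCE-style) objective, the component GeAR shares with standard dense retrievers. The `encode` function below is a toy stand-in that produces deterministic pseudo-embeddings, not the paper's model, and the fusion and decoding modules are omitted entirely:

```python
import numpy as np

def encode(texts, dim=8, seed=0):
    # Toy stand-in encoder: maps each distinct text to a deterministic
    # pseudo-embedding, then L2-normalizes. A real system would use a
    # trained language-model encoder here.
    rng = np.random.default_rng(seed)
    vecs = {t: rng.standard_normal(dim) for t in sorted(set(texts))}
    out = np.stack([vecs[t] for t in texts])
    return out / np.linalg.norm(out, axis=1, keepdims=True)

def contrastive_loss(q, d, temperature=0.05):
    # InfoNCE with in-batch negatives: the i-th query's positive document
    # is the i-th document; all other documents in the batch are negatives.
    logits = q @ d.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

queries = ["what is dense retrieval", "define contrastive learning"]
docs = ["dense retrieval embeds text", "contrastive learning pulls pairs together"]
q, d = encode(queries), encode(docs)
scores = q @ d.T              # scalar query-document similarities
loss = contrastive_loss(q, d)
```

At inference time only the `q @ d.T` dot products are needed, which is why a generation-trained model like GeAR can still score documents at plain bi-encoder cost.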
Related papers
- Logical Consistency is Vital: Neural-Symbolic Information Retrieval for Negative-Constraint Queries [36.93438185371322]
Current dense retrievers retrieve the relevant documents within a corpus via embedding similarities. We propose a neuro-symbolic information retrieval method, namely $\textbf{NS-IR}$, to optimize the embeddings of naive natural language. Our experiments demonstrate that NS-IR achieves superior zero-shot retrieval performance on web search and low-resource retrieval tasks.
arXiv Detail & Related papers (2025-05-28T12:37:09Z) - Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents [17.506934704019226]
Standardized documents share similar formats, such as repetitive boilerplate texts and similar table structures. This similarity forces traditional RAG methods to misidentify near-duplicate text, leading to duplicate retrieval that undermines accuracy and completeness. We propose the Hierarchical Retrieval with Evidence Curation framework to address these issues.
arXiv Detail & Related papers (2025-05-26T11:08:23Z) - What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond [32.467437657603604]
Repository-level code generation remains challenging due to complex code dependencies and the limitations of large language models (LLMs) in processing long contexts.
We propose AllianceCoder, a novel context-integrated method that employs chain-of-thought prompting to decompose user queries into implementation steps and retrieves APIs via semantic description matching.
Through extensive experiments on CoderEval and RepoExec, AllianceCoder achieves state-of-the-art performance, improving Pass@1 by up to 20% over existing approaches.
arXiv Detail & Related papers (2025-03-26T14:41:38Z) - Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval [49.42043077545341]
We propose a knowledge-aware query expansion framework, augmenting LLMs with structured document relations from a knowledge graph (KG).
We leverage document texts as rich KG node representations and use document-based relation filtering for our Knowledge-Aware Retrieval (KAR).
arXiv Detail & Related papers (2024-10-17T17:03:23Z) - QAEA-DR: A Unified Text Augmentation Framework for Dense Retrieval [12.225881591629815]
In dense retrieval, embedding long texts into dense vectors can result in information loss, leading to inaccurate query-text matching.
Recent studies mainly focus on improving the sentence embedding model or retrieval process.
We introduce a novel text augmentation framework for dense retrieval, which transforms raw documents into information-dense text formats.
arXiv Detail & Related papers (2024-07-29T17:39:08Z) - $\texttt{MixGR}$: Enhancing Retriever Generalization for Scientific Domain through Complementary Granularity [88.78750571970232]
This paper introduces $\texttt{MixGR}$, which improves dense retrievers' awareness of query-document matching.
$\texttt{MixGR}$ fuses various metrics based on granularities into a unified score that reflects a comprehensive query-document similarity.
arXiv Detail & Related papers (2024-07-15T13:04:09Z) - ReFusion: Improving Natural Language Understanding with Computation-Efficient Retrieval Representation Fusion [22.164620956284466]
Retrieval-based augmentations (RA) incorporating knowledge from an external database into language models have greatly succeeded in various knowledge-intensive (KI) tasks.
Existing works focus on concatenating retrievals with inputs to improve model performance.
This paper proposes a new paradigm of RA named $\textbf{ReFusion}$, a computation-efficient Retrieval representation Fusion with bi-level optimization.
arXiv Detail & Related papers (2024-01-04T07:39:26Z) - $\text{EFO}_{k}$-CQA: Towards Knowledge Graph Complex Query Answering
beyond Set Operation [36.77373013615789]
We propose a framework for data generation, model training, and method evaluation.
We construct a dataset, $\text{EFO}_k$-CQA, with 741 types of queries for empirical evaluation.
arXiv Detail & Related papers (2023-07-15T13:18:20Z) - Referral Augmentation for Zero-Shot Information Retrieval [30.811093210831018]
Referral-Augmented Retrieval (RAR) is a simple technique that links document indices with referrals.
RAR works with both sparse and dense retrievers, and outperforms generative text expansion techniques.
We analyze different methods for multi-referral aggregation and show that RAR enables up-to-date information retrieval without re-training.
arXiv Detail & Related papers (2023-05-24T12:28:35Z) - CAPSTONE: Curriculum Sampling for Dense Retrieval with Document
Expansion [68.19934563919192]
We propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query.
Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
arXiv Detail & Related papers (2022-12-18T15:57:46Z) - UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z) - Autoregressive Search Engines: Generating Substrings as Document
Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all n-grams in a passage as its possible identifiers.
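As a toy illustration of n-grams serving as document identifiers, the sketch below builds a simple exact-match index from every n-gram of each passage back to the passages containing it, so that a generated n-gram acts as a pointer into the corpus. How the autoregressive model's decoding is constrained to valid n-grams is not described in this summary and is omitted here; all passage texts are invented examples:

```python
def ngrams(tokens, n):
    # All contiguous n-grams of a token sequence, as space-joined strings.
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def index_passages(passages, max_n=3):
    # Map every n-gram (n <= max_n) to the set of passage ids containing it.
    index = {}
    for pid, text in passages.items():
        tokens = text.split()
        for n in range(1, max_n + 1):
            for g in ngrams(tokens, n):
                index.setdefault(g, set()).add(pid)
    return index

passages = {
    "p1": "dense retrieval embeds queries and documents",
    "p2": "autoregressive models generate identifiers",
}
index = index_passages(passages)
# A generated n-gram resolves directly to matching passages:
hits = index.get("generate identifiers", set())
```

Because any substring can serve as an identifier, no hierarchical partition of the search space is imposed, which is the contrast the paper draws with prior structured-identifier approaches.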
arXiv Detail & Related papers (2022-04-22T10:45:01Z) - A Proposed Conceptual Framework for a Representational Approach to
Information Retrieval [42.67826268399347]
This paper outlines a conceptual framework for understanding recent developments in information retrieval and natural language processing.
I propose a representational approach that breaks the core text retrieval problem into a logical scoring model and a physical retrieval model.
arXiv Detail & Related papers (2021-10-04T15:57:02Z) - Improving Query Representations for Dense Retrieval with Pseudo
Relevance Feedback [29.719150565643965]
This paper proposes ANCE-PRF, a new query encoder that uses pseudo relevance feedback (PRF) to improve query representations for dense retrieval.
ANCE-PRF uses a BERT encoder that consumes the query and the top retrieved documents from a dense retrieval model, ANCE, and it learns to produce better query embeddings directly from relevance labels.
Analysis shows that the PRF encoder effectively captures the relevant and complementary information from PRF documents, while ignoring the noise with its learned attention mechanism.
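A classic Rocchio-style vector mixing gives a rough intuition for pseudo relevance feedback in embedding space. Note the contrast with the paper: ANCE-PRF *learns* this combination with a BERT encoder trained on relevance labels, whereas the function below uses a hand-set interpolation weight, and all vectors here are random illustrative stand-ins:

```python
import numpy as np

def prf_query(query_vec, doc_vecs, k=3, alpha=0.7):
    # Rocchio-style feedback: interpolate the original query embedding
    # with the centroid of the top-k retrieved document embeddings,
    # then re-normalize. alpha controls how much the query dominates.
    centroid = doc_vecs[:k].mean(axis=0)
    v = alpha * query_vec + (1 - alpha) * centroid
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
query_vec = rng.standard_normal(16)        # stand-in query embedding
retrieved = rng.standard_normal((5, 16))   # stand-in top-5 doc embeddings
new_query = prf_query(query_vec, retrieved)
```

The learned encoder in ANCE-PRF can be read as replacing the fixed `alpha` mixing with attention that weights useful feedback tokens and suppresses noisy ones.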
arXiv Detail & Related papers (2021-08-30T18:10:26Z) - Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets with the unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z) - Generation-Augmented Retrieval for Open-domain Question Answering [134.27768711201202]
We propose Generation-Augmented Retrieval (GAR) for answering open-domain questions.
We show that generating diverse contexts for a query is beneficial as fusing their results consistently yields better retrieval accuracy.
GAR achieves state-of-the-art performance on Natural Questions and TriviaQA datasets under the extractive QA setup when equipped with an extractive reader.
arXiv Detail & Related papers (2020-09-17T23:08:01Z)
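The GAR summary's point about fusing results from diverse generated contexts can be sketched with reciprocal rank fusion over several ranked lists, one per generated context. RRF is an illustrative fusion choice here, not necessarily the scheme the GAR paper uses, and the document ids are invented:

```python
def rrf_fuse(rankings, k=60):
    # Reciprocal rank fusion: each document accumulates 1 / (k + rank)
    # across the ranked lists it appears in; higher total score wins.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three rankings retrieved with three generated contexts for one question:
runs = [["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d2", "d3", "d1"]]
fused = rrf_fuse(runs)
```

Documents that rank highly under several generated contexts ("d2" above) rise to the top, which matches the summary's observation that fusing diverse contexts consistently improves retrieval accuracy.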
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.