Related papers: Precise Zero-Shot Dense Retrieval without Relevance Labels

Precise Zero-Shot Dense Retrieval without Relevance Labels

URL: http://arxiv.org/abs/2212.10496v1
Date: Tue, 20 Dec 2022 18:09:52 GMT
Title: Precise Zero-Shot Dense Retrieval without Relevance Labels
Authors: Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan
Abstract summary: Hypothetical Document Embeddings (HyDE) is a zero-shot dense retrieval system. We show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever.
Score: 60.457378374671656
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While dense retrieval has been shown effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details. Then, an unsupervised contrastively learned encoder (e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, where similar real documents are retrieved based on vector similarity. This second step grounds the generated document to the actual corpus, with the encoder's dense bottleneck filtering out the incorrect details. Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers, across various tasks (e.g. web search, QA, fact verification) and languages (e.g. sw, ko, ja).
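
The two-step pipeline described in the abstract (generate a hypothetical document, then embed it and search the corpus by vector similarity) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_hypothetical_document` is a hypothetical stub standing in for an instruction-following language model such as InstructGPT, and `encode` is a toy bag-of-words encoder standing in for a contrastively trained dense encoder such as Contriever. The corpus and the canned generation are invented for the example.

```python
import math
import re
from collections import Counter


def generate_hypothetical_document(query: str) -> str:
    """Hypothetical stub for the instruction-following LM.

    A real system would prompt a language model to write a passage that
    answers the query. The canned text deliberately includes a made-up
    detail, since generated documents may contain false details.
    """
    canned = {
        "how do vaccines work": (
            "Vaccines train the immune system by exposing it to a harmless "
            "antigen so that antibodies form. Vaccines were invented in 1650."
        ),
    }
    # Fall back to the raw query if no canned generation is available.
    return canned.get(query.lower().strip("?"), query)


def encode(text: str) -> Counter:
    """Toy stand-in for a dense encoder: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Step 1: generate a hypothetical document for the query.
    Step 2: embed it and return the k most similar real documents."""
    hypothetical = generate_hypothetical_document(query)
    query_vector = encode(hypothetical)
    ranked = sorted(
        corpus, key=lambda doc: cosine(query_vector, encode(doc)), reverse=True
    )
    return ranked[:k]


corpus = [
    "A vaccine exposes the immune system to a harmless antigen, prompting antibodies.",
    "Dense retrieval maps queries and documents into a shared vector space.",
    "The stock market closed higher on Tuesday.",
]

top_hit = hyde_search("How do vaccines work?", corpus)[0]
print(top_hit)
```

Only real corpus documents are ever returned, so the invented detail in the generated passage never reaches the user. The toy encoder does not reproduce the filtering effect the abstract attributes to the encoder's dense bottleneck; it only shows where the generated text enters the retrieval path.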

Related papers

Hierarchical corpus encoder: Fusing generative retrieval and dense indices [39.56098961341313]
Generative retrieval employs sequence models for conditional generation of document IDs based on a query. This has led to improved performance in zero-shot retrieval, but it is a challenge to support documents not seen during training. We propose a hierarchical corpus encoder (HCE) which can be supported by traditional dense encoders.
arXiv Detail & Related papers (2025-02-26T06:43:09Z)
Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search [65.53881294642451]
The Deliberate Thinking based Dense Retriever (DEBATER) enhances recent dense retrievers by enabling them to learn more effective document representations through a step-by-step thinking process. Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z)
Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings. First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss. Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
QAEncoder: Towards Aligned Representation Learning in Question Answering Systems [25.283922985211397]
QAEncoder is a training-free approach to bridge the gap between user queries and documents. It estimates the expectation of potential queries in the embedding space as a robust surrogate for the document embedding, and attaches document fingerprints to distinguish these embeddings. It offers a simple-yet-effective solution with zero additional index storage, retrieval latency, training costs, or catastrophic forgetting and hallucination issues.
arXiv Detail & Related papers (2024-09-30T15:53:38Z)
SparseCL: Sparse Contrastive Learning for Contradiction Retrieval [87.02936971689817]
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query. Existing methods such as similarity search and cross-encoder models exhibit significant limitations. We introduce SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences.
arXiv Detail & Related papers (2024-06-15T21:57:03Z)
Optimizing Factual Accuracy in Text Generation through Dynamic Knowledge Selection [71.20871905457174]
Language models (LMs) have revolutionized the way we interact with information, but they often generate nonfactual text. Previous methods use external knowledge as references for text generation to enhance factuality but often struggle with the knowledge mix-up of irrelevant references. We present DKGen, which divides the text generation process into an iterative process.
arXiv Detail & Related papers (2023-08-30T02:22:40Z)
DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR). While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context. Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
Noise-Robust Dense Retrieval via Contrastive Alignment Post Training [89.29256833403167]
Contrastive Alignment POst Training (CAPOT) is a highly efficient finetuning method that improves model robustness without requiring index regeneration. CAPOT enables robust retrieval by freezing the document encoder while the query encoder learns to align noisy queries with their unaltered root. We evaluate CAPOT on noisy variants of MSMARCO, Natural Questions, and Trivia QA passage retrieval, finding CAPOT has a similar impact as data augmentation with none of its overhead.
arXiv Detail & Related papers (2023-04-06T22:16:53Z)
Multi-Vector Retrieval as Sparse Alignment [21.892007741798853]
We propose a novel multi-vector retrieval model that learns sparsified pairwise alignments between query and document tokens. We learn the sparse unary saliences with entropy-regularized linear programming, which outperforms other methods to achieve sparsity. Our model often produces interpretable alignments and significantly improves its performance when initialized from larger language models.
arXiv Detail & Related papers (2022-11-02T16:49:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.