Related papers: Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval

URL: http://arxiv.org/abs/2105.03599v1
Date: Sat, 8 May 2021 05:28:24 GMT
Title: Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval
Authors: Hongyin Tang, Xingwu Sun, Beihong Jin, Jingang Wang, Fuzheng Zhang, Wei Wu
Abstract summary: We design a method to mimic the queries on each of the documents by an iterative clustering process. We also optimize the matching function with a two-step score calculation procedure. Experimental results on several popular ranking and QA datasets show that our model can achieve state-of-the-art results.
Score: 11.465218502487959
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, retrieval models based on dense representations have gradually been applied in the first stage of document retrieval, showing better performance than traditional sparse vector space models. For efficiency, most of these models use a Bi-encoder structure. However, this simple structure may cause serious information loss when encoding documents, since the documents are encoded without knowledge of the queries. To address this problem, we design a method to mimic the queries on each of the documents by an iterative clustering process and represent the documents by multiple pseudo queries (i.e., the cluster centroids). To boost the retrieval process using an approximate nearest neighbor search library, we also optimize the matching function with a two-step score calculation procedure. Experimental results on several popular ranking and QA datasets show that our model can achieve state-of-the-art results.
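
Illustrative sketch (not the authors' code): the abstract describes two ideas, clustering a document's token embeddings so that the centroids serve as pseudo query embeddings, and scoring in two steps so that an approximate nearest neighbor index can handle the first step. The snippet below is one plausible reading of that description. The function names, the use of plain k-means, the max-then-softmax scoring, and the `shortlist` parameter are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def kmeans_centroids(token_vecs, k, iters=10, seed=0):
    """Cluster token embeddings; the k centroids stand in for pseudo queries.

    Assumption: plain k-means with squared Euclidean distance.
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(token_vecs, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in token_vecs:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centroids[i])),
            )
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def coarse_score(query, centroids):
    """Step 1 (assumed): best single centroid match, which an ANN index can serve."""
    return max(dot(query, c) for c in centroids)


def fine_score(query, centroids):
    """Step 2 (assumed): softmax-weighted pooling of centroids, then a dot product."""
    logits = [dot(query, c) for c in centroids]
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    pooled = [
        sum(w / total * c[d] for w, c in zip(weights, centroids))
        for d in range(len(query))
    ]
    return dot(query, pooled)


def retrieve(query, doc_centroids, shortlist=2):
    """Shortlist documents by the coarse score, then rerank them by the fine score."""
    candidates = sorted(
        doc_centroids,
        key=lambda d: coarse_score(query, doc_centroids[d]),
        reverse=True,
    )[:shortlist]
    return sorted(
        candidates,
        key=lambda d: fine_score(query, doc_centroids[d]),
        reverse=True,
    )
```

In a real system the coarse step would be a lookup in an ANN index over all centroids rather than a Python loop, and the token vectors would come from the document encoder. Both are replaced by toy stand-ins here.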

Related papers

Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search [65.53881294642451]
Deliberate Thinking based Dense Retriever (DEBATER) enhances recent dense retrievers by enabling them to learn more effective document representations through a step-by-step thinking process. Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z)
Quam: Adaptive Retrieval through Query Affinity Modelling [15.3583908068962]
Building relevance models to rank documents based on user information needs is a central task in the information retrieval and NLP communities. We propose Quam, a unifying view of the nascent area of adaptive retrieval. Quam improves recall performance by up to 26% over the standard re-ranking baselines.
arXiv Detail & Related papers (2024-10-26T22:52:12Z)
CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion [68.19934563919192]
We propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query. Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
arXiv Detail & Related papers (2022-12-18T15:57:46Z)
Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions. Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers. Previous work has explored ways to partition the search space into hierarchical structures. In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
Multi-View Document Representation Learning for Open-Domain Dense Retrieval [87.11836738011007]
This paper proposes a multi-view document representation learning framework. It aims to produce multi-view embeddings to represent documents and enforce them to align with different queries. Experiments show our method outperforms recent works and achieves state-of-the-art results.
arXiv Detail & Related papers (2022-03-16T03:36:38Z)
Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation [49.940525611640346]
The Document Augmentation for dense Retrieval (DAR) framework augments the representations of documents with their interpolation and perturbation. We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both labeled and unlabeled documents.
arXiv Detail & Related papers (2022-03-15T09:07:38Z)
CODER: An efficient framework for improving retrieval through COntextualized Document Embedding Reranking [11.635294568328625]
We present a framework for improving the performance of a wide class of retrieval models at minimal computational cost. It utilizes precomputed document representations extracted by a base dense retrieval method. It incurs a negligible computational overhead on top of any first-stage method at run time, allowing it to be easily combined with any state-of-the-art dense retrieval method.
arXiv Detail & Related papers (2021-12-16T10:25:26Z)
Value Retrieval with Arbitrary Queries for Form-like Documents [50.5532781148902]
We propose value retrieval with arbitrary queries for form-like documents. Our method predicts the target value for an arbitrary query based on an understanding of the layout and semantics of a form. We propose a simple document language modeling (simpleDLM) strategy to improve document understanding in large-scale model pre-training.
arXiv Detail & Related papers (2021-12-15T01:12:02Z)
Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback [29.719150565643965]
This paper proposes ANCE-PRF, a new query encoder that uses pseudo relevance feedback (PRF) to improve query representations for dense retrieval. ANCE-PRF uses a BERT encoder that consumes the query and the top retrieved documents from a dense retrieval model, ANCE, and it learns to produce better query embeddings directly from relevance labels. Analysis shows that the PRF encoder effectively captures the relevant and complementary information from PRF documents, while ignoring the noise with its learned attention mechanism.
arXiv Detail & Related papers (2021-08-30T18:10:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.