Multi-Vector Retrieval as Sparse Alignment
- URL: http://arxiv.org/abs/2211.01267v1
- Date: Wed, 2 Nov 2022 16:49:58 GMT
- Title: Multi-Vector Retrieval as Sparse Alignment
- Authors: Yujie Qian, Jinhyuk Lee, Sai Meher Karthik Duddu, Zhuyun Dai,
Siddhartha Brahma, Iftekhar Naim, Tao Lei, Vincent Y. Zhao
- Abstract summary: We propose a novel multi-vector retrieval model that learns sparsified pairwise alignments between query and document tokens.
We learn the sparse unary saliences with entropy-regularized linear programming, which outperforms other methods for achieving sparsity.
Our model often produces interpretable alignments and significantly improves its performance when initialized from larger language models.
- Score: 21.892007741798853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-vector retrieval models improve over single-vector dual encoders on
many information retrieval tasks. In this paper, we cast the multi-vector
retrieval problem as sparse alignment between query and document tokens. We
propose AligneR, a novel multi-vector retrieval model that learns sparsified
pairwise alignments between query and document tokens (e.g. 'dog' vs. 'puppy')
and per-token unary saliences reflecting their relative importance for
retrieval. We show that controlling the sparsity of pairwise token alignments
often brings significant performance gains. While most factoid questions
focusing on a specific part of a document require a smaller number of
alignments, others requiring a broader understanding of a document favor a
larger number of alignments. Unary saliences, on the other hand, decide whether
a token ever needs to be aligned with others for retrieval (e.g. 'kind' from
'kind of currency is used in new zealand'). With sparsified unary saliences,
we are able to prune a large number of query and document token vectors and
improve the efficiency of multi-vector retrieval. We learn the sparse unary
saliences with entropy-regularized linear programming, which outperforms other
methods for achieving sparsity. In a zero-shot setting, AligneR scores 51.1 points
nDCG@10, achieving a new retriever-only state-of-the-art on 13 tasks in the
BEIR benchmark. In addition, adapting pairwise alignments with a few examples
(<= 8) further improves performance by up to 15.7 points nDCG@10 for argument
retrieval tasks. The unary saliences of AligneR help us keep only 20% of
the document token representations with minimal performance loss. We further
show that our model often produces interpretable alignments and significantly
improves its performance when initialized from larger language models.
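To ground the abstract's scoring model, here is a minimal sketch of sparse-alignment scoring, assuming a hard top-k per query token stands in for the learned pairwise alignment and the unary saliences enter multiplicatively; names and weighting are illustrative, not AligneR's actual implementation. With k=1 and all-ones saliences this reduces to a ColBERT-style max-sim score.

```python
import numpy as np

def sparse_alignment_score(Q, D, u_q, u_d, k=1):
    """Score a (query, document) pair with sparsified token alignments.

    Q: (n, d) query token embeddings      u_q: (n,) query token saliences in [0, 1]
    D: (m, d) document token embeddings   u_d: (m,) document token saliences in [0, 1]
    k: how many document tokens each query token may align to; k=1 recovers a
       ColBERT-style max-sim, larger k suits questions needing broader evidence.
    """
    S = Q @ D.T                              # (n, m) pairwise token similarities
    S = S * u_q[:, None] * u_d[None, :]      # damp low-salience tokens on both sides
    topk = np.sort(S, axis=1)[:, -k:]        # keep only the k best alignments per query token
    return float(topk.sum())

def prune_by_salience(D, u_d, keep_ratio=0.2):
    """Drop low-salience document tokens before indexing (cf. the 20% setting above)."""
    n_keep = max(1, int(len(u_d) * keep_ratio))
    idx = np.argsort(u_d)[-n_keep:]
    return D[idx], u_d[idx]
```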
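The entropy-regularized linear program for learning sparse saliences is not spelled out here; the sketch below shows one standard relaxation of top-k selection that such a program can take (an assumption for intuition, not necessarily the paper's exact formulation). The KKT conditions of the regularized LP yield a thresholded sigmoid, and bisection on the threshold enforces the budget.

```python
import numpy as np

def entropic_topk(scores, k, eps=0.1, iters=50):
    """Soft top-k via an entropy-regularized LP relaxation.

    Solves   max_u <u, scores> + eps * H(u)   s.t.  0 <= u_i <= 1,  sum(u) = k,
    with H the elementwise binary entropy. Stationarity gives
    u_i = sigmoid((scores_i - tau) / eps); bisection on tau enforces sum(u) = k.
    """
    lo, hi = scores.min() - 10 * eps, scores.max() + 10 * eps
    for _ in range(iters):                       # bisection on the threshold tau
        tau = (lo + hi) / 2
        z = np.clip((scores - tau) / eps, -50, 50)
        u = 1.0 / (1.0 + np.exp(-z))
        if u.sum() > k:
            lo = tau                             # too many tokens selected: raise tau
        else:
            hi = tau
    return u                                     # near-binary as eps -> 0
```

As eps shrinks, the solution approaches a hard top-k indicator, while any eps > 0 keeps the selection differentiable for end-to-end training.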
Related papers
- Rethinking the Role of Token Retrieval in Multi-Vector Retrieval [22.508682857329912]
Multi-vector retrieval models such as ColBERT [Khattab and Zaharia, 2020] allow token-level interactions between queries and documents.
We present XTR, ConteXtualized Token Retriever, which introduces a simple, yet novel, objective function that encourages the model to retrieve the most important document tokens first.
arXiv Detail & Related papers (2023-04-04T17:37:06Z)
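A sketch of the XTR scoring idea above, where a document is scored using only the token pairs that survived the first-stage token retrieval (the data layout and names are hypothetical, not XTR's API):

```python
def xtr_style_score(Q, retrieved, doc_id):
    """Q: sequence of query token embeddings (e.g. numpy arrays).
    retrieved: per query token, the (doc_id, doc_token_embedding) pairs returned
    by a token-level nearest-neighbor search over the whole corpus.
    Tokens that were not retrieved contribute nothing, so no second pass
    over all of the document's token vectors is needed.
    """
    score = 0.0
    for q_vec, hits in zip(Q, retrieved):
        sims = [float(q_vec @ emb) for d, emb in hits if d == doc_id]
        if sims:
            score += max(sims)   # best retrieved alignment for this query token
    return score
```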
- Precise Zero-Shot Dense Retrieval without Relevance Labels [60.457378374671656]
Hypothetical Document Embeddings (HyDE) is a zero-shot dense retrieval system.
We show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever.
arXiv Detail & Related papers (2022-12-20T18:09:52Z)
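A minimal sketch of the HyDE pipeline, assuming caller-supplied generate (any instruction-following LLM) and embed (an unsupervised encoder such as Contriever) hooks; both names are hypothetical stand-ins:

```python
import numpy as np

def hyde_search(query, generate, embed, doc_ids, doc_matrix, top_k=10):
    """The query is never embedded directly: an LLM writes a hypothetical
    document answering it, and that document's embedding retrieves real ones."""
    fake_doc = generate(f"Write a passage that answers the question: {query}")
    q_vec = embed(fake_doc)            # reuse the document encoder, no labels needed
    sims = doc_matrix @ q_vec          # doc_matrix: (N, d) precomputed embeddings
    order = np.argsort(-sims)[:top_k]
    return [doc_ids[i] for i in order]
```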
- CITADEL: Conditional Token Interaction via Dynamic Lexical Routing for Efficient and Effective Multi-Vector Retrieval [72.90850213615427]
Multi-vector retrieval methods combine the merits of sparse (e.g. BM25) and dense (e.g. DPR) retrievers.
These methods are orders of magnitude slower and need much more space to store their indices compared to their single-vector counterparts.
We propose conditional token interaction via dynamic lexical routing, namely CITADEL, for efficient and effective multi-vector retrieval.
arXiv Detail & Related papers (2022-11-18T18:27:35Z)
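A rough sketch of conditional token interaction via lexical routing: each token vector is routed to a few lexical keys, and only token pairs that share a key interact, which is what allows the index to be sharded like an inverted file. The router and scoring below are illustrative assumptions, not CITADEL's implementation:

```python
import numpy as np

def route(token_embs, router_W, n_keys=1):
    """Assign each token vector to its top lexical key(s); router_W is a
    (d, n_vocab) learned routing matrix (hypothetical stand-in)."""
    logits = token_embs @ router_W
    return np.argsort(-logits, axis=1)[:, :n_keys]   # (n_tokens, n_keys)

def routed_score(Q, D, q_keys, d_keys):
    """Max-sim restricted to token pairs routed to a common key."""
    score = 0.0
    for i in range(len(Q)):
        best = 0.0
        for j in range(len(D)):
            if set(q_keys[i]) & set(d_keys[j]):      # shared routing key
                best = max(best, float(Q[i] @ D[j]))
        score += best
    return score
```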
- Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
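A sketch of the pseudo-query idea (the same shape also fits the Multi-View Document Representation paper further down): each document is encoded into several query-informed views, and search keeps the best-matching view. Function names are hypothetical:

```python
import numpy as np

def encode_document(doc, generate_queries, encoder, n_views=4):
    """Condition the document encoding on generated pseudo-queries to obtain
    multiple views; generate_queries and encoder are assumed hooks."""
    pseudo_queries = generate_queries(doc, n=n_views)
    return np.stack([encoder(q, doc) for q in pseudo_queries])  # (n_views, d)

def multi_view_score(q_vec, doc_views):
    """At search time, score against the best-matching view."""
    return float(np.max(doc_views @ q_vec))
```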
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
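A sketch of the dual-representation scoring a unified dense-plus-lexicon retriever implies: one encoder emits both a dense vector and sparse lexicon weights, and the final score fuses the two views. The interpolation (and alpha) is an illustrative assumption, not UnifieR's actual fusion:

```python
def unified_score(q_dense, d_dense, q_lex, d_lex, alpha=0.5):
    """q_dense/d_dense: dense vectors (e.g. numpy arrays).
    q_lex/d_lex: dicts mapping term -> weight (the lexicon-based view)."""
    dense = float(q_dense @ d_dense)
    lexical = sum(w * d_lex.get(t, 0.0) for t, w in q_lex.items())
    return alpha * dense + (1 - alpha) * lexical
```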
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
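A toy sketch of ngrams-as-identifiers retrieval: every passage is indexed under each of its ngrams, and LM scores of generated ngrams are aggregated per passage. The real system constrains decoding to corpus ngrams with an FM-index; the dict index and additive scoring here are deliberate simplifications:

```python
from collections import defaultdict

def build_ngram_index(passages, n=3):
    """passages: dict pid -> text. Any word ngram of a passage can identify it."""
    index = defaultdict(set)
    for pid, text in passages.items():
        words = text.split()
        for i in range(len(words) - n + 1):
            index[" ".join(words[i:i + n])].add(pid)
    return index

def ngram_search(scored_ngrams, index, top_k=10):
    """scored_ngrams: (ngram, lm_score) pairs from an autoregressive LM whose
    decoding is constrained to ngrams that actually occur in the corpus."""
    scores = defaultdict(float)
    for ngram, s in scored_ngrams:
        for pid in index.get(ngram, ()):
            scores[pid] += s
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```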
- Multi-View Document Representation Learning for Open-Domain Dense Retrieval [87.11836738011007]
This paper proposes a multi-view document representation learning framework.
It aims to produce multi-view embeddings to represent documents and align them with different queries.
Experiments show our method outperforms recent works and achieves state-of-the-art results.
arXiv Detail & Related papers (2022-03-16T03:36:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.