UnifieR: A Unified Retriever for Large-Scale Retrieval
- URL: http://arxiv.org/abs/2205.11194v2
- Date: Sun, 4 Jun 2023 12:59:36 GMT
- Title: UnifieR: A Unified Retriever for Large-Scale Retrieval
- Authors: Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Guodong Long, Kai Zhang,
Daxin Jiang
- Abstract summary: Large-scale retrieval is to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
- Score: 84.61239936314597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale retrieval is to recall relevant documents from a huge collection
given a query. It relies on representation learning to embed documents and
queries into a common semantic encoding space. According to the encoding space,
recent retrieval methods based on pre-trained language models (PLM) can be
coarsely categorized into either dense-vector or lexicon-based paradigms. These
two paradigms unveil the PLMs' representation capability in different
granularities, i.e., global sequence-level compression and local word-level
contexts, respectively. Inspired by their complementary global-local
contextualization and distinct representing views, we propose a new learning
framework, UnifieR which unifies dense-vector and lexicon-based retrieval in
one model with a dual-representing capability. Experiments on passage retrieval
benchmarks verify its effectiveness in both paradigms. A uni-retrieval scheme
is further presented with even better retrieval quality. We lastly evaluate the
model on BEIR benchmark to verify its transferability.
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z) - ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval.
ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
arXiv Detail & Related papers (2024-06-25T12:47:04Z) - MINERS: Multilingual Language Models as Semantic Retrievers [23.686762008696547]
This paper introduces the MINERS, a benchmark designed to evaluate the ability of multilingual language models in semantic retrieval tasks.
We create a comprehensive framework to assess the robustness of LMs in retrieving samples across over 200 diverse languages.
Our results demonstrate that by solely retrieving semantically similar embeddings yields performance competitive with state-of-the-art approaches.
arXiv Detail & Related papers (2024-06-11T16:26:18Z) - Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z) - BERM: Training the Balanced and Extractable Representation for Matching
to Improve Generalization Ability of Dense Retrieval [54.66399120084227]
We propose a novel method to improve the generalization of dense retrieval via capturing matching signal called BERM.
Dense retrieval has shown promise in the first-stage retrieval process when trained on in-domain labeled datasets.
arXiv Detail & Related papers (2023-05-18T15:43:09Z) - RetroMAE-2: Duplex Masked Auto-Encoder For Pre-Training
Retrieval-Oriented Language Models [12.37229805276939]
We propose a novel pre-training method called Duplex Masked Auto-Encoder, a.k.a. DupMAE.
It is designed to improve the quality semantic representation where all contextualized embeddings of the pretrained model can be leveraged.
arXiv Detail & Related papers (2023-05-04T05:37:22Z) - Autoregressive Search Engines: Generating Substrings as Document
Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z) - A Proposed Conceptual Framework for a Representational Approach to
Information Retrieval [42.67826268399347]
This paper outlines a conceptual framework for understanding recent developments in information retrieval and natural language processing.
I propose a representational approach that breaks the core text retrieval problem into a logical scoring model and a physical retrieval model.
arXiv Detail & Related papers (2021-10-04T15:57:02Z) - Coarse-to-Fine Memory Matching for Joint Retrieval and Classification [0.7081604594416339]
We present a novel end-to-end language model for joint retrieval and classification.
We evaluate it on the standard blind test set of the FEVER fact verification dataset.
We extend exemplar auditing to this setting for analyzing and constraining the model.
arXiv Detail & Related papers (2020-11-29T05:06:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.