UnifieR: A Unified Retriever for Large-Scale Retrieval
- URL: http://arxiv.org/abs/2205.11194v2
- Date: Sun, 4 Jun 2023 12:59:36 GMT
- Title: UnifieR: A Unified Retriever for Large-Scale Retrieval
- Authors: Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Guodong Long, Kai Zhang,
Daxin Jiang
- Abstract summary: Large-scale retrieval is to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
- Score: 84.61239936314597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale retrieval is to recall relevant documents from a huge collection
given a query. It relies on representation learning to embed documents and
queries into a common semantic encoding space. According to the encoding space,
recent retrieval methods based on pre-trained language models (PLM) can be
coarsely categorized into either dense-vector or lexicon-based paradigms. These
two paradigms unveil the PLMs' representation capability in different
granularities, i.e., global sequence-level compression and local word-level
contexts, respectively. Inspired by their complementary global-local
contextualization and distinct representing views, we propose a new learning
framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in
one model with a dual-representing capability. Experiments on passage retrieval
benchmarks verify its effectiveness in both paradigms. A uni-retrieval scheme
is further presented with even better retrieval quality. Lastly, we evaluate the
model on the BEIR benchmark to verify its transferability.
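To make the two paradigms in the abstract concrete, here is a minimal sketch of scoring one query against documents that each carry both a dense vector (the global, sequence-level view) and a sparse term-weight map (the local, word-level view). The linear interpolation in `combined_score`, the `alpha` parameter, and the toy data are illustrative assumptions only; the abstract does not specify how UnifieR's uni-retrieval scheme combines the two signals.

```python
def dense_score(q_vec, d_vec):
    """Dot product between sequence-level (global) embeddings."""
    return sum(q * d for q, d in zip(q_vec, d_vec))


def lexicon_score(q_weights, d_weights):
    """Dot product between sparse term-weight maps (word-level view).

    Only terms present in both the query and the document contribute.
    """
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)


def combined_score(query, doc, alpha=0.5):
    """Hypothetical combination: linear interpolation of the two scores.

    This is an assumption for illustration, not the paper's scheme.
    """
    dense = dense_score(query["dense"], doc["dense"])
    lexical = lexicon_score(query["lex"], doc["lex"])
    return alpha * dense + (1 - alpha) * lexical


# Toy query and documents (made-up values).
query = {"dense": [0.5, 0.5], "lex": {"retrieval": 1.0, "dense": 0.5}}
docs = {
    "d1": {"dense": [1.0, 0.0], "lex": {"retrieval": 2.0}},
    "d2": {"dense": [0.0, 1.0], "lex": {"cooking": 3.0}},
}

# Rank document ids by combined score, highest first.
ranked = sorted(docs, key=lambda k: combined_score(query, docs[k]), reverse=True)
print(ranked)
```

In this toy setup both documents get the same dense score, so the lexical overlap on "retrieval" is what separates them. That is the kind of complementarity between the two views that the abstract refers to.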
Related papers
- GeAR: Generation Augmented Retrieval [82.20696567697016]
Document retrieval techniques form the foundation for the development of large-scale information systems.
The prevailing methodology is to construct a bi-encoder and compute the semantic similarity.
We propose a new method called Generation Augmented Retrieval (GeAR) that incorporates well-designed fusion and decoding modules.
arXiv Detail & Related papers (2025-01-06T05:29:00Z)
- Large Language Model Can Be a Foundation for Hidden Rationale-Based Retrieval [12.83513794686623]
In this paper, we propose and study a more challenging type of retrieval task, called hidden rationale retrieval.
To address such problems, an instruction-tuned large language model (LLM) with a cross-encoder architecture could be a reasonable choice.
We name this retrieval framework RaHoRe and verify its superior zero-shot and fine-tuned performance on Emotional Support Conversation (ESC).
arXiv Detail & Related papers (2024-12-21T13:19:15Z)
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z)
- ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval.
ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
arXiv Detail & Related papers (2024-06-25T12:47:04Z)
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- RetroMAE-2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models [12.37229805276939]
We propose a novel pre-training method called Duplex Masked Auto-Encoder, a.k.a. DupMAE.
It is designed to improve the quality of semantic representation, where all contextualized embeddings of the pretrained model can be leveraged.
arXiv Detail & Related papers (2023-05-04T05:37:22Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- A Proposed Conceptual Framework for a Representational Approach to Information Retrieval [42.67826268399347]
This paper outlines a conceptual framework for understanding recent developments in information retrieval and natural language processing.
I propose a representational approach that breaks the core text retrieval problem into a logical scoring model and a physical retrieval model.
arXiv Detail & Related papers (2021-10-04T15:57:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.