RetroMAE-2: Duplex Masked Auto-Encoder For Pre-Training
Retrieval-Oriented Language Models
- URL: http://arxiv.org/abs/2305.02564v1
- Date: Thu, 4 May 2023 05:37:22 GMT
- Title: RetroMAE-2: Duplex Masked Auto-Encoder For Pre-Training
Retrieval-Oriented Language Models
- Authors: Shitao Xiao, Zheng Liu, Yingxia Shao, Zhao Cao
- Abstract summary: We propose a novel pre-training method called Duplex Masked Auto-Encoder, a.k.a. DupMAE.
It is designed to improve the quality of semantic representation, where all contextualized embeddings of the pre-trained model can be leveraged.
- Score: 12.37229805276939
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To better support information retrieval tasks such as web search and
open-domain question answering, growing effort has been made to develop
retrieval-oriented language models, e.g., RetroMAE and many others. Most of the
existing works focus on improving the semantic representation capability of
the contextualized embedding of the [CLS] token. However, a recent study shows
that the ordinary tokens besides [CLS] may provide extra information, which
helps to produce a better representation effect. As such, it is necessary to
extend the current methods so that all contextualized embeddings can be jointly
pre-trained for retrieval tasks. In this work, we propose a novel
pre-training method called Duplex Masked Auto-Encoder, a.k.a. DupMAE. It is
designed to improve the quality of semantic representation, where all
contextualized embeddings of the pre-trained model can be leveraged. It takes
advantage of two complementary auto-encoding tasks: one reconstructs the input
sentence on top of the [CLS] embedding; the other predicts the bag-of-words
feature of the input sentence based on the ordinary tokens' embeddings. The two
tasks are jointly conducted to train a unified encoder, in which all
contextualized embeddings are aggregated in a compact way to produce the final
semantic representation. DupMAE is simple but empirically competitive: it
substantially improves the pre-trained model's representation capability and
transferability, achieving superior retrieval performance on popular
benchmarks such as MS MARCO and BEIR.
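Below is a minimal sketch of the two auto-encoding objectives described in the abstract, assuming a BERT-style encoder (e.g., a HuggingFace BertModel). The one-layer decoder, the max-pooled bag-of-words head, and all names are illustrative simplifications, not the authors' implementation.
```python
# Minimal sketch of DupMAE's two complementary objectives, assuming a
# BERT-style encoder; NOT the authors' implementation.
import torch.nn as nn
import torch.nn.functional as F

class DupMAESketch(nn.Module):
    def __init__(self, encoder, vocab_size, hidden=768):
        super().__init__()
        self.encoder = encoder                       # e.g. a HuggingFace BertModel
        # Task 1: a weak one-layer decoder that reconstructs the (masked)
        # decoder-side input conditioned only on the [CLS] embedding.
        self.decoder = nn.TransformerDecoderLayer(hidden, nhead=12, batch_first=True)
        self.lm_head = nn.Linear(hidden, vocab_size)
        # Task 2: project ordinary tokens' embeddings to the vocabulary space
        # and predict the input's bag-of-words feature.
        self.bow_head = nn.Linear(hidden, vocab_size)

    def forward(self, enc_inputs, dec_input_ids, dec_labels, bow_labels):
        hidden = self.encoder(**enc_inputs).last_hidden_state   # [B, L, H]
        cls, tokens = hidden[:, :1], hidden[:, 1:]

        # Task 1: sentence reconstruction on top of the [CLS] embedding.
        dec_emb = self.encoder.embeddings(dec_input_ids)
        dec_logits = self.lm_head(self.decoder(tgt=dec_emb, memory=cls))
        loss_dec = F.cross_entropy(dec_logits.flatten(0, 1),
                                   dec_labels.flatten(), ignore_index=-100)

        # Task 2: bag-of-words prediction from the ordinary tokens' embeddings.
        # Max-pooling vocabulary logits over positions is one plausible
        # aggregation choice; bow_labels is the input's term distribution.
        bow_logits = self.bow_head(tokens).max(dim=1).values    # [B, V]
        loss_bow = -(bow_labels * F.log_softmax(bow_logits, -1)).sum(-1).mean()

        return loss_dec + loss_bow
```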
Related papers
- CoT-MAE v2: Contextual Masked Auto-Encoder with Multi-view Modeling for
Passage Retrieval [34.08763911138496]
This study brings multi-view modeling to the contextual masked auto-encoder.
We refer to this multi-view pretraining method as CoT-MAE v2.
arXiv Detail & Related papers (2023-04-05T08:00:38Z)
- MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are
Better Dense Retrievers [140.0479479231558]
In this work, we aim to unify a variety of pre-training tasks into a multi-task pre-trained model, namely MASTER.
MASTER utilizes a shared-encoder, multi-decoder architecture that constructs a representation bottleneck to compress the abundant semantic information across tasks into dense vectors (see the sketch below).
arXiv Detail & Related papers (2022-12-15T13:57:07Z)
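A rough sketch of the shared-encoder, multi-decoder bottleneck idea, assuming a BERT-style encoder: every task-specific decoder is conditioned only on the [CLS] vector, so cross-task semantics must pass through that dense bottleneck. Module and task names are assumptions, not MASTER's actual code.
```python
# Illustrative sketch of a shared-encoder / multi-decoder bottleneck in the
# spirit of MASTER; module and task names are assumptions, not the paper's code.
import torch.nn as nn

class BottleneckedMultiDecoder(nn.Module):
    def __init__(self, encoder, task_names, hidden=768, vocab_size=30522):
        super().__init__()
        self.encoder = encoder                        # shared BERT-style encoder
        # One lightweight decoder per pre-training task; every decoder sees only
        # the [CLS] vector, which acts as the dense representation bottleneck.
        self.decoders = nn.ModuleDict({
            name: nn.TransformerDecoderLayer(hidden, nhead=12, batch_first=True)
            for name in task_names})
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden, vocab_size) for name in task_names})

    def forward(self, enc_inputs, task_name, dec_embeddings):
        hidden = self.encoder(**enc_inputs).last_hidden_state
        bottleneck = hidden[:, :1]                    # [B, 1, H]
        dec_out = self.decoders[task_name](tgt=dec_embeddings, memory=bottleneck)
        return self.heads[task_name](dec_out)         # per-token vocabulary logits
```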
- RetroMAE v2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models [3.4523793651427113]
We propose the duplex masked auto-encoder, a.k.a. DupMAE, which targets improving the semantic representation capacity of the contextualized embeddings of both the [CLS] token and ordinary tokens.
DupMAE is simple but empirically competitive: with a small decoding cost, it substantially contributes to the model's representation capability and transferability.
arXiv Detail & Related papers (2022-11-16T08:57:55Z)
- ConTextual Mask Auto-Encoder for Dense Passage Retrieval [49.49460769701308]
CoT-MAE is a simple yet effective generative pre-training method for dense passage retrieval.
It learns to compress the sentence semantics into a dense vector through self-supervised and context-supervised masked auto-encoding.
We conduct experiments on large-scale passage retrieval benchmarks and show considerable improvements over strong baselines.
arXiv Detail & Related papers (2022-08-16T11:17:22Z)
- Masked Autoencoders As The Unified Learners For Pre-Trained Sentence
Representation [77.47617360812023]
We extend the recently proposed MAE style pre-training strategy, RetroMAE, to support a wide variety of sentence representation tasks.
The first stage performs RetroMAE over generic corpora, like Wikipedia, BookCorpus, etc., from which the base model is learned.
The second stage takes place on domain-specific data, e.g., MS MARCO and NLI, where the base model is continually trained based on RetroMAE and contrastive learning.
arXiv Detail & Related papers (2022-07-30T14:34:55Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability (see the sketch below).
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
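A toy sketch of producing a dense-vector and a lexicon-level representation from one shared encoder and fusing their scores, loosely in the spirit of UnifieR; the pooling and additive fusion here are assumptions for illustration, not the paper's exact design.
```python
# Toy sketch of dual (dense-vector + lexicon) representations from one shared
# encoder; the pooling and fusion are assumptions, not UnifieR's exact design.
import torch
import torch.nn as nn

class DualRepresentationRetriever(nn.Module):
    def __init__(self, encoder, hidden=768, vocab_size=30522):
        super().__init__()
        self.encoder = encoder                        # shared BERT-style encoder
        self.lexicon_proj = nn.Linear(hidden, vocab_size)

    def represent(self, inputs):
        hidden = self.encoder(**inputs).last_hidden_state
        dense = hidden[:, 0]                          # dense vector from [CLS]
        # Lexicon-level weights over the vocabulary, max-pooled over positions.
        lexical = torch.relu(self.lexicon_proj(hidden)).max(dim=1).values
        return dense, lexical

    def score(self, query_inputs, doc_inputs):
        q_dense, q_lex = self.represent(query_inputs)
        d_dense, d_lex = self.represent(doc_inputs)
        dense_score = (q_dense * d_dense).sum(-1)     # dense-vector matching
        lexicon_score = (q_lex * d_lex).sum(-1)       # lexicon-based matching
        return dense_score + lexicon_score            # simple additive fusion
```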
- COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining [59.169836983883656]
COCO-LM is a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences.
COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences.
Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.
arXiv Detail & Related papers (2021-02-16T22:24:29Z)
- Automated Concatenation of Embeddings for Structured Prediction [75.44925576268052]
We propose Automated Concatenation of Embeddings (ACE) to automate the process of finding better concatenations of embeddings for structured prediction tasks.
We follow strategies in reinforcement learning to optimize the parameters of a controller and compute the reward based on the accuracy of a task model (see the sketch below).
arXiv Detail & Related papers (2020-10-10T14:03:20Z)
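A compact sketch of that controller-and-reward loop: a controller samples which embedding types to concatenate, the task model's accuracy acts as the reward, and the selection probabilities are updated with a REINFORCE-style gradient. The embedding names and the placeholder reward are illustrative, not ACE's actual implementation.
```python
# Compact sketch of an ACE-style search loop: a controller samples which
# embedding types to concatenate, the task model's accuracy is the reward, and
# the selection probabilities are updated with REINFORCE. Names and the
# placeholder reward are illustrative, not the authors' implementation.
import torch

embedding_types = ["word", "char", "bert", "flair"]   # hypothetical candidates
logits = torch.zeros(len(embedding_types), requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)
baseline = 0.0

def train_and_evaluate(selected):
    # Placeholder: in ACE this would train a structured-prediction model on the
    # concatenation of the selected embeddings and return its dev accuracy.
    return 0.5 + 0.1 * len(selected)                  # dummy value so it runs

for step in range(30):
    probs = torch.sigmoid(logits)                     # independent selection gates
    mask = torch.bernoulli(probs)                     # sampled concatenation
    selected = [e for e, m in zip(embedding_types, mask) if m > 0]
    reward = train_and_evaluate(selected)             # task accuracy as reward

    # REINFORCE: raise the log-probability of selections that beat the baseline.
    log_prob = (mask * torch.log(probs + 1e-8)
                + (1 - mask) * torch.log(1 - probs + 1e-8)).sum()
    loss = -(reward - baseline) * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    baseline = 0.9 * baseline + 0.1 * reward
```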