Trans-Encoder: Unsupervised sentence-pair modelling through self- and
mutual-distillations
- URL: http://arxiv.org/abs/2109.13059v2
- Date: Tue, 28 Sep 2021 15:55:44 GMT
- Title: Trans-Encoder: Unsupervised sentence-pair modelling through self- and
mutual-distillations
- Authors: Fangyu Liu, Yunlong Jiao, Jordan Massiah, Emine Yilmaz, Serhii
Havrylov
- Abstract summary: Bi-encoders produce fixed-dimensional sentence representations and are computationally efficient.
Cross-encoders can leverage their attention heads to exploit inter-sentence interactions for better performance.
Trans-Encoder combines the two learning paradigms into an iterative joint framework to simultaneously learn enhanced bi- and cross-encoders.
- Score: 22.40667024030858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In NLP, a large volume of tasks involve pairwise comparison between two
sequences (e.g. sentence similarity and paraphrase identification).
Predominantly, two formulations are used for sentence-pair tasks: bi-encoders
and cross-encoders. Bi-encoders produce fixed-dimensional sentence
representations and are computationally efficient, however, they usually
underperform cross-encoders. Cross-encoders can leverage their attention heads
to exploit inter-sentence interactions for better performance but they require
task fine-tuning and are computationally more expensive. In this paper, we
present a completely unsupervised sentence representation model termed
Trans-Encoder that combines the two learning paradigms into an iterative joint
framework to simultaneously learn enhanced bi- and cross-encoders.
Specifically, starting from a pre-trained Language Model (PLM), we first convert
it into an unsupervised bi-encoder and then alternate between the bi- and
cross-encoder task formulations. In each alternation, one task formulation
produces pseudo-labels that serve as learning signals for the other. We then
propose an extension that conducts this self-distillation on multiple PLMs in
parallel and uses the average of their pseudo-labels for
mutual-distillation. Trans-Encoder creates, to the best
of our knowledge, the first completely unsupervised cross-encoder and also a
state-of-the-art unsupervised bi-encoder for sentence similarity. Both the
bi-encoder and cross-encoder formulations of Trans-Encoder outperform recently
proposed state-of-the-art unsupervised sentence encoders such as Mirror-BERT
and SimCSE by up to 5% on the sentence similarity benchmarks.
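To make the alternation described in the abstract concrete, below is a minimal sketch of the self-distillation loop in PyTorch with Hugging Face Transformers. It is not the authors' released implementation: the roberta-base backbone, mean pooling, the MSE distillation loss, the sigmoid score squashing, the toy sentence pairs, and all hyper-parameters are illustrative assumptions.

```python
# Minimal sketch of Trans-Encoder's alternating self-distillation, assuming a
# roberta-base backbone, mean pooling, an MSE distillation loss, and toy data.
# This illustrates the idea described above; it is not the authors' code.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer

plm = "roberta-base"                                   # assumed PLM backbone
tok = AutoTokenizer.from_pretrained(plm)
bi_encoder = AutoModel.from_pretrained(plm)            # sentence-embedding view
cross_encoder = AutoModelForSequenceClassification.from_pretrained(plm, num_labels=1)

def bi_scores(pairs):
    """Cosine similarity between mean-pooled bi-encoder embeddings."""
    def embed(sents):
        batch = tok(list(sents), padding=True, truncation=True, return_tensors="pt")
        hidden = bi_encoder(**batch).last_hidden_state            # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)       # mean pooling
    left, right = zip(*pairs)
    return F.cosine_similarity(embed(left), embed(right))

def cross_scores(pairs):
    """Single regression score from the cross-encoder over each concatenated pair."""
    left, right = zip(*pairs)
    batch = tok(list(left), list(right), padding=True, truncation=True,
                return_tensors="pt")
    return cross_encoder(**batch).logits.squeeze(-1)

def distill(student_fn, student, pairs, pseudo_labels, lr=2e-5, steps=1):
    """Regress the student's scores onto the teacher's pseudo-labels."""
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(student_fn(pairs), pseudo_labels)
        opt.zero_grad(); loss.backward(); opt.step()

# Toy stand-in for the unlabeled sentence pairs used as the raw training signal.
pairs = [("A man is playing a guitar.", "Someone plays the guitar."),
         ("The cat is sleeping.", "A rocket is launching.")]

for _ in range(3):                                     # alternation rounds
    with torch.no_grad():                              # bi-encoder as teacher
        labels = bi_scores(pairs)
    distill(cross_scores, cross_encoder, pairs, labels)
    with torch.no_grad():                              # cross-encoder as teacher
        labels = torch.sigmoid(cross_scores(pairs))    # squash scores to (0, 1)
    distill(bi_scores, bi_encoder, pairs, labels)

# For mutual-distillation, run the loop over several PLMs in parallel and
# average their pseudo-labels before each distillation step.
```

The key property the loop reflects is that neither formulation ever sees gold labels: each round, whichever encoder is currently stronger supplies soft targets for the other.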
Related papers
- How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval? [99.87554379608224]
The cross-modal similarity score distribution of a cross-encoder is more concentrated, while that of a dual-encoder is nearly normal.
Only the relative order between hard negatives conveys valid knowledge, while the order among easy negatives carries little significance.
We propose a novel Contrastive Partial Ranking Distillation (DCPR) method, which implements the objective of mimicking the relative order among hard negative samples via contrastive learning.
arXiv Detail & Related papers (2024-07-10T09:10:01Z) - CrossMPT: Cross-attention Message-Passing Transformer for Error Correcting Codes [14.631435001491514]
We propose a novel Cross-attention Message-Passing Transformer (CrossMPT).
We show that CrossMPT significantly outperforms existing neural network-based decoders for various code classes.
Notably, CrossMPT achieves this decoding performance improvement, while significantly reducing the memory usage, complexity, inference time, and training time.
arXiv Detail & Related papers (2024-05-02T06:30:52Z) - Triple-Encoders: Representations That Fire Together, Wire Together [51.15206713482718]
Contrastive Learning is a representation learning method that encodes relative distances between utterances into the embedding space via a bi-encoder.
This study introduces triple-encoders, which efficiently compute distributed utterance mixtures from these independently encoded utterances.
We find that triple-encoders lead to a substantial improvement over bi-encoders, and even to better zero-shot generalization than single-vector representation models.
arXiv Detail & Related papers (2024-02-19T18:06:02Z) - Lossless Acceleration for Seq2seq Generation with Aggressive Decoding [74.12096349944497]
Aggressive Decoding is a novel decoding algorithm for seq2seq generation.
Our approach aims to yield identical (or better) generation compared with autoregressive decoding.
We test Aggressive Decoding on the most popular 6-layer Transformer model on GPU in multiple seq2seq tasks.
arXiv Detail & Related papers (2022-05-20T17:59:00Z) - ConvFiT: Conversational Fine-Tuning of Pretrained Language Models [42.7160113690317]
Transformer-based language models (LMs) pretrained on large text collections have been shown to store a wealth of semantic knowledge.
We propose ConvFiT, a simple and efficient two-stage procedure which turns any pretrained LM into a universal conversational encoder.
arXiv Detail & Related papers (2021-09-21T12:16:56Z) - Transformer Based Deliberation for Two-Pass Speech Recognition [46.86118010771703]
Speech recognition systems must generate words quickly while also producing accurate results.
Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words, and a second-pass decoder that requires more context but is more accurate.
Previous work has established that a deliberation network can be an effective second-pass model.
arXiv Detail & Related papers (2021-01-27T18:05:22Z) - Scheduled Sampling in Vision-Language Pretraining with Decoupled
Encoder-Decoder Network [99.03895740754402]
We propose a two-stream decoupled encoder-decoder design, in which the cross-modal encoder and decoder are decoupled from each other.
As an alternative, we propose a primary scheduled sampling strategy that mitigates this discrepancy by pretraining the encoder-decoder in a two-pass manner.
arXiv Detail & Related papers (2021-01-27T17:36:57Z) - Fast Interleaved Bidirectional Sequence Generation [90.58793284654692]
We introduce a decoder that generates target words from the left-to-right and right-to-left directions simultaneously.
We show that we can easily convert a standard architecture for unidirectional decoding into a bidirectional decoder.
Our interleaved bidirectional decoder (IBDecoder) retains the model simplicity and training efficiency of the standard Transformer.
arXiv Detail & Related papers (2020-10-27T17:38:51Z) - Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for
Pairwise Sentence Scoring Tasks [59.13635174016506]
We present a simple yet efficient data augmentation strategy called Augmented SBERT.
We use the cross-encoder to label a larger set of input pairs to augment the training data for the bi-encoder.
We show that, in this process, selecting the sentence pairs is non-trivial and crucial for the success of the method.
arXiv Detail & Related papers (2020-10-16T08:43:27Z)
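The Augmented SBERT entry above describes the closest supervised precursor to Trans-Encoder: a cross-encoder pseudo-labels a larger pool of pairs to train a bi-encoder. Below is a minimal sketch of that augmentation step using the sentence-transformers library; the model names, toy pairs, and hyper-parameters are illustrative assumptions, and the pair-selection strategy the paper highlights as crucial is omitted.

```python
# Minimal sketch of Augmented SBERT's silver-data step: a cross-encoder
# pseudo-labels unlabeled pairs, and a bi-encoder is trained on those labels.
# Model names and hyper-parameters are illustrative assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample, SentenceTransformer, losses

teacher = CrossEncoder("cross-encoder/stsb-roberta-base")   # assumed teacher model

# Toy stand-in for the larger pool of input pairs labeled by the cross-encoder.
unlabeled_pairs = [("A man plays guitar.", "Someone is playing a guitar."),
                   ("The cat is sleeping.", "A rocket is launching.")]
pseudo_scores = teacher.predict(unlabeled_pairs)            # one score per pair

# Build the silver training set from the pseudo-labeled pairs.
silver_data = [InputExample(texts=list(pair), label=float(score))
               for pair, score in zip(unlabeled_pairs, pseudo_scores)]

student = SentenceTransformer("bert-base-uncased")          # assumed bi-encoder
loader = DataLoader(silver_data, shuffle=True, batch_size=2)
student.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(student))],
            epochs=1, warmup_steps=0)
```

Trans-Encoder can be read as removing the remaining supervision from this recipe: instead of a fine-tuned cross-encoder teacher, both teacher and student are bootstrapped from the same unsupervised PLM and distill into each other iteratively.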