Scheduled Sampling in Vision-Language Pretraining with Decoupled
Encoder-Decoder Network
- URL: http://arxiv.org/abs/2101.11562v1
- Date: Wed, 27 Jan 2021 17:36:57 GMT
- Title: Scheduled Sampling in Vision-Language Pretraining with Decoupled
Encoder-Decoder Network
- Authors: Yehao Li and Yingwei Pan and Ting Yao and Jingwen Chen and Tao Mei
- Abstract summary: We propose a two-stream decoupled encoder-decoder design, in which a decoupled cross-modal encoder and decoder separately perform the understanding and generation proxy tasks.
We further propose a scheduled sampling strategy that mitigates the discrepancy between mask-token pretraining and mask-free fine-tuning by pretraining the encoder-decoder in a two-pass manner.
- Score: 99.03895740754402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite impressive vision-language (VL) pretraining with BERT-based
encoders for VL understanding, the pretraining of a universal encoder-decoder
for both VL understanding and generation remains challenging. The difficulty
originates from the inherently different peculiarities of the two disciplines,
e.g., VL understanding tasks capitalize on the unrestricted message passing
across modalities, while generation tasks only employ visual-to-textual message
passing. In this paper, we start with a two-stream decoupled design of the
encoder-decoder structure, in which a decoupled cross-modal encoder and
decoder separately perform each type of proxy task, for simultaneous VL
understanding and generation pretraining. Moreover, for VL
pretraining, the dominant way is to replace some input visual/word tokens with
mask tokens and enforce the multi-modal encoder/decoder to reconstruct the
original tokens, but no mask token is involved when fine-tuning on downstream
tasks. As an alternative, we propose a primary scheduled sampling strategy that
elegantly mitigates such discrepancy by pretraining the encoder-decoder in a
two-pass manner. Extensive experiments demonstrate the compelling
generalizability of our pretrained encoder-decoder by fine-tuning on four VL
understanding and generation downstream tasks. Source code is available at
\url{https://github.com/YehLi/TDEN}.
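The two-pass scheduled sampling described in the abstract can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration, not code from the TDEN repository: `model`, `mask_positions`, and `schedule_prob` are hypothetical names, and the loss is a plain masked-token cross-entropy. Pass one reconstructs the masked tokens from [MASK] placeholders as usual; pass two replaces those placeholders with tokens sampled from the first-pass predictions (at a scheduled rate), so the second-pass input contains no mask tokens and better matches what the model sees at fine-tuning.

```python
import torch
import torch.nn.functional as F

def two_pass_scheduled_sampling_step(model, token_ids, visual_feats,
                                     mask_positions, mask_id, schedule_prob):
    """One pretraining step; a sketch of two-pass scheduled sampling.

    token_ids:      (B, L) ground-truth word tokens
    visual_feats:   (B, R, D) image region features
    mask_positions: (B, L) boolean mask of positions chosen for masking
    schedule_prob:  probability of feeding a sampled token instead of [MASK]
                    in the second pass (annealed over training)
    """
    # Pass 1: standard masked reconstruction with [MASK] placeholders.
    first_ids = token_ids.clone()
    first_ids[mask_positions] = mask_id
    logits_1 = model(first_ids, visual_feats)                 # (B, L, V)
    loss_1 = F.cross_entropy(logits_1[mask_positions],
                             token_ids[mask_positions])

    # Sample replacement tokens for the masked slots from the pass-1 output.
    with torch.no_grad():
        probs = F.softmax(logits_1[mask_positions], dim=-1)   # (N, V)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)

    # Pass 2: feed sampled (mask-free) tokens at the scheduled rate so the
    # input distribution moves toward the mask-free inputs of fine-tuning.
    second_ids = token_ids.clone()
    use_sampled = torch.rand(sampled.shape, device=sampled.device) < schedule_prob
    second_ids[mask_positions] = torch.where(
        use_sampled, sampled, torch.full_like(sampled, mask_id))
    logits_2 = model(second_ids, visual_feats)
    loss_2 = F.cross_entropy(logits_2[mask_positions],
                             token_ids[mask_positions])

    return loss_1 + loss_2
```

In the two-stream decoupled design, an analogous step would be run for each stream, with the cross-modal encoder handling understanding-oriented proxy tasks and the decoder handling generation-oriented ones.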
Related papers
- i-Code V2: An Autoregressive Generation Framework over Vision, Language,
and Speech Data [101.52821120195975]
i-Code V2 is the first model capable of generating natural language from any combination of Vision, Language, and Speech data.
The system is pretrained end-to-end on a large collection of dual- and single-modality datasets.
arXiv Detail & Related papers (2023-05-21T01:25:44Z)
- Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks [118.49566068398642]
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets.
Unimodal encoders are pretrained with simpler annotations that are less cost-prohibitive, achieving scales of hundreds of millions to billions.
We propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders.
arXiv Detail & Related papers (2022-04-22T04:41:04Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming language.
We propose a one-to-one mapping method to transform an AST into a sequence structure that retains all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
arXiv Detail & Related papers (2022-03-08T04:48:07Z)
- Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations [22.40667024030858]
Bi-encoders produce fixed-dimensional sentence representations and are computationally efficient.
Cross-encoders can leverage their attention heads to exploit inter-sentence interactions for better performance.
Trans-Encoder combines the two learning paradigms into an iterative joint framework to simultaneously learn enhanced bi- and cross-encoders.
arXiv Detail & Related papers (2021-09-27T14:06:47Z)
- Parallel Refinements for Lexically Constrained Text Generation with BART [0.0]
We propose Constrained BART (CBART) for lexically constrained text generation.
CBART transfers part of the generation burden from the decoder to the encoder by decomposing this task into two sub-tasks, thereby improving the sentence quality.
Experiment results on One-Billion-Word and Yelp show that CBART can generate plausible text with high quality and diversity while significantly accelerating inference.
arXiv Detail & Related papers (2021-09-26T03:56:45Z)
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation [36.47905744758698]
We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers.
Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning.
arXiv Detail & Related papers (2021-09-02T12:21:06Z)
- DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders [92.90543340071007]
We introduce DeltaLM, a pretrained multilingual encoder-decoder model.
Specifically, we augment the pretrained multilingual encoder with a decoder and pre-train it in a self-supervised way.
Experiments show that DeltaLM outperforms various strong baselines on both natural language generation and translation tasks.
arXiv Detail & Related papers (2021-06-25T16:12:10Z)