OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and
Generation
- URL: http://arxiv.org/abs/2107.00249v1
- Date: Thu, 1 Jul 2021 06:59:44 GMT
- Title: OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and
Generation
- Authors: Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Mingzhen Sun,
Weining Wang, Jinqiao Wang, Hanqing Lu
- Abstract summary: We propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation.
OPT is constructed in an encoder-decoder framework, including three single-modal encoders to generate token-based embeddings for each modality.
OPT can learn strong image-text-audio multi-modal representations and achieve promising results on a variety of cross-modal understanding and generation tasks.
- Score: 52.037766778458504
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose an Omni-perception Pre-Trainer (OPT) for
cross-modal understanding and generation, by jointly modeling visual, text and
audio resources. OPT is constructed in an encoder-decoder framework, including
three single-modal encoders to generate token-based embeddings for each
modality, a cross-modal encoder to encode the correlations among the three
modalities, and two cross-modal decoders to generate text and image
respectively. For the OPT's pre-training, we design a multi-task pretext
learning scheme to model multi-modal resources from three different data
granularities, i.e., token-, modality-, and sample-level modeling, through which
OPT learns to align and translate among different modalities. The pre-training
task is carried out on a large amount of image-text-audio triplets from Open
Images. Experimental results show that OPT can learn strong image-text-audio
multi-modal representations and achieve promising results on a variety of
cross-modal understanding and generation tasks.
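
The three data granularities named in the abstract can be illustrated with a small sketch. The abstract does not spell out the pretext tasks, so everything below is an assumption made for illustration: the `[MASK]` symbol, the `Triplet` container, the masking ratio, and the three helper functions are hypothetical and are not taken from the paper or its code.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


@dataclass
class Triplet:
    """One image-text-audio sample, each modality already tokenized (assumed)."""
    image: Optional[List[str]]
    text: Optional[List[str]]
    audio: Optional[List[str]]


def token_level_mask(tokens: List[str], ratio: float, rng: random.Random) -> List[str]:
    """Token level: hide a fraction of the tokens inside one modality."""
    n_mask = max(1, int(len(tokens) * ratio))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in hidden else tok for i, tok in enumerate(tokens)]


def modality_level_mask(sample: Triplet, modality: str) -> Triplet:
    """Modality level: drop one whole modality, leaving the other two intact."""
    fields = {"image": sample.image, "text": sample.text, "audio": sample.audio}
    fields[modality] = None
    return Triplet(**fields)


def sample_level_pairs(batch: List[Triplet], rng: random.Random) -> List[Tuple[Triplet, int]]:
    """Sample level: matched triplets get label 1, triplets with swapped text get label 0."""
    pairs = [(s, 1) for s in batch]
    for i, s in enumerate(batch):
        # Borrow the text of a different sample to build a mismatched triplet.
        j = rng.choice([k for k in range(len(batch)) if k != i])
        pairs.append((Triplet(s.image, batch[j].text, s.audio), 0))
    return pairs
```

The helpers assume non-empty token lists and a batch of at least two samples. A model trained on such inputs would be asked to recover masked tokens, reconstruct a dropped modality, and classify whether a triplet is matched; how OPT actually formulates these objectives is described in the paper itself.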
Related papers
- MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens.
MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z)
- TT-BLIP: Enhancing Fake News Detection Using BLIP and Tri-Transformer [0.276240219662896]
This paper introduces an end-to-end model called TT-BLIP that applies bootstrapping language-image pretraining (BLIP) for unified vision-language understanding and generation.
The experiments are performed using two fake news datasets, Weibo and Gossipcop.
arXiv Detail & Related papers (2024-03-19T06:36:42Z)
- UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion [36.06457895469353]
UNIMO-G is a conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs.
It excels in both text-to-image generation and zero-shot subject-driven synthesis.
arXiv Detail & Related papers (2024-01-24T11:36:44Z)
- Emu: Generative Pretraining in Multimodality [43.759593451544546]
Emu is a Transformer-based multimodal foundation model that can seamlessly generate images and text in a multimodal context.
Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks.
Emu demonstrates superb performance compared to state-of-the-art large multimodal models.
arXiv Detail & Related papers (2023-07-11T12:45:39Z)
- ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training [40.05046655477684]
ERNIE-ViL 2.0 is a Multi-View Contrastive learning framework to build intra-modal and inter-modal correlations between diverse views simultaneously.
We construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs.
ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval.
arXiv Detail & Related papers (2022-09-30T07:20:07Z)
- Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
- A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
arXiv Detail & Related papers (2020-07-17T04:06:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.