Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular
Vision-Language Pre-training
- URL: http://arxiv.org/abs/2201.04026v1
- Date: Tue, 11 Jan 2022 16:15:07 GMT
- Title: Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular
Vision-Language Pre-training
- Authors: Yehao Li and Jiahao Fan and Yingwei Pan and Ting Yao and Weiyao Lin
and Tao Mei
- Abstract summary: We present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception and generation.
Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables multi-modal reasoning and sentence generation.
- Score: 120.91411454661741
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language pre-training has been an emerging and fast-developing
research topic, which transfers multi-modal knowledge from rich-resource
pre-training tasks to limited-resource downstream tasks. Unlike existing works
that predominantly learn a single generic encoder, we present a pre-trainable
Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language
perception (e.g., visual question answering) and generation (e.g., image
captioning). Uni-EDEN is a two-stream Transformer-based structure consisting
of three modules: object and sentence encoders that separately learn the
representations of each modality, and a sentence decoder that enables both
multi-modal reasoning and sentence generation via inter-modal interaction.
Considering that the linguistic representations of each image can span a
hierarchy of granularities (from simple to comprehensive: an individual label, a
phrase, and a natural sentence), we pre-train
Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object
Classification (MOC), Masked Region Phrase Generation (MRPG), Image-Sentence
Matching (ISM), and Masked Sentence Generation (MSG). In this way, Uni-EDEN is
endowed with the power of both multi-modal representation extraction and
language modeling. Extensive experiments demonstrate the compelling
generalizability of Uni-EDEN by fine-tuning it to four vision-language
perception and generation downstream tasks.
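
To make the architecture described above concrete, below is a minimal, hypothetical PyTorch sketch of the three-module layout (object encoder, sentence encoder, sentence decoder) with two of the proxy-task heads (MOC and MSG). It is based only on the abstract: the class name, hidden size, layer counts, and head shapes are illustrative assumptions, not the authors' released implementation, and the ISM and MRPG heads are omitted for brevity.

```python
# Hypothetical sketch of the Uni-EDEN layout described in the abstract.
# Module names, dimensions, and layer counts are illustrative assumptions,
# not the authors' released implementation.
import torch
import torch.nn as nn


class UniEDENSketch(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, n_heads=12,
                 n_layers=6, n_object_classes=1600):
        super().__init__()
        # Object encoder: self-attention over detected image-region features.
        self.object_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Sentence encoder: self-attention over word embeddings.
        self.word_embed = nn.Embedding(vocab_size, d_model)
        self.sentence_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Sentence decoder: cross-attends from sentence states to object
        # features, enabling multi-modal reasoning and sentence generation.
        self.sentence_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Proxy-task heads (ISM and MRPG heads omitted for brevity).
        self.moc_head = nn.Linear(d_model, n_object_classes)  # Masked Object Classification
        self.msg_head = nn.Linear(d_model, vocab_size)        # Masked Sentence Generation

    def forward(self, region_feats, token_ids):
        # region_feats: (B, R, d_model) pre-extracted object/region features
        # token_ids:    (B, T) sentence token ids
        obj_ctx = self.object_encoder(region_feats)
        sent_ctx = self.sentence_encoder(self.word_embed(token_ids))
        # Inter-modal interaction in the decoder.
        fused = self.sentence_decoder(sent_ctx, obj_ctx)
        return self.moc_head(obj_ctx), self.msg_head(fused)
```

In practice the region features would come from a pre-trained object detector; here they are simply assumed to be pre-extracted tensors.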
Related papers
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - i-Code V2: An Autoregressive Generation Framework over Vision, Language,
and Speech Data [101.52821120195975]
i-Code V2 is the first model capable of generating natural language from any combination of Vision, Language, and Speech data.
The system is pretrained end-to-end on a large collection of dual- and single-modality datasets.
arXiv Detail & Related papers (2023-05-21T01:25:44Z) - Language Is Not All You Need: Aligning Perception with Language Models [110.51362453720458]
We introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions.
We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data.
Experimental results show that Kosmos-1 achieves impressive performance on language understanding, generation, and even OCR-free NLP.
We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal, and from multimodal to language.
arXiv Detail & Related papers (2023-02-27T18:55:27Z) - ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for
Understanding and Generation [40.376625939658354]
ERNIE-UniX2 is a unified cross-lingual pre-training framework for both generation and understanding tasks.
ERNIE-UniX2 integrates multiple pre-training paradigms based on encoder-decoder architecture.
ERNIE-UniX2 can be seamlessly fine-tuned for a variety of generation and understanding downstream tasks.
arXiv Detail & Related papers (2022-11-09T13:06:58Z) - Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks [39.12025963907317]
Unified-IO is a single model that performs a large variety of AI tasks, including classical computer vision tasks.
We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens.
Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark.
arXiv Detail & Related papers (2022-06-17T17:53:47Z) - UFO: A UniFied TransfOrmer for Vision-Language Representation Learning [54.82482779792115]
We propose a single UniFied transfOrmer (UFO) capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question) for vision-language (VL) representation learning.
Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks.
arXiv Detail & Related papers (2021-11-19T03:23:10Z) - VX2TEXT: End-to-End Learning of Video-Based Text Generation From
Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.