i-Code V2: An Autoregressive Generation Framework over Vision, Language,
and Speech Data
- URL: http://arxiv.org/abs/2305.12311v1
- Date: Sun, 21 May 2023 01:25:44 GMT
- Title: i-Code V2: An Autoregressive Generation Framework over Vision, Language,
and Speech Data
- Authors: Ziyi Yang, Mahmoud Khademi, Yichong Xu, Reid Pryzant, Yuwei Fang,
Chenguang Zhu, Dongdong Chen, Yao Qian, Mei Gao, Yi-Ling Chen, Robert Gmyr,
Naoyuki Kanda, Noel Codella, Bin Xiao, Yu Shi, Lu Yuan, Takuya Yoshioka,
Michael Zeng, Xuedong Huang
- Abstract summary: i-Code V2 is the first model capable of generating natural language from any combination of Vision, Language, and Speech data.
The system is pretrained end-to-end on a large collection of dual- and single-modality datasets.
- Score: 101.52821120195975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The convergence of text, visual, and audio data is a key step towards
human-like artificial intelligence; however, the current Vision-Language-Speech
landscape is dominated by encoder-only models which lack generative abilities.
We propose closing this gap with i-Code V2, the first model capable of
generating natural language from any combination of Vision, Language, and
Speech data. i-Code V2 is an integrative system that leverages state-of-the-art
single-modality encoders, combining their outputs with a new modality-fusing
encoder in order to flexibly project combinations of modalities into a shared
representational space. Next, language tokens are generated from these
representations via an autoregressive decoder. The whole framework is
pretrained end-to-end on a large collection of dual- and single-modality
datasets using a novel text completion objective that can be generalized across
arbitrary combinations of modalities. i-Code V2 matches or outperforms
state-of-the-art single- and dual-modality baselines on 7 multimodal tasks,
demonstrating the power of generative multimodal pretraining across a diversity
of tasks and signals.
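As a reading aid, here is a minimal PyTorch sketch of the pipeline the abstract describes: pretrained single-modality encoders feed a modality-fusing encoder, and an autoregressive decoder generates language tokens under a next-token (text completion) loss. Every module name, dimension, and hyperparameter below is an illustrative assumption, not the paper's actual architecture.

# Minimal sketch of the pipeline described in the abstract.
# All modules, dimensions, and names are illustrative stand-ins:
# the paper's single-modality encoders, fusion encoder, and decoder
# differ in detail.
import torch
import torch.nn as nn

class ICodeV2Sketch(nn.Module):
    def __init__(self, d_model=768, vocab_size=32000, n_fusion_layers=6):
        super().__init__()
        # Stand-ins for pretrained single-modality encoders; in the paper
        # these are state-of-the-art vision/language/speech models.
        self.encoders = nn.ModuleDict({
            "vision": nn.Linear(2048, d_model),   # e.g., pooled image features
            "language": nn.Embedding(vocab_size, d_model),
            "speech": nn.Linear(80, d_model),     # e.g., log-mel frames
        })
        # Modality-fusing encoder: projects any combination of modality
        # streams into one shared representation space.
        fusion_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, n_fusion_layers)
        # Autoregressive decoder that generates language tokens conditioned
        # on the fused multimodal representation.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, inputs: dict, target_tokens: torch.Tensor):
        # Encode whichever modalities are present and concatenate along time,
        # so any subset of {vision, language, speech} can be handled.
        streams = [self.encoders[name](x) for name, x in inputs.items()]
        fused = self.fusion(torch.cat(streams, dim=1))
        # Causal mask so the decoder predicts each token from its prefix only.
        T = target_tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        hidden = self.decoder(self.token_emb(target_tokens), fused, tgt_mask=causal)
        return self.lm_head(hidden)

# Text-completion objective: next-token cross-entropy on the target text,
# computable for any combination of input modalities (here vision + speech).
model = ICodeV2Sketch()
batch = {"vision": torch.randn(2, 10, 2048), "speech": torch.randn(2, 50, 80)}
tokens = torch.randint(0, 32000, (2, 12))
logits = model(batch, tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
)

Because the same next-token loss applies whatever subset of encoders is active, this is one way a single objective can generalize across arbitrary modality combinations, as the abstract claims.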
Related papers
- VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and
Dataset [53.46019570679092]
We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
It achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
arXiv Detail & Related papers (2023-04-17T15:08:15Z)
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)
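The cross-modality contrastive objective named above is often instantiated as a symmetric InfoNCE loss over paired batch embeddings. The sketch below shows that generic form only, not i-Code's exact formulation; all names in it are illustrative assumptions.

# Generic InfoNCE-style cross-modality contrastive loss, shown as one
# common instantiation of the objective named above; i-Code's actual
# formulation may differ. All names here are illustrative.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """emb_a, emb_b: (batch, dim) embeddings of paired inputs from two
    modalities (e.g., a video clip and its transcript). Matching rows are
    positives; all other pairings in the batch serve as negatives."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric loss: retrieve b given a, and a given b.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Example with random stand-in embeddings:
loss = cross_modal_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))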
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training [120.91411454661741]
We present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception and generation.
Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables generation.
arXiv Detail & Related papers (2022-01-11T16:15:07Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)