i-Code: An Integrative and Composable Multimodal Learning Framework
- URL: http://arxiv.org/abs/2205.01818v2
- Date: Thu, 5 May 2022 06:35:23 GMT
- Title: i-Code: An Integrative and Composable Multimodal Learning Framework
- Authors: Ziyi Yang, Yuwei Fang, Chenguang Zhu, Reid Pryzant, Dongdong Chen, Yu
Shi, Yichong Xu, Yao Qian, Mei Gao, Yi-Ling Chen, Liyang Lu, Yujia Xie,
Robert Gmyr, Noel Codella, Naoyuki Kanda, Bin Xiao, Lu Yuan, Takuya Yoshioka,
Michael Zeng, Xuedong Huang
- Abstract summary: i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
- Score: 99.56065789066027
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human intelligence is multimodal; we integrate visual, linguistic, and
acoustic signals to maintain a holistic worldview. Most current pretraining
methods, however, are limited to one or two modalities. We present i-Code, a
self-supervised pretraining framework where users may flexibly combine the
modalities of vision, speech, and language into unified and general-purpose
vector representations. In this framework, data from each modality are first
given to pretrained single-modality encoders. The encoder outputs are then
integrated with a multimodal fusion network, which uses novel attention
mechanisms and other architectural innovations to effectively combine
information from the different modalities. The entire system is pretrained
end-to-end with new objectives including masked modality unit modeling and
cross-modality contrastive learning. Unlike previous research using only video
for pretraining, the i-Code framework can dynamically process single, dual, and
triple-modality data during training and inference, flexibly projecting
different combinations of modalities into a single representation space.
Experimental results demonstrate how i-Code can outperform state-of-the-art
techniques on five video understanding tasks and the GLUE NLP benchmark,
improving by as much as 11% and demonstrating the power of integrative
multimodal pretraining.
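The abstract describes a pipeline of pretrained single-modality encoders feeding an attention-based fusion network, trained with objectives that include cross-modality contrastive learning. The sketch below is a minimal illustration of that recipe, not the authors' implementation: the encoder outputs are stand-in tensors, and the fusion layer, dimensions, and temperature are assumptions.

```python
# Hypothetical sketch of an i-Code-style pipeline: cross-attention fusion over
# single-modality encoder outputs plus a cross-modality contrastive (InfoNCE) loss.
# Module names, dimensions, and the temperature are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    """One fusion step: a query sequence attends to a context sequence."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq: torch.Tensor, context_seq: torch.Tensor) -> torch.Tensor:
        # Cross-attention with a residual connection and layer norm.
        fused, _ = self.attn(query_seq, context_seq, context_seq)
        return self.norm(query_seq + fused)


def cross_modality_contrastive_loss(z_a: torch.Tensor,
                                    z_b: torch.Tensor,
                                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between pooled embeddings of two modalities.

    z_a, z_b: (batch, dim) pooled representations; matching rows are positives.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                  # (batch, batch) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Toy usage with stand-in encoder outputs for vision and language.
batch, v_len, l_len, dim = 4, 50, 32, 768
vision_seq = torch.randn(batch, v_len, dim)    # e.g., frame/patch features
language_seq = torch.randn(batch, l_len, dim)  # e.g., token features

fusion = CrossModalFusion(dim)
fused_language = fusion(language_seq, vision_seq)  # language attends to vision

loss = cross_modality_contrastive_loss(vision_seq.mean(dim=1),
                                        fused_language.mean(dim=1))
loss.backward()
```

In the actual framework the fusion network also handles single-, dual-, and triple-modality inputs and is additionally trained with masked modality unit modeling; those parts are omitted here for brevity.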
Related papers
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- S3: A Simple Strong Sample-effective Multimodal Dialog System [61.31055673156622]
We present S3, a conceptually simple yet powerful baseline for the multimodal dialog task that achieves near state-of-the-art results.
The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector (see the sketch after this list).
arXiv Detail & Related papers (2024-06-26T12:45:43Z)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
arXiv Detail & Related papers (2023-12-28T17:57:06Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data [101.52821120195975]
i-Code V2 is the first model capable of generating natural language from any combination of Vision, Language, and Speech data.
The system is pretrained end-to-end on a large collection of dual- and single-modality datasets.
arXiv Detail & Related papers (2023-05-21T01:25:44Z)
- FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction [43.17713130538514]
We introduce a centralized graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss.
FormNetV2 establishes new state-of-the-art performance on FUNSD, CORD, SROIE and Payment benchmarks with a more compact model size.
arXiv Detail & Related papers (2023-05-04T05:02:04Z)
- Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information [77.80071279597665]
We propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training).
Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation.
arXiv Detail & Related papers (2022-11-17T18:59:49Z)
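The S3 entry above describes a recipe that recurs across several of these papers: a pre-trained large language model, pre-trained modality encoders, and a small trainable modality projector. The sketch below is a generic illustration of that pattern, not code from any of the listed papers; the projector architecture and all dimensions are assumptions.

```python
# Hypothetical sketch: a trainable projector bridging frozen encoder features
# into an LLM's token-embedding space, so projected features can be prepended
# to the text sequence and passed to a decoder-only LLM as inputs_embeds.
import torch
import torch.nn as nn


class ModalityProjector(nn.Module):
    """Maps encoder features (image/audio) to the LLM embedding dimension."""

    def __init__(self, encoder_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_patches_or_frames, encoder_dim)
        return self.proj(features)


# Toy usage: splice projected image features in front of text embeddings.
batch, num_patches, encoder_dim, llm_dim, text_len = 2, 256, 1024, 4096, 16
image_features = torch.randn(batch, num_patches, encoder_dim)  # frozen encoder output
text_embeddings = torch.randn(batch, text_len, llm_dim)        # LLM token embeddings

projector = ModalityProjector(encoder_dim, llm_dim)
multimodal_inputs = torch.cat([projector(image_features), text_embeddings], dim=1)
print(multimodal_inputs.shape)  # torch.Size([2, 272, 4096])
```

Typically only the projector (and sometimes the LLM) is trained, which is what makes this family of systems sample-efficient relative to end-to-end multimodal pretraining.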
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.