ONE-PEACE: Exploring One General Representation Model Toward Unlimited
Modalities
- URL: http://arxiv.org/abs/2305.11172v1
- Date: Thu, 18 May 2023 17:59:06 GMT
- Title: ONE-PEACE: Exploring One General Representation Model Toward Unlimited
Modalities
- Authors: Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren
Zhou, Xinggang Wang, Chang Zhou
- Abstract summary: We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities.
The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs.
With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities.
- Score: 71.15303690248021
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this work, we explore a scalable way for building a general representation
model toward unlimited modalities. We release ONE-PEACE, a highly extensible
model with 4B parameters that can seamlessly align and integrate
representations across vision, audio, and language modalities. The architecture
of ONE-PEACE comprises modality adapters, shared self-attention layers, and
modality FFNs. This design allows for the easy extension of new modalities by
adding adapters and FFNs, while also enabling multi-modal fusion through
self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic
pretraining tasks, cross-modal aligning contrast and intra-modal denoising
contrast, which align the semantic space of different modalities and capture
fine-grained details within modalities concurrently. With the scaling-friendly
architecture and pretraining tasks, ONE-PEACE has the potential to expand to
unlimited modalities. Without using any vision or language pretrained model for
initialization, ONE-PEACE achieves leading results on a wide range of uni-modal
and multi-modal tasks, including image classification (ImageNet), semantic
segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio
classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA),
image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g).
Code is available at https://github.com/OFA-Sys/ONE-PEACE.
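To make the design described in the abstract concrete, below is a minimal PyTorch sketch of the idea: per-modality adapters and FFNs wrapped around self-attention layers that are shared by all modalities, plus an InfoNCE-style loss standing in for the cross-modal contrastive alignment objective. Every module name, dimension, and the loss formulation here is an illustrative assumption rather than the paper's implementation; the official code at https://github.com/OFA-Sys/ONE-PEACE is the reference.
```python
# Hypothetical sketch of a shared-attention architecture with modality adapters
# and modality-specific FFNs. Sizes, vocab, and losses are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedAttentionBlock(nn.Module):
    """One Transformer block: self-attention shared across modalities, FFN per modality."""

    def __init__(self, dim, num_heads, modalities):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # A separate FFN per modality; adding a modality only adds a new entry here.
        self.ffns = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for m in modalities
        })

    def forward(self, x, modality):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # shared self-attention
        x = x + self.ffns[modality](self.norm2(x))          # modality-specific FFN
        return x


class ToyOnePeace(nn.Module):
    """Modality adapter -> stack of shared-attention blocks -> pooled, unit-norm embedding."""

    def __init__(self, dim=256, depth=4, num_heads=8):
        super().__init__()
        modalities = ["vision", "audio", "language"]
        # Adapters map raw per-modality inputs into the shared width `dim`.
        self.adapters = nn.ModuleDict({
            "vision": nn.Linear(768, dim),       # e.g. patch features
            "audio": nn.Linear(128, dim),        # e.g. mel-filterbank frames
            "language": nn.Embedding(30000, dim),  # token ids
        })
        self.blocks = nn.ModuleList(
            [SharedAttentionBlock(dim, num_heads, modalities) for _ in range(depth)]
        )

    def forward(self, inputs, modality):
        x = self.adapters[modality](inputs)
        for block in self.blocks:
            x = block(x, modality)
        return F.normalize(x.mean(dim=1), dim=-1)


def cross_modal_contrastive_loss(a, b, tau=0.07):
    """InfoNCE-style alignment between paired embeddings of two modalities."""
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    model = ToyOnePeace()
    img = model(torch.randn(2, 49, 768), "vision")              # 2 images, 49 patch features each
    txt = model(torch.randint(0, 30000, (2, 16)), "language")   # 2 sequences of 16 tokens
    print(cross_modal_contrastive_loss(img, txt).item())
```
The sketch illustrates why extension is cheap in this design: a new modality only requires a new adapter and a new FFN entry, while the shared attention layers fuse all modalities. The paper's second objective, intra-modal denoising contrast, is not shown here.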
Related papers
- From Unimodal to Multimodal: Scaling up Projectors to Align Modalities [16.733970553781887]
We propose a novel approach that aligns vision and language modalities using only projection layers on pretrained, frozen unimodal encoders.
Our method exploits the high semantic similarity between embedding spaces of well-trained vision and language models.
It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple projectors.
arXiv Detail & Related papers (2024-09-28T17:57:32Z)
- Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities [8.517830626176641]
Any2Seg is a novel framework that achieves robust segmentation from any combination of modalities under any visual conditions.
Experiments on two benchmarks with four modalities demonstrate that Any2Seg achieves state-of-the-art results in the multi-modal setting.
arXiv Detail & Related papers (2024-07-16T03:34:38Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with a modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration while disentangling different modality modules to deal with modality entanglement.
Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
- ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training [40.05046655477684]
ERNIE-ViL 2.0 is a multi-view contrastive learning framework that builds intra-modal and inter-modal correlations between diverse views simultaneously.
We construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs.
ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval.
arXiv Detail & Related papers (2022-09-30T07:20:07Z)
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks [87.6494641931349]
We introduce BEiT-3, a general-purpose multimodal foundation model.
It achieves state-of-the-art transfer performance on both vision and vision-language tasks.
arXiv Detail & Related papers (2022-08-22T16:55:04Z)
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)
- Fashionformer: A simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition [80.74495836502919]
In this work, we focus on joint human fashion segmentation and attribute recognition.
We introduce the object query for segmentation and the attribute query for attribute prediction.
For the attribute stream, we design a novel Multi-Layer Rendering module to explore more fine-grained features.
arXiv Detail & Related papers (2022-04-10T11:11:10Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.