OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
- URL: http://arxiv.org/abs/2407.11895v1
- Date: Tue, 16 Jul 2024 16:24:31 GMT
- Title: OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
- Authors: Zehan Wang, Ziang Zhang, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Hengshuang Zhao, Zhou Zhao
- Abstract summary: We present OmniBind, large-scale multimodal joint representation models ranging in scale from 7 billion to 30 billion parameters.
Due to the scarcity of data pairs across all modalities, instead of training large models from scratch, we propose remapping and binding the spaces of various pre-trained specialist models together.
Experiments demonstrate the versatility and superiority of OmniBind as an omni representation model, highlighting its great potential for diverse applications.
- Score: 67.07083389543799
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, human-computer interaction with various modalities has shown promising applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint representation in understanding and generation pipelines, high-quality omni joint representations would be a step toward co-processing more diverse multimodal information. In this work, we present OmniBind, large-scale multimodal joint representation models ranging in scale from 7 billion to 30 billion parameters, which support 3D, audio, image, and language inputs. Due to the scarcity of data pairs across all modalities, instead of training large models from scratch, we propose remapping and binding the spaces of various pre-trained specialist models together. This approach enables "scaling up" by indirectly increasing the model parameters and the amount of seen data. To effectively integrate various spaces, we dynamically assign weights to different spaces by learning routers with two objectives: cross-modal overall alignment and language representation decoupling. Notably, since binding and routing spaces both only require lightweight networks, OmniBind is extremely training-efficient. Learning the largest 30B model requires merely unpaired unimodal data and approximately 3 days on a single 8-4090 node. Extensive experiments demonstrate the versatility and superiority of OmniBind as an omni representation model, highlighting its great potential for diverse applications, such as any-query and composable multimodal understanding.
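As a rough illustration of the binding-and-routing mechanism described in the abstract, the sketch below remaps several pre-trained specialist embedding spaces into a shared dimension with lightweight projectors and combines them with learned, per-input router weights. The module names, dimensions, and router architecture are illustrative assumptions rather than the paper's actual design, and the two training objectives (cross-modal overall alignment and language representation decoupling) are omitted.

```python
# Minimal sketch (PyTorch) of routing over bound embedding spaces.
# All names, sizes, and architectural choices are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn


class SpaceRouter(nn.Module):
    def __init__(self, source_dims, shared_dim=1024):
        super().__init__()
        # One lightweight projector per pre-trained space (the "remapping").
        self.projectors = nn.ModuleList(
            [nn.Linear(d, shared_dim) for d in source_dims]
        )
        # Router: maps the projected features to one weight per space.
        self.router = nn.Sequential(
            nn.Linear(shared_dim * len(source_dims), 256),
            nn.ReLU(),
            nn.Linear(256, len(source_dims)),
        )

    def forward(self, space_embeddings):
        # space_embeddings: list of (batch, d_i) tensors, one per specialist model.
        projected = [p(e) for p, e in zip(self.projectors, space_embeddings)]
        stacked = torch.stack(projected, dim=1)            # (batch, n_spaces, shared_dim)
        weights = torch.softmax(
            self.router(torch.cat(projected, dim=-1)), dim=-1
        )                                                   # (batch, n_spaces)
        # Weighted combination of the remapped spaces = the bound representation.
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)


# Example: binding three hypothetical specialist spaces (e.g., image, audio, 3D).
model = SpaceRouter(source_dims=[768, 512, 1024])
embeds = [torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 1024)]
joint = model(embeds)                                       # (4, 1024)
```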
Related papers
- Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations [16.036997801745905]
Multimodal learning plays a crucial role in enabling machine learning models to fuse and utilize diverse data sources.
Recent binding methods, such as ImageBind, typically use a fixed anchor modality to align multimodal data in the anchor modal embedding space.
We propose CentroBind, a simple yet powerful approach that eliminates the need for a fixed anchor; an illustrative sketch of this idea appears after this list.
arXiv Detail & Related papers (2024-10-02T23:19:23Z)
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.
Our main findings reveal that most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts.
To address this gap, we curate an instruction tuning dataset of 84.5K training samples, OmniInstruct, for training OLMs to adapt to multimodal contexts.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
- Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens.
We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens.
arXiv Detail & Related papers (2024-05-27T17:59:56Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X.
SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks.
We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z)
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.89786751297348]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities.
We build a multimodal text-centric dataset for multimodal alignment pre-training.
We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z)
- Semi-Supervised Multi-Modal Multi-Instance Multi-Label Deep Network with Optimal Transport [24.930976128926314]
We propose a novel Multi-modal Multi-instance Multi-label Deep Network (M3DN).
M3DN considers M3 learning in an end-to-end multi-modal deep network and utilizes the consistency principle among bag-level predictions from different modalities.
Thereby its semi-supervised extension, M3DNS, can better predict labels and exploit label correlations simultaneously.
arXiv Detail & Related papers (2021-04-17T09:18:28Z)
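The "Anchors Aweigh!" entry above describes replacing a fixed anchor modality with a dynamically determined anchor. The sketch below is one illustrative reading of that idea, taking each batch's anchor as the centroid of the available modality embeddings and applying an InfoNCE-style loss against it; this is an assumption based on the summary, not the paper's exact formulation.

```python
# Minimal sketch of an anchor-free binding step in the spirit of CentroBind:
# the anchor is computed dynamically as the centroid of the modality embeddings
# rather than being one fixed anchor modality. Illustrative assumption only.
import torch
import torch.nn.functional as F


def centroid_anchor_loss(modality_embeds, temperature=0.07):
    # modality_embeds: list of (batch, dim) tensors, one per modality, L2-normalized.
    anchor = F.normalize(torch.stack(modality_embeds, dim=0).mean(dim=0), dim=-1)
    loss = 0.0
    for emb in modality_embeds:
        logits = emb @ anchor.t() / temperature           # (batch, batch) similarities
        targets = torch.arange(emb.size(0))               # positives on the diagonal
        loss = loss + F.cross_entropy(logits, targets)    # InfoNCE against the centroid
    return loss / len(modality_embeds)


# Example with three modalities sharing a 512-d space.
embs = [F.normalize(torch.randn(8, 512), dim=-1) for _ in range(3)]
print(centroid_anchor_loss(embs))
```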
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.