AllSpark: A Multimodal Spatio-Temporal General Intelligence Model with
Thirteen Modalities
- URL: http://arxiv.org/abs/2401.00546v2
- Date: Sun, 25 Feb 2024 09:18:20 GMT
- Title: AllSpark: A Multimodal Spatio-Temporal General Intelligence Model with
Thirteen Modalities
- Authors: Run Shao, Cheng Yang, Qiujun Li, Qing Zhu, Yongjun Zhang, YanSheng Li,
Yu Liu, Yong Tang, Dapeng Liu, Shizhong Yang, Haifeng Li
- Abstract summary: We introduce the Language as Reference Framework (LaRF), a fundamental principle for constructing a multimodal unified model.
We propose a multimodal spatiotemporal general artificial intelligence model, called AllSpark.
Our model integrates thirteen different modalities into a unified framework, including 1D (text, code), 2D (RGB, infrared, SAR, multispectral, hyperspectral, tables, graphs, trajectory, oblique photography), and 3D (point clouds, videos) modalities.
- Score: 21.735142277761383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For a long time, due to the high heterogeneity in structure and semantics
among various spatiotemporal modal data, the joint interpretation of multimodal
spatiotemporal data has been an extremely challenging problem. The primary
challenge resides in striking a trade-off between the cohesion and autonomy of
diverse modalities, and this trade-off exhibits a progressively nonlinear
nature as the number of modalities expands. We introduce the Language as
Reference Framework (LaRF), a fundamental principle for constructing a
multimodal unified model, aiming to strike a trade-off between the cohesion and
autonomy among different modalities. We propose a multimodal spatiotemporal
general artificial intelligence model, called AllSpark. Our model integrates
thirteen different modalities into a unified framework, including 1D (text,
code), 2D (RGB, infrared, SAR, multispectral, hyperspectral, tables, graphs,
trajectory, oblique photography), and 3D (point clouds, videos) modalities. To
achieve modal cohesion, AllSpark uniformly maps diverse modal features to the
language modality. In addition, we design modality-specific prompts to guide
multi-modal large language models in accurately perceiving multimodal data. To
maintain modality autonomy, AllSpark introduces modality-specific encoders to
extract the tokens of various spatiotemporal modalities, and a modal bridge is
employed to achieve dimensional projection from each modality to the language
modality. Finally, observing a gap between the model's interpretation and
downstream tasks, we design task heads to enhance the model's generalization
capability on specific downstream tasks. Experiments indicate that AllSpark
achieves competitive accuracy in modalities such as RGB and trajectory compared
to state-of-the-art models.
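As a rough illustration of the pipeline described in the abstract, the PyTorch sketch below wires together a modality-specific encoder, a modal bridge that projects modality tokens into the language embedding space, learnable modality-specific prompt tokens, a language backbone, and a task head. All class names and shapes are illustrative, and the assumption that the backbone consumes embedded token sequences directly is ours, not taken from the paper.

```python
import torch
import torch.nn as nn

class ModalBridge(nn.Module):
    """Illustrative modal bridge: projects modality tokens into the
    language model's embedding space (dimensional projection)."""
    def __init__(self, modality_dim: int, language_dim: int):
        super().__init__()
        self.proj = nn.Linear(modality_dim, language_dim)

    def forward(self, modality_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_tokens, modality_dim) -> (batch, num_tokens, language_dim)
        return self.proj(modality_tokens)

class AllSparkSketch(nn.Module):
    """Hypothetical pipeline: modality-specific encoder -> modal bridge
    -> modality-specific prompt + language backbone -> task head."""
    def __init__(self, encoder: nn.Module, bridge: ModalBridge,
                 language_model: nn.Module, task_head: nn.Module,
                 num_prompt_tokens: int, language_dim: int):
        super().__init__()
        self.encoder = encoder                # per-modality encoder (modality autonomy)
        self.bridge = bridge                  # modality -> language-space projection
        self.language_model = language_model  # assumed to accept embedded token sequences
        self.task_head = task_head            # task-specific output head
        # learnable modality-specific prompt tokens
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, language_dim))

    def forward(self, modal_input: torch.Tensor) -> torch.Tensor:
        tokens = self.encoder(modal_input)                  # modality-specific tokens
        lang_tokens = self.bridge(tokens)                   # modal cohesion via projection
        prompt = self.prompt.unsqueeze(0).expand(lang_tokens.size(0), -1, -1)
        sequence = torch.cat([prompt, lang_tokens], dim=1)  # prompt-guided input
        hidden = self.language_model(sequence)              # unified interpretation
        return self.task_head(hidden)                       # downstream-task prediction
```

In the actual model, each of the thirteen modalities would pair its own encoder, bridge, and prompt with a shared language backbone; the sketch shows a single modality for brevity.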
Related papers
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities [8.517830626176641]
Any2Seg is a novel framework that can achieve robust segmentation from any combination of modalities in any visual conditions.
Experiments on two benchmarks with four modalities demonstrate that Any2Seg achieves the state-of-the-art under the multi-modal setting.
arXiv Detail & Related papers (2024-07-16T03:34:38Z)
- Towards a Generalist and Blind RGB-X Tracker [91.36268768952755]
We develop a single model tracker that can remain blind to any modality X during inference time.
Our training process is extremely simple, integrating a multi-label classification loss with a routing function.
Our generalist and blind tracker can achieve competitive performance compared to well-established modal-specific models.
arXiv Detail & Related papers (2024-05-28T03:00:58Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- All in One Framework for Multimodal Re-identification in the Wild [58.380708329455466]
A multimodal learning paradigm for ReID is introduced, referred to as All-in-One (AIO).
AIO harnesses a frozen pre-trained big model as an encoder, enabling effective multimodal retrieval without additional fine-tuning.
Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts.
arXiv Detail & Related papers (2024-05-08T01:04:36Z)
- MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts [92.76662894585809]
We introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE).
MMoE can be applied to various types of models to yield improvements.
arXiv Detail & Related papers (2023-11-16T05:31:21Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization [13.456240733175767]
SimMMDG is a framework to overcome the challenges of achieving domain generalization in multi-modal scenarios.
We employ supervised contrastive learning on the modality-shared features to ensure they possess joint properties, and impose distance constraints (see the sketch after this list).
Our framework is theoretically well-supported and achieves strong performance in multi-modal DG on the EPIC-Kitchens dataset and the novel Human-Animal-Cartoon dataset.
arXiv Detail & Related papers (2023-10-30T17:58:09Z)
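The SimMMDG entry above combines two training signals: supervised contrastive learning on modality-shared features and a distance constraint. The snippet below is a generic sketch of such a loss combination, not the paper's implementation; applying the distance term between shared and modality-specific features, the margin, and the temperature are all assumptions.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(shared: torch.Tensor, labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Generic supervised contrastive loss over modality-shared features.
    shared: (batch, dim) embeddings; labels: (batch,) class ids."""
    shared = F.normalize(shared, dim=1)
    sim = shared @ shared.t() / temperature                  # pairwise cosine similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=sim.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, -1e9)                   # drop self-pairs from the softmax
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    return -(log_prob * pos_mask.float()).sum(dim=1).div(pos_count).mean()

def distance_constraint(shared: torch.Tensor, specific: torch.Tensor,
                        margin: float = 1.0) -> torch.Tensor:
    """Assumed form of the distance constraint: keep modality-specific
    features at least `margin` away from the shared ones."""
    dist = (shared - specific).pow(2).sum(dim=1).sqrt()
    return F.relu(margin - dist).mean()
```

In a training loop these two terms would be weighted by hyperparameters and added to the usual task loss.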