OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation
- URL: http://arxiv.org/abs/2509.15096v1
- Date: Thu, 18 Sep 2025 15:52:44 GMT
- Title: OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation
- Authors: Bo-Wen Yin, Jiao-Long Cao, Xuying Zhang, Yuming Chen, Ming-Ming Cheng, Qibin Hou
- Abstract summary: We propose a novel multi-modal learning framework, termed OmniSegmentor. Based on ImageNet, we assemble a large-scale dataset for multi-modal pretraining, called ImageNeXt. We introduce a universal multi-modal pretraining framework that consistently amplifies the model's perceptual capabilities across various scenarios.
- Score: 74.55725909072903
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research on representation learning has proved the merits of multi-modal cues for robust semantic segmentation. Nevertheless, a flexible pretrain-and-finetune pipeline for multiple visual modalities remains unexplored. In this paper, we propose a novel multi-modal learning framework, termed OmniSegmentor. It has two key innovations: 1) Based on ImageNet, we assemble a large-scale dataset for multi-modal pretraining, called ImageNeXt, which contains five popular visual modalities. 2) We provide an efficient pretraining scheme that endows the model with the capacity to encode information from different modalities in ImageNeXt. For the first time, we introduce a universal multi-modal pretraining framework that consistently amplifies the model's perceptual capabilities across various scenarios, regardless of which combination of modalities is involved. Remarkably, our OmniSegmentor achieves new state-of-the-art records on a wide range of multi-modal semantic segmentation datasets, including NYU Depthv2, EventScape, MFNet, DeLiVER, SUNRGBD, and KITTI-360.
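The abstract does not spell out the encoder design, but the core idea of encoding an arbitrary subset of modalities with one model can be illustrated with a small sketch. Below is a minimal PyTorch-style example: per-modality input stems feed a shared backbone, and the forward pass accepts whichever modalities happen to be present. The modality list, channel counts, and fusion-by-averaging rule are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of a flexible multi-modal encoder in the spirit of the
# pretrain-and-finetune pipeline described above. All module names, the
# modality set, and the fusion rule are assumptions for illustration.
import torch
import torch.nn as nn

MODALITIES = ["rgb", "depth", "event", "lidar", "thermal"]  # assumed set of five

class FlexibleMultiModalEncoder(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 40):
        super().__init__()
        # One lightweight stem per modality maps raw input to a shared feature space
        # (single-channel inputs for non-RGB modalities is an assumption).
        self.stems = nn.ModuleDict({
            m: nn.Conv2d(3 if m == "rgb" else 1, dim, kernel_size=4, stride=4)
            for m in MODALITIES
        })
        # A shared backbone consumes fused features regardless of which modalities are present.
        self.backbone = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
        )
        self.head = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, inputs: dict) -> torch.Tensor:
        # Accept any subset of modalities; fuse by averaging the stem outputs.
        feats = [self.stems[m](x) for m, x in inputs.items()]
        fused = torch.stack(feats, dim=0).mean(dim=0)
        return self.head(self.backbone(fused))

if __name__ == "__main__":
    model = FlexibleMultiModalEncoder()
    batch = {"rgb": torch.randn(2, 3, 64, 64), "depth": torch.randn(2, 1, 64, 64)}
    print(model(batch).shape)  # torch.Size([2, 40, 16, 16])
```

Because every modality shares the same downstream backbone, the same weights can be finetuned on datasets that provide different modality combinations, which is the property the abstract emphasizes.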
Related papers
- FusionSAM: Visual Multi-Modal Learning with Segment Anything [37.61598617788102]
We introduce the Segment Anything Model (SAM) into multimodal image segmentation for the first time. We propose a novel framework that combines Latent Space Token Generation (LSTG) and Fusion Mask Prompting (FMP) modules. Our method significantly outperforms SAM and SAM2 in multimodal autonomous driving scenarios.
arXiv Detail & Related papers (2024-08-26T02:20:55Z)
- Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities [8.517830626176641]
Any2Seg is a novel framework that can achieve robust segmentation from any combination of modalities in any visual conditions.
Experiments on two benchmarks with four modalities demonstrate that Any2Seg achieves state-of-the-art results under the multi-modal setting.
arXiv Detail & Related papers (2024-07-16T03:34:38Z)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X. SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks. We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z)
- Noise-powered Multi-modal Knowledge Graph Representation Framework [52.95468915728721]
The rise of Multi-modal Pre-training highlights the necessity for a unified Multi-Modal Knowledge Graph representation learning framework. We propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking. Our approach achieves SOTA performance across a total of ten datasets, demonstrating its versatility.
arXiv Detail & Related papers (2024-03-11T15:48:43Z)
- Semantic-SAM: Segment and Recognize Anything at Any Granularity [83.64686655044765]
We introduce Semantic-SAM, a universal image segmentation model that can segment and recognize anything at any desired granularity.
We consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts.
For the multi-granularity capability, we propose a multi-choice learning scheme during training, enabling each click to generate masks at multiple levels.
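The "multi-choice learning scheme" is only named in the summary above, so the following is a hedged sketch of one plausible reading: each click yields several candidate masks, which are matched one-to-one to whatever ground-truth granularities are available so that each granularity supervises one prediction. The assignment-based matching, the BCE cost, and all names are assumptions for illustration, not the paper's recipe.

```python
# Hedged sketch of a multi-choice training step: one click produces K candidate
# masks, matched against the G ground-truth masks that exist at different
# granularities (e.g. whole object vs. part). Requires scipy for the matching.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def multi_choice_loss(pred_masks: torch.Tensor, gt_masks: torch.Tensor) -> torch.Tensor:
    """pred_masks: (K, H, W) logits; gt_masks: (G, H, W) binary, with G <= K."""
    K, G = pred_masks.shape[0], gt_masks.shape[0]
    # Pairwise BCE cost between every candidate mask and every granularity level.
    cost = torch.stack([
        torch.stack([
            F.binary_cross_entropy_with_logits(pred_masks[k], gt_masks[g].float())
            for g in range(G)
        ])
        for k in range(K)
    ])  # shape (K, G)
    # One-to-one assignment: each available granularity supervises one candidate.
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return cost[torch.as_tensor(rows), torch.as_tensor(cols)].mean()

if __name__ == "__main__":
    preds = torch.randn(6, 32, 32, requires_grad=True)  # K = 6 candidates per click
    gts = torch.rand(3, 32, 32) > 0.5                    # G = 3 granularities
    loss = multi_choice_loss(preds, gts)
    loss.backward()
    print(float(loss))
```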
arXiv Detail & Related papers (2023-07-10T17:59:40Z)
- ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities [71.15303690248021]
We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities.
The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs.
With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities.
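The three components named above (modality adapters, shared self-attention layers, and modality FFNs) suggest a transformer block whose attention weights are shared across modalities while each modality keeps its own feed-forward network. The sketch below is a rough illustration under that assumption; layer sizes, the modality set, and the adapter details are not taken from the paper.

```python
# Illustrative sketch of the component split named above: shared self-attention,
# with a separate feed-forward network (FFN) per modality. Modality adapters
# (which would tokenize raw inputs) are not shown. Sizes are assumptions.
import torch
import torch.nn as nn

class SharedAttentionBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8,
                 modalities=("vision", "audio", "language")):
        super().__init__()
        # Attention weights are shared by all modalities.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Each modality routes through its own FFN.
        self.ffns = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for m in modalities
        })

    def forward(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        h = self.norm1(tokens)
        attn_out, _ = self.attn(h, h, h)
        tokens = tokens + attn_out
        return tokens + self.ffns[modality](self.norm2(tokens))

if __name__ == "__main__":
    block = SharedAttentionBlock()
    vision_tokens = torch.randn(2, 49, 512)   # e.g. 7x7 image patches
    audio_tokens = torch.randn(2, 100, 512)   # e.g. audio frames
    print(block(vision_tokens, "vision").shape, block(audio_tokens, "audio").shape)
```

Adding a new modality under this layout only requires a new adapter and FFN, which is one way to read the "expand to unlimited modalities" claim.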
arXiv Detail & Related papers (2023-05-18T17:59:06Z)
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
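Of the two pretraining objectives named above, cross-modality contrastive learning is the easier one to sketch. Below is a hedged, minimal version using a symmetric InfoNCE-style loss between pooled embeddings of two modalities; the temperature, pooling, and pairing convention are standard choices assumed here rather than the paper's exact recipe.

```python
# Hedged sketch of a cross-modality contrastive objective: paired samples from
# two modalities (e.g. vision and speech) should embed close together, unpaired
# ones far apart. Symmetric InfoNCE form and temperature are assumptions.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """z_a, z_b: (batch, dim) pooled embeddings; row i of each forms a positive pair."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)    # diagonal entries are positives
    # Symmetric loss: match a->b and b->a against the paired indices.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    vision = torch.randn(8, 256)
    speech = torch.randn(8, 256)
    print(float(cross_modal_contrastive_loss(vision, speech)))
```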
arXiv Detail & Related papers (2022-05-03T23:38:50Z)