Evolving Without Ending: Unifying Multimodal Incremental Learning for Continual Panoptic Perception
- URL: http://arxiv.org/abs/2601.15643v1
- Date: Thu, 22 Jan 2026 04:45:28 GMT
- Title: Evolving Without Ending: Unifying Multimodal Incremental Learning for Continual Panoptic Perception
- Authors: Bo Yuan, Danpei Zhao, Wentao Li, Tian Li, Zhiguo Jiang
- Abstract summary: Continual learning (CL) is a great endeavour in developing intelligent perception AI systems. We extend CL to continual panoptic perception (CPP) to enhance comprehensive image perception through pixel-level, instance-level, and image-level joint interpretation. Our proposed model incorporates asymmetric pseudo-labeling, enabling the model to evolve without exemplar replay.
- Score: 17.590466606165094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continual learning (CL) is a great endeavour in developing intelligent perception AI systems. However, pioneering research has predominantly focused on single-task CL, which restricts its potential in multi-task and multimodal scenarios. Beyond the well-known issue of catastrophic forgetting, multi-task CL also introduces semantic obfuscation across multimodal alignment, leading to severe model degradation during incremental training steps. In this paper, we extend CL to continual panoptic perception (CPP), integrating multimodal and multi-task CL to enhance comprehensive image perception through pixel-level, instance-level, and image-level joint interpretation. We formalize the CL task in multimodal scenarios and propose an end-to-end continual panoptic perception model. Concretely, the CPP model features a collaborative cross-modal encoder (CCE) for multimodal embedding. We also propose a malleable knowledge inheritance module via contrastive feature distillation and instance distillation, addressing catastrophic forgetting in a task-interactive boosting manner. Furthermore, we propose a cross-modal consistency constraint and develop CPP+, ensuring multimodal semantic alignment for model updating under multi-task incremental scenarios. Additionally, our proposed model incorporates asymmetric pseudo-labeling, enabling the model to evolve without exemplar replay. Extensive experiments on multimodal datasets and diverse CL tasks demonstrate the superiority of the proposed model, particularly in fine-grained CL tasks.
Related papers
- Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality [59.651410243721045]
CoCoA is a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. We introduce an EOS-based reconstruction task, encouraging the model to reconstruct input from the corresponding <EOS> embeddings. Experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality.
arXiv Detail & Related papers (2026-03-02T05:34:45Z)
- From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z)
- NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z)
- SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model [49.65930977591188]
Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. We introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. Specifically, the content-aware progressive training aims to enhance the model's adaptability to diverse downstream tasks and master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings.
arXiv Detail & Related papers (2025-10-14T16:43:22Z)
- Bridging the Task Gap: Multi-Task Adversarial Transferability in CLIP and Its Derivatives [61.58574200236532]
Adversarial examples generated from fine-grained tasks often exhibit stronger transfer potential than those from coarse-grained tasks. We propose a novel framework, Multi-Task Adversarial CLIP (MT-AdvCLIP), which introduces a task-aware feature aggregation loss and generates perturbations with enhanced cross-task generalization capability.
arXiv Detail & Related papers (2025-09-28T14:46:52Z)
- Progressive Semantic Residual Quantization for Multimodal-Joint Interest Modeling in Music Recommendation [6.790539226766362]
We propose a novel multimodal recommendation framework with two stages. In the first stage, our method generates modal-specific and modal-joint semantic IDs. In the second stage, to model multimodal interest of users, a Multi-Codebook Cross-Attention network is designed.
arXiv Detail & Related papers (2025-08-28T02:16:57Z)
- Advancing Multimodal In-Context Learning in Large Vision-Language Models with Task-aware Demonstrations [0.0]
Multimodal in-context learning (ICL) has emerged as a key capability of Large Vision-Language Models (LVLMs). We shed light on the core mechanism underlying multimodal ICL, identifying task mapping as a crucial factor in configuring robust in-context demonstration sequences. We propose SabER, a lightweight yet powerful decoder-only transformer equipped with task-aware attention.
arXiv Detail & Related papers (2025-03-05T16:33:10Z)
- Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering [53.39158264785098]
Long-term Video Question Answering (VideoQA) is a challenging vision-and-language bridging task.
We present an entirely end-to-end solution for VideoQA: Multi-granularity Contrastive cross-modal collaborative Generation model.
arXiv Detail & Related papers (2024-10-12T06:21:58Z)
- Continual Panoptic Perception: Towards Multi-modal Incremental Interpretation of Remote Sensing Images [16.0258685984844]
Continual learning (CL) breaks off the one-way training manner and enables a model to adapt to new data, semantics and tasks continuously.
We propose a unified continual learning model that leverages multi-task joint learning covering pixel-level classification, instance-level segmentation and image-level perception.
arXiv Detail & Related papers (2024-07-19T12:22:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.