CMIP-CIL: A Cross-Modal Benchmark for Image-Point Class Incremental Learning
- URL: http://arxiv.org/abs/2504.08422v1
- Date: Fri, 11 Apr 2025 10:28:29 GMT
- Title: CMIP-CIL: A Cross-Modal Benchmark for Image-Point Class Incremental Learning
- Authors: Chao Qi, Jianqin Yin, Ren Zhang
- Abstract summary: Image-point class incremental learning helps the 3D-points-vision robots continually learn category knowledge from 2D images. We first explore this cross-modal task, proposing a benchmark CMIP-CIL and relieving the cross-modal catastrophic forgetting problem. It employs masked point clouds and rendered multi-view images within a contrastive learning framework in pre-training, empowering the vision model with the generalizations of image-point correspondence.
- Score: 10.936166435599572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-point class incremental learning helps the 3D-points-vision robots continually learn category knowledge from 2D images, improving their perceptual capability in dynamic environments. However, some incremental learning methods address unimodal forgetting but fail in cross-modal cases, while others handle modal differences within training/testing datasets but assume no modal gaps between them. We first explore this cross-modal task, proposing a benchmark CMIP-CIL and relieving the cross-modal catastrophic forgetting problem. It employs masked point clouds and rendered multi-view images within a contrastive learning framework in pre-training, empowering the vision model with the generalizations of image-point correspondence. In the incremental stage, by freezing the backbone and promoting object representations close to their respective prototypes, the model effectively retains and generalizes knowledge across previously seen categories while continuing to learn new ones. We conduct comprehensive experiments on the benchmark datasets. Experiments prove that our method achieves state-of-the-art results, outperforming the baseline methods by a large margin.
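The incremental stage described in the abstract (frozen backbone, representations kept close to class prototypes) can be illustrated with a nearest-prototype classifier. This is only a minimal sketch, not the authors' method: it assumes image and point-cloud features already live in a shared embedding space produced by a frozen backbone, and it replaces the paper's training objective with a plain class-mean rule. The class `PrototypeClassifier`, its methods, and the toy 2-D feature values are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


class PrototypeClassifier:
    """Hypothetical nearest-prototype classifier over frozen-backbone features."""

    def __init__(self):
        self.prototypes = {}  # class label -> mean feature vector

    def add_classes(self, features, labels):
        """Register the classes of one incremental step.

        Prototypes of earlier classes are never modified here, which is
        what keeps old knowledge intact in this simplified setting.
        """
        grouped = defaultdict(list)
        for feat, label in zip(features, labels):
            grouped[label].append(feat)
        for label, feats in grouped.items():
            dim = len(feats[0])
            self.prototypes[label] = [
                sum(f[i] for f in feats) / len(feats) for i in range(dim)
            ]

    def predict(self, feature):
        """Return the label whose prototype is most similar to the query."""
        return max(self.prototypes, key=lambda y: cosine(feature, self.prototypes[y]))


clf = PrototypeClassifier()

# Step 1: toy "image" features for two classes (made-up values, not real data).
clf.add_classes(
    [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]],
    ["chair", "chair", "table", "table"],
)
chair_before = list(clf.prototypes["chair"])

# Step 2: a new class arrives; earlier prototypes are left untouched.
clf.add_classes([[-1.0, 0.0], [-0.9, -0.1]], ["lamp", "lamp"])

# Test time: toy "point-cloud" features, assumed to share the embedding space.
print(clf.predict([0.8, 0.2]))
print(clf.predict([0.2, 0.9]))
print(clf.predict([-0.7, 0.1]))
```

Because each step only adds entries to `prototypes`, classes learned earlier are scored exactly as before when new ones arrive. The sketch does not model the modality gap itself; in the setting the abstract describes, the contrastive pre-training is what is meant to make image and point features comparable.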
Related papers
- Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations? [55.99654128127689]
Cross-modal contrastive distillation has recently been explored for learning effective 3D representations. Existing methods focus primarily on modality-shared features, neglecting the modality-specific features during the pre-training process. We propose a new framework, namely CMCR, to address these shortcomings.
arXiv Detail & Related papers (2024-12-12T06:09:49Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- Continual Vision-Language Representation Learning with Off-Diagonal Information [112.39419069447902]
Multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training.
This paper discusses the feasibility of continual CLIP training using streaming data.
arXiv Detail & Related papers (2023-05-11T08:04:46Z)
- Vision Learners Meet Web Image-Text Pairs [32.36188289972377]
In this work, we consider self-supervised pre-training on noisy web-sourced image-text paired data.
We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training.
We present a new visual representation pre-training method, MUlti-modal Generator (MUG), that learns from scalable web-sourced image-text data.
arXiv Detail & Related papers (2023-01-17T18:53:24Z)
- Let Images Give You More: Point Cloud Cross-Modal Training for Shape Analysis [43.13887916301742]
This paper introduces a simple but effective point cloud cross-modality training (PointCMT) strategy to boost point cloud analysis.
To effectively acquire auxiliary knowledge from view images, we develop a teacher-student framework and formulate the cross-modal learning as a knowledge distillation problem.
We verify significant gains on various datasets when backbones such as PointNet++ and PointMLP are equipped with PointCMT.
arXiv Detail & Related papers (2022-10-09T09:35:22Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods, and comparable performance while being 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state-of-the-art on the widely used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Rectifying the Shortcut Learning of Background: Shared Object Concentration for Few-Shot Image Recognition [101.59989523028264]
Few-Shot image classification aims to utilize pretrained knowledge learned from a large-scale dataset to tackle a series of downstream classification tasks.
We propose COSOC, a novel Few-Shot Learning framework, to automatically figure out foreground objects at both the pretraining and evaluation stages.
arXiv Detail & Related papers (2021-07-16T07:46:41Z)
- Multimodal Contrastive Training for Visual Representation Learning [45.94662252627284]
We develop an approach to learning visual representations that embraces multimodal data.
Our method exploits intrinsic data properties within each modality and semantic information from cross-modal correlation simultaneously.
By including multimodal training in a unified framework, our method can learn more powerful and generic visual features.
arXiv Detail & Related papers (2021-04-26T19:23:36Z)
- Learning View-Disentangled Human Pose Representation by Contrastive Cross-View Mutual Information Maximization [33.36330493757669]
We introduce a novel representation learning method to disentangle pose-dependent as well as view-dependent factors from 2D human poses.
The method trains a network using cross-view mutual information (CV-MIM) which maximizes mutual information of the same pose performed from different viewpoints.
CV-MIM outperforms other competing methods by a large margin in the single-shot cross-view setting.
arXiv Detail & Related papers (2020-12-02T18:55:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.