Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval
- URL: http://arxiv.org/abs/2507.21489v1
- Date: Tue, 29 Jul 2025 04:11:05 GMT
- Title: Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval
- Authors: Zhichuan Wang, Yang Zhou, Zhe Liu, Rui Yu, Song Bai, Yulong Wang, Xinwei He, Xiang Bai,
- Abstract summary: Open-set 3D object retrieval is an emerging task aiming to retrieve 3D objects of unseen categories beyond the training set.<n>Existing methods typically utilize all modalities (i.e., voxels, point clouds, multi-view images) and train specific backbones before fusion.<n>We present a framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR.
- Score: 76.86914849263168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-set 3D object retrieval (3DOR) is an emerging task aiming to retrieve 3D objects of unseen categories beyond the training set. Existing methods typically utilize all modalities (i.e., voxels, point clouds, multi-view images) and train specific backbones before fusion. However, they still struggle to produce generalized representations due to insufficient 3D training data. Being contrastively pre-trained on web-scale image-text pairs, CLIP inherently produces generalized representations for a wide range of downstream tasks. Building upon it, we present a simple yet effective framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR. DAC innovatively synergizes a CLIP model with a multi-modal large language model (MLLM) to learn generalized 3D representations, where the MLLM is used for dual purposes. First, it describes the seen category information to align with CLIP's training objective for adaptation during training. Second, it provides external hints about unknown objects complementary to visual cues during inference. To improve the synergy, we introduce an Additive-Bias Low-Rank adaptation (AB-LoRA), which alleviates overfitting and further enhances the generalization to unseen categories. With only multi-view images, DAC significantly surpasses prior arts by an average of +10.01\% mAP on four open-set 3DOR datasets. Moreover, its generalization is also validated on image-based and cross-dataset setups. Code is available at https://github.com/wangzhichuan123/DAC.
Related papers
- CLIP3D-AD: Extending CLIP for 3D Few-Shot Anomaly Detection with Multi-View Images Generation [22.850815902535988]
We propose CLIP3D-AD, an efficient 3D-FSAD method extended on CLIP.
We synthesize anomalous images on given normal images as sample pairs to adapt CLIP for 3D anomaly classification and segmentation.
Our method has a competitive performance of 3D few-shot anomaly classification and segmentation on MVTec-3D AD dataset.
arXiv Detail & Related papers (2024-06-27T07:13:09Z) - Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels [69.55622471172941]
Large-scale vision 2D vision language models, such as CLIP can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models.
We propose an optimization framework: Cross-MoST: Cross-Modal Self-Training, to improve the label-free classification performance of a zero-shot 3D vision model.
arXiv Detail & Related papers (2024-04-15T21:30:50Z) - TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding [28.112402580426174]
TriAdapter Multi-Modal Learning (TAMM) is a novel two-stage learning approach based on three synergistic adapters.
TAMM consistently enhances 3D representations for a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks.
arXiv Detail & Related papers (2024-02-28T17:18:38Z) - CLIPose: Category-Level Object Pose Estimation with Pre-trained
Vision-Language Knowledge [18.57081150228812]
We propose a novel 6D pose framework that employs the pre-trained vision-language model to develop better learning of object category information.
CLIPose achieves state-of-the-art performance on two mainstream benchmark datasets, REAL275 and CAMERA25, and runs in real-time during inference (40FPS)
arXiv Detail & Related papers (2024-02-24T05:31:53Z) - GS-CLIP: Gaussian Splatting for Contrastive Language-Image-3D
Pretraining from Real-World Data [73.06536202251915]
3D Shape represented as point cloud has achieve advancements in multimodal pre-training to align image and language descriptions.
We propose GS-CLIP for the first attempt to introduce 3DGS into multimodal pre-training to enhance 3D representation.
arXiv Detail & Related papers (2024-02-09T05:46:47Z) - Beyond First Impressions: Integrating Joint Multi-modal Cues for
Comprehensive 3D Representation [72.94143731623117]
Existing methods simply align 3D representations with single-view 2D images and coarse-grained parent category text.
Insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space.
We propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image.
arXiv Detail & Related papers (2023-08-06T01:11:40Z) - Multiview Compressive Coding for 3D Reconstruction [77.95706553743626]
We introduce a simple framework that operates on 3D points of single objects or whole scenes.
Our model, Multiview Compressive Coding, learns to compress the input appearance and geometry to predict the 3D structure.
arXiv Detail & Related papers (2023-01-19T18:59:52Z) - ULIP: Learning a Unified Representation of Language, Images, and Point
Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small number of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.