ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language Models
- URL: http://arxiv.org/abs/2412.06292v1
- Date: Mon, 09 Dec 2024 08:31:57 GMT
- Title: ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language Models
- Authors: Bingchen Gong, Diego Gomez, Abdullah Hamdi, Abdelrahman Eldesokey, Ahmed Abdelreheem, Peter Wonka, Maks Ovsjanikov,
- Abstract summary: We propose a novel zero-shot approach for keypoint detection on 3D shapes.
Our method utilizes the rich knowledge embedded within Multi-Modal Large Language Models.
- Score: 57.57832348655715
- License:
- Abstract: We propose a novel zero-shot approach for keypoint detection on 3D shapes. Point-level reasoning on visual data is challenging as it requires precise localization capability, posing problems even for powerful models like DINO or CLIP. Traditional methods for 3D keypoint detection rely heavily on annotated 3D datasets and extensive supervised training, limiting their scalability and applicability to new categories or domains. In contrast, our method utilizes the rich knowledge embedded within Multi-Modal Large Language Models (MLLMs). Specifically, we demonstrate, for the first time, that pixel-level annotations used to train recent MLLMs can be exploited for both extracting and naming salient keypoints on 3D models without any ground truth labels or supervision. Experimental evaluations demonstrate that our approach achieves competitive performance on standard benchmarks compared to supervised methods, despite not requiring any 3D keypoint annotations during training. Our results highlight the potential of integrating language models for localized 3D shape understanding. This work opens new avenues for cross-modal learning and underscores the effectiveness of MLLMs in contributing to 3D computer vision challenges.
Related papers
- CL3DOR: Contrastive Learning for 3D Large Multimodal Models via Odds Ratio on High-Resolution Point Clouds [1.9643285694999641]
We propose Contrastive Learning for 3D large multimodal models via Odds ratio on high-Resolution point clouds.
CL3DOR achieves state-of-the-art performance in 3D scene understanding and reasoning benchmarks.
arXiv Detail & Related papers (2025-01-07T15:42:32Z) - ObjVariantEnsemble: Advancing Point Cloud LLM Evaluation in Challenging Scenes with Subtly Distinguished Objects [1.5408065585641535]
3D scene understanding is an important task, and there has been a recent surge of research interest in aligning 3D representations of point clouds with text to empower embodied AI.
Due to the lack of comprehensive 3D benchmarks, the capabilities of 3D models in real-world scenes, particularly those that are challenging with subtly distinguished objects, remain insufficiently investigated.
arXiv Detail & Related papers (2024-12-19T13:27:58Z) - LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks.
In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and inability to handle camera focal variations.
We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
arXiv Detail & Related papers (2024-08-14T10:00:16Z) - Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image [70.02187124865627]
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene.
We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes.
We demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection.
arXiv Detail & Related papers (2024-07-07T04:50:04Z) - Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z) - Back to 3D: Few-Shot 3D Keypoint Detection with Back-Projected 2D Features [64.39691149255717]
Keypoint detection on 3D shapes requires semantic and geometric awareness while demanding high localization accuracy.
We employ a keypoint candidate optimization module which aims to match the average observed distribution of keypoints on the shape.
The resulting approach achieves a new state of the art for few-shot keypoint detection on the KeyPointNet dataset.
arXiv Detail & Related papers (2023-11-29T21:58:41Z) - Leveraging Large-Scale Pretrained Vision Foundation Models for
Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z) - RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding [46.253711788685536]
We introduce a 3D-aware SFusion strategy that fuses 3D vision-language pairs derived from multiple 2D foundation models.
We devise a region-aware point-discriminative contrastive learning objective to enable robust and effective 3D learning.
Our model outperforms prior 3D open-world scene understanding approaches by an average of 17.2% and 9.1% for semantic and instance segmentation.
arXiv Detail & Related papers (2023-04-03T13:30:04Z) - Prompt-guided Scene Generation for 3D Zero-Shot Learning [8.658191774247944]
We propose a prompt-guided 3D scene generation and supervision method to augment 3D data to learn the network better.
First, we merge point clouds of two 3D models in certain ways described by a prompt. The prompt acts like the annotation describing each 3D scene.
We have achieved state-of-the-art ZSL and generalized ZSL performance on synthetic (ModelNet40, ModelNet10) and real-scanned (ScanOjbectNN) 3D object datasets.
arXiv Detail & Related papers (2022-09-29T11:24:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.