JM3D & JM3D-LLM: Elevating 3D Understanding with Joint Multi-modal Cues
- URL: http://arxiv.org/abs/2310.09503v3
- Date: Thu, 25 Jan 2024 06:34:09 GMT
- Title: JM3D & JM3D-LLM: Elevating 3D Understanding with Joint Multi-modal Cues
- Authors: Jiayi Ji, Haowei Wang, Changli Wu, Yiwei Ma, Xiaoshuai Sun, Rongrong Ji
- Abstract summary: We introduce JM3D, a comprehensive approach integrating point cloud, text, and image.
Key contributions include the Structured Multimodal Organizer (SMO), enriching vision-language representation with multiple views and hierarchical text.
Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning.
- Score: 68.76032126906743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rising importance of 3D understanding, pivotal in computer vision,
autonomous driving, and robotics, is evident. However, a prevailing trend,
which straightforwardly resorts to transferring 2D alignment strategies to the
3D domain, encounters three distinct challenges: (1) Information Degradation:
This arises from the alignment of 3D data with mere single-view 2D images and
generic texts, neglecting the need for multi-view images and detailed
subcategory texts. (2) Insufficient Synergy: These strategies align 3D
representations to image and text features individually, hampering the overall
optimization for 3D models. (3) Underutilization: The fine-grained information
inherent in the learned representations is often not fully exploited,
indicating a potential loss in detail. To address these issues, we introduce
JM3D, a comprehensive approach integrating point cloud, text, and image. Key
contributions include the Structured Multimodal Organizer (SMO), enriching
vision-language representation with multiple views and hierarchical text, and
the Joint Multi-modal Alignment (JMA), combining language understanding with
visual representation. Our advanced model, JM3D-LLM, marries 3D representation
with large language models via efficient fine-tuning. Evaluations on ModelNet40
and ScanObjectNN establish JM3D's superiority. The superior performance of
JM3D-LLM further underscores the effectiveness of our representation transfer
approach. Our code and models are available at https://github.com/Mr-Neko/JM3D.
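The abstract describes the design at a high level only, so a brief illustrative sketch may help. The following is a minimal, hypothetical PyTorch rendering of a JMA-style objective: point cloud embeddings are contrastively aligned to a single joint target built from multi-view image features and hierarchical text features (the two ingredients SMO is said to provide). The function name, tensor shapes, and the simple averaging used to form the joint target are assumptions for illustration, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of a JMA-style joint alignment objective, assuming precomputed
# embeddings. Names, shapes, and the averaging used to build the joint target
# are illustrative assumptions, not the authors' released code.
import torch
import torch.nn.functional as F

def joint_alignment_loss(point_feat, view_feats, text_feats, temperature=0.07):
    """point_feat: (B, D) point cloud embeddings from a 3D encoder.
    view_feats: (B, V, D) embeddings of V rendered views per object.
    text_feats: (B, T, D) embeddings of T hierarchical text prompts
    (e.g. parent-category plus subcategory descriptions)."""
    # Build one joint vision-language target per object instead of aligning
    # to image and text features separately (the "insufficient synergy" issue).
    joint = F.normalize(view_feats.mean(dim=1) + text_feats.mean(dim=1), dim=-1)
    point = F.normalize(point_feat, dim=-1)

    # Symmetric InfoNCE over the batch: each point cloud should match its own
    # joint target and repel the joint targets of the other objects.
    logits = point @ joint.t() / temperature                     # (B, B)
    labels = torch.arange(point.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Usage with random stand-in embeddings (8 objects, 4 views, 2 text levels).
B, V, T, D = 8, 4, 2, 512
loss = joint_alignment_loss(torch.randn(B, D), torch.randn(B, V, D), torch.randn(B, T, D))
```

Forming a single joint target, rather than two independent image and text losses, is the sense in which joint alignment addresses the insufficient-synergy challenge listed above.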
Related papers
- VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding [47.58359136198136]
VisionGPT-3D provides a versatile multimodal framework building upon the strengths of multimodal foundation models.
It seamlessly integrates various SOTA vision models and automates their selection.
It identifies suitable 3D mesh creation algorithms for 2D depth map analysis and generates optimal results from diverse multimodal inputs.
arXiv Detail & Related papers (2024-03-14T16:13:00Z)
- TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding [28.112402580426174]
TriAdapter Multi-Modal Learning (TAMM) is a novel two-stage learning approach based on three synergistic adapters.
TAMM consistently enhances 3D representations for a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks.
arXiv Detail & Related papers (2024-02-28T17:18:38Z)
- 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding [12.823274886850697]
We introduce a novel and efficient prompt tuning paradigm, 3DMIT.
This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with the 3D modality information.
We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain.
arXiv Detail & Related papers (2024-01-06T12:20:18Z)
- Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior [52.44678180286886]
Distillation from 2D diffusion models achieves excellent generalization and rich details without requiring any 3D data.
We propose Sherpa3D, a new text-to-3D framework that achieves high-fidelity, generalizability, and geometric consistency simultaneously.
arXiv Detail & Related papers (2023-12-11T18:59:18Z)
- Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training [51.632418297156605]
We introduce MixCon3D, a method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training.
We develop the 3D object-level representation from complementary perspectives, e.g., by combining multi-view rendered images with the point cloud.
Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment.
arXiv Detail & Related papers (2023-11-03T06:05:36Z)
- Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation [72.94143731623117]
Existing methods simply align 3D representations with single-view 2D images and coarse-grained parent-category text.
This insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space.
We propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image.
arXiv Detail & Related papers (2023-08-06T01:11:40Z)
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios and build well-aligned, instance-based text-image-point proxies from those complex scenarios (a generic form of this triplet objective is sketched below).
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
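For contrast with the joint target sketched earlier, the CLIP$^2$ entry above describes building instance-aligned text-image-point proxies and training them contrastively. Below is a minimal, hypothetical sketch of such a triplet objective with three pairwise losses; the function names, the equal weighting of the pairs, and the use of precomputed embeddings are assumptions for illustration, not the paper's code.

```python
# Minimal sketch of a triplet (text-image-point) contrastive objective in the
# spirit of language-image-point pretraining. All names and the equal weighting
# of the three pairwise losses are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of matched embeddings of shape (B, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def triplet_contrastive_loss(text_feat, image_feat, point_feat):
    """Pull each aligned text-image-point proxy together via all three pairwise losses."""
    return (info_nce(point_feat, text_feat)
            + info_nce(point_feat, image_feat)
            + info_nce(text_feat, image_feat)) / 3.0

# Usage with random stand-in embeddings for a batch of 8 aligned proxies.
B, D = 8, 512
loss = triplet_contrastive_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
```

Pairwise objectives of this kind align 3D features to image and text features individually, which is the "insufficient synergy" setting that JM3D's joint alignment is motivated by.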