SCA3D: Enhancing Cross-modal 3D Retrieval via 3D Shape and Caption Paired Data Augmentation
- URL: http://arxiv.org/abs/2502.19128v1
- Date: Wed, 26 Feb 2025 13:36:40 GMT
- Title: SCA3D: Enhancing Cross-modal 3D Retrieval via 3D Shape and Caption Paired Data Augmentation
- Authors: Junlong Ren, Hao Wu, Hui Xiong, Hao Wang
- Abstract summary: Cross-modal 3D retrieval aims to achieve mutual matching between text descriptions and 3D shapes. The scarcity and high cost of 3D data constrain the performance of existing cross-modal 3D retrieval methods. We introduce SCA3D, a novel 3D shape and caption online data augmentation method for cross-modal 3D retrieval.
- Score: 21.070154402838906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The cross-modal 3D retrieval task aims to achieve mutual matching between text descriptions and 3D shapes. This has the potential to enhance the interaction between natural language and the 3D environment, especially within robotics and embodied artificial intelligence (AI) applications. However, the scarcity and high cost of 3D data constrain the performance of existing cross-modal 3D retrieval methods, which rely heavily on features derived from a limited number of 3D shapes and therefore generalize poorly across diverse scenarios. To address this challenge, we introduce SCA3D, a novel 3D shape and caption online data augmentation method for cross-modal 3D retrieval. Our approach uses the LLaVA model to build a component library, captioning each segmented part of every 3D shape in the dataset. This enables the generation of extensive new 3D-text pairs containing new semantic features. We employ both inter- and intra-component distances to assemble components into a new 3D shape, ensuring that the components do not overlap and fit together closely. Text templates are then used to process the captions of each component and generate new text descriptions. In addition, we use unimodal encoders to extract embeddings for 3D shapes and texts from the enriched dataset, compute fine-grained cross-modal similarity using Earth Mover's Distance (EMD), and enhance cross-modal matching with contrastive learning, enabling bidirectional retrieval between texts and 3D shapes. Extensive experiments show that SCA3D outperforms previous works on the Text2Shape dataset, raising the Shape-to-Text RR@1 score from 20.03 to 27.22 and the Text-to-Shape RR@1 score from 13.12 to 16.67. Code is available at https://github.com/3DAgentWorld/SCA3D.
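For intuition, here is a minimal PyTorch sketch of the kind of matching the abstract describes: a Sinkhorn (entropy-regularized) approximation of EMD over sets of local shape and text features, plugged into a symmetric contrastive loss. This is an illustration rather than the paper's implementation; the feature dimensions, the Sinkhorn approximation, the uniform token weights, and the temperature tau are all assumptions.

```python
# Minimal sketch (not the SCA3D codebase) of EMD-based fine-grained
# cross-modal similarity plus a symmetric contrastive loss. The Sinkhorn
# approximation, uniform token weights, and all hyperparameters are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def emd_similarity(shape_tokens, text_tokens, n_iters=50, eps=0.1):
    """Approximate EMD similarity between two token sets.

    shape_tokens: (N, d) local 3D shape features (e.g. per-part embeddings)
    text_tokens:  (M, d) text token features
    """
    shape_tokens = F.normalize(shape_tokens, dim=-1)
    text_tokens = F.normalize(text_tokens, dim=-1)
    cost = 1.0 - shape_tokens @ text_tokens.T             # (N, M) cosine distance
    mu = torch.full((cost.size(0),), 1.0 / cost.size(0))  # uniform mass on parts
    nu = torch.full((cost.size(1),), 1.0 / cost.size(1))  # uniform mass on tokens
    K = torch.exp(-cost / eps)                            # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                              # Sinkhorn iterations
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    plan = u[:, None] * K * v[None, :]                    # soft transport plan
    return 1.0 - (plan * cost).sum()                      # low cost -> high similarity

def contrastive_loss(shape_sets, text_sets, tau=0.07):
    """Symmetric InfoNCE over a batch of paired (shape, text) token sets."""
    sims = torch.stack([
        torch.stack([emd_similarity(s, t) for t in text_sets])
        for s in shape_sets
    ])                                                    # (B, B) similarity matrix
    labels = torch.arange(len(shape_sets))
    return 0.5 * (F.cross_entropy(sims / tau, labels)
                  + F.cross_entropy(sims.T / tau, labels))

if __name__ == "__main__":
    torch.manual_seed(0)
    shapes = [torch.randn(8, 64) for _ in range(4)]       # 4 shapes, 8 parts each
    texts = [torch.randn(12, 64) for _ in range(4)]       # 4 captions, 12 tokens each
    print(contrastive_loss(shapes, texts).item())
```

The diagonal of the similarity matrix holds the matched shape-caption pairs, so minimizing the loss pulls matched pairs together in both retrieval directions.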
Related papers
- Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction [4.820576346277399]
Cross-modal 3D retrieval is a critical yet challenging task, aiming to achieve bi-directional retrieval between 3D and text modalities.
We propose to adopt multi-view images and point clouds to jointly represent 3D shapes, facilitating tri-modal alignment.
Our method significantly outperforms previous state-of-the-art methods in both shape-to-text and text-to-shape retrieval tasks.
arXiv Detail & Related papers (2025-04-02T08:29:42Z)
- AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene.
Existing approaches commonly encounter a shortage of text-3D pairs available for training.
We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z)
- Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation [2.3213238782019316]
GIMDiffusion is a novel Text-to-3D model that utilizes geometry images to efficiently represent 3D shapes using 2D images.
We exploit the rich 2D priors of existing Text-to-Image models such as Stable Diffusion.
In short, GIMDiffusion enables the generation of 3D assets at speeds comparable to current Text-to-Image models.
arXiv Detail & Related papers (2024-09-05T17:21:54Z)
- VolumeDiffusion: Flexible Text-to-3D Generation with Efficient Volumetric Encoder [56.59814904526965]
This paper introduces a pioneering 3D encoder designed for text-to-3D generation.
A lightweight network is developed to efficiently acquire feature volumes from multi-view images.
A diffusion model built on a 3D U-Net is then trained on these feature volumes for text-to-3D generation.
arXiv Detail & Related papers (2023-12-18T18:59:05Z)
- Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior [52.44678180286886]
Distillation from 2D diffusion models achieves excellent generalization and rich details without any 3D data.
We propose Sherpa3D, a new text-to-3D framework that achieves high-fidelity, generalizability, and geometric consistency simultaneously.
arXiv Detail & Related papers (2023-12-11T18:59:18Z)
- TPA3D: Triplane Attention for Fast Text-to-3D Generation [28.33270078863519]
We propose Triplane Attention for text-guided 3D generation (TPA3D).
TPA3D is an end-to-end trainable GAN-based deep learning model for fast text-to-3D generation.
We show that TPA3D generates high-quality 3D textured shapes aligned with fine-grained descriptions.
arXiv Detail & Related papers (2023-12-05T10:39:37Z)
- 3DStyle-Diffusion: Pursuing Fine-grained Text-driven 3D Stylization with 2D Diffusion Models [102.75875255071246]
3D content creation via text-driven stylization has posed a fundamental challenge to the multimedia and graphics communities.
We propose a new 3DStyle-Diffusion model that triggers fine-grained stylization of 3D meshes with additional controllable appearance and geometric guidance from 2D Diffusion models.
arXiv Detail & Related papers (2023-11-09T15:51:27Z)
- OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data [15.53270401654078]
OVIR-3D is a method for open-vocabulary 3D object instance retrieval without using any 3D data for training.
This is achieved by multi-view fusion of text-aligned 2D region proposals into 3D space (see the sketch after this list).
Experiments on public datasets and a real robot show the effectiveness of the method and its potential for applications in robot navigation and manipulation.
arXiv Detail & Related papers (2023-11-06T05:00:00Z)
- T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation [52.029698642883226]
Text-to-3D methods leverage powerful pretrained diffusion models to optimize NeRF representations.
Most studies evaluate their results with subjective case studies and user experiments.
We introduce T$^3$Bench, the first comprehensive text-to-3D benchmark.
arXiv Detail & Related papers (2023-10-04T17:12:18Z)
- UniG3D: A Unified 3D Object Generation Dataset [75.49544172927749]
UniG3D is a unified 3D object generation dataset constructed by employing a universal data transformation pipeline on ShapeNet datasets.
This pipeline converts each raw 3D model into comprehensive multi-modal data representation.
The selection of data sources for our dataset is based on their scale and quality.
arXiv Detail & Related papers (2023-06-19T07:03:45Z)
- 3D Shape Knowledge Graph for Cross-domain 3D Shape Retrieval [20.880210749809642]
"geometric words" function as elemental constituents for representing entities through combinations.
Each 3D or 2D entity can anchor its geometric terms within the knowledge graph, thereby serving as a link between cross-domain data.
We evaluate the proposed method's performance on the ModelNet40 and ShapeNetCore55 datasets.
arXiv Detail & Related papers (2022-10-27T02:51:24Z)
- Stereo Object Matching Network [78.35697025102334]
This paper presents a stereo object matching method that exploits both 2D contextual information from images and 3D object-level information.
We present two novel strategies to handle 3D objectness in the cost volume space: selective sampling (RoISelect) and 2D-3D fusion.
arXiv Detail & Related papers (2021-03-23T12:54:43Z)
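As referenced in the OVIR-3D entry above, here is a toy sketch of multi-view fusion of text-aligned 2D region features into 3D. It is hypothetical code, not the paper's pipeline: the camera conventions, the per-pixel feature lookup, and the simple averaging are all assumptions.

```python
# Toy sketch (not OVIR-3D's code) of lifting text-aligned 2D features
# into 3D: project each point into every view, gather the 2D feature at
# its pixel, and average over the views that see it.
import numpy as np

def fuse_multiview(points, cameras, feat_maps):
    """points: (P, 3) world coordinates.
    cameras: list of (K, T), with K a 3x3 intrinsic and T a 4x4 world-to-camera matrix.
    feat_maps: list of (H, W, d) text-aligned 2D feature maps, one per view."""
    P, d = points.shape[0], feat_maps[0].shape[-1]
    acc = np.zeros((P, d))                              # summed per-point features
    cnt = np.zeros((P, 1))                              # number of views seeing each point
    homog = np.hstack([points, np.ones((P, 1))])        # homogeneous coordinates
    for (K, T), fmap in zip(cameras, feat_maps):
        cam = (T @ homog.T).T[:, :3]                    # world -> camera frame
        uvw = (K @ cam.T).T                             # perspective projection
        uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-9, None)
        H, W, _ = fmap.shape
        seen = (cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) \
               & (uv[:, 1] >= 0) & (uv[:, 1] < H)       # in front of camera, inside frame
        px = uv[seen].astype(int)
        acc[seen] += fmap[px[:, 1], px[:, 0]]           # gather 2D feature at the pixel
        cnt[seen] += 1
    return acc / np.maximum(cnt, 1)                     # average; zeros if never seen
```

A per-point fused feature like this can then be grouped into instances and compared against a text query embedding, which is the flavor of open-vocabulary retrieval the summary describes.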
This list is automatically generated from the titles and abstracts of the papers on this site.