Fine-grained Text and Image Guided Point Cloud Completion with CLIP Model
- URL: http://arxiv.org/abs/2308.08754v1
- Date: Thu, 17 Aug 2023 03:05:18 GMT
- Title: Fine-grained Text and Image Guided Point Cloud Completion with CLIP Model
- Authors: Wei Song, Jun Zhou, Mingjie Wang, Hongchen Tan, Nannan Li, Xiuping Liu
- Abstract summary: We propose a novel multimodal fusion network for point cloud completion.
We employ a pre-trained vision-language model trained on a large number of image-text pairs.
To further explore the effectiveness of fine-grained text descriptions for point cloud completion, we also build a text corpus with fine-grained descriptions.
- Score: 15.625396852353655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper focuses on the recently popular task of point cloud completion
guided by multimodal information. Although existing methods have achieved
excellent performance by fusing auxiliary images, there are still some
deficiencies, including poor model generalization and insufficient
fine-grained semantic information in the extracted features. In this
work, we propose a novel multimodal fusion network for point cloud completion,
which can simultaneously fuse visual and textual information to predict the
semantic and geometric characteristics of incomplete shapes effectively.
Specifically, to overcome the lack of prior information caused by the
small-scale dataset, we employ a pre-trained vision-language model trained
on a large number of image-text pairs, so its textual and visual encoders
offer stronger generalization ability.
Then, we propose a multi-stage feature fusion strategy to fuse the textual and
visual features into the backbone network progressively. Meanwhile, to further
explore the effectiveness of fine-grained text descriptions for point cloud
completion, we also build a text corpus with fine-grained descriptions, which
can provide richer geometric details for 3D shapes. The rich text descriptions
can be used for training and evaluating our network. Extensive quantitative and
qualitative experiments demonstrate the superior performance of our method
compared to state-of-the-art point cloud completion networks.
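The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of the stated idea: features from frozen CLIP text and image encoders are fused progressively into a point-wise backbone at multiple stages. All module names, dimensions, and the concatenate-and-project fusion operator are illustrative assumptions, not the authors' implementation.
```python
# Hypothetical sketch of multi-stage text/image feature fusion for point
# cloud completion. CLIP features are assumed to come from frozen,
# pre-trained encoders; everything else is an illustrative placeholder.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Fuses a global CLIP feature into per-point features at one stage."""
    def __init__(self, point_dim: int, clip_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(point_dim + clip_dim, point_dim)

    def forward(self, point_feat, clip_feat):
        # point_feat: (B, N, point_dim); clip_feat: (B, clip_dim)
        expanded = clip_feat.unsqueeze(1).expand(-1, point_feat.size(1), -1)
        return self.proj(torch.cat([point_feat, expanded], dim=-1))

class MultimodalCompletionNet(nn.Module):
    def __init__(self, stage_dims=(128, 256, 512), clip_dim: int = 512):
        super().__init__()
        in_dims = (3,) + tuple(stage_dims[:-1])
        self.stages = nn.ModuleList(
            nn.Linear(i, o) for i, o in zip(in_dims, stage_dims))
        self.image_fusion = nn.ModuleList(
            FusionBlock(d, clip_dim) for d in stage_dims)
        self.text_fusion = nn.ModuleList(
            FusionBlock(d, clip_dim) for d in stage_dims)
        self.head = nn.Linear(stage_dims[-1], 3)  # per-point coordinates

    def forward(self, partial_points, image_feat, text_feat):
        # partial_points: (B, N, 3); image_feat, text_feat: (B, clip_dim),
        # e.g. image_feat = clip_model.encode_image(img),
        #      text_feat  = clip_model.encode_text(tokens), both frozen.
        x = partial_points
        for stage, f_img, f_txt in zip(self.stages, self.image_fusion,
                                       self.text_fusion):
            x = torch.relu(stage(x))
            x = f_img(x, image_feat)  # progressive visual fusion
            x = f_txt(x, text_feat)   # progressive textual fusion
        return self.head(x)           # predicted points: (B, N, 3)
```
In the paper's setting the per-point layers would be a full completion backbone rather than plain linear layers; the sketch only shows where the multi-stage fusion hooks in.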
Related papers
- Conditional Text-to-Image Generation with Reference Guidance [81.99538302576302]
This paper explores conditioning diffusion models on additional reference images that provide visual guidance for the particular subjects to be generated.
We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references.
Our expert plugins outperform existing methods on all tasks, each containing only 28.55M trainable parameters.
arXiv Detail & Related papers (2024-11-22T21:38:51Z)
- Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion [34.102157812175854]
We introduce EGIInet (Explicitly Guided Information Interaction Network), a model for the view-guided point cloud completion task.
EGIInet efficiently combines the information from two modalities by leveraging the geometric nature of the completion task.
We propose a novel explicitly guided information interaction strategy that helps the network identify critical information within images.
arXiv Detail & Related papers (2024-07-03T08:03:56Z)
- ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models [52.23899502520261]
We introduce a novel framework named ARTIST, which incorporates a dedicated textual diffusion model to focus specifically on learning text structures.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
This disentangled architecture design and training strategy significantly enhance the text rendering ability of the diffusion models for text-rich image generation.
arXiv Detail & Related papers (2024-06-17T19:31:24Z)
- Advanced Multimodal Deep Learning Architecture for Image-Text Matching [33.8315200009152]
Image-text matching is a key multimodal task that aims to model the semantic association between images and text as a matching relationship.
We introduce an advanced multimodal deep learning architecture, which combines the high-level abstract representation ability of deep neural networks for visual information with the advantages of natural language processing models for text semantic understanding.
Experiments show that, compared with existing image-text matching models, the optimized model achieves significantly improved performance on a range of benchmark datasets.
arXiv Detail & Related papers (2024-06-13T08:32:24Z)
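The matching relationship above is typically realized with a dual-encoder design scored by cosine similarity and trained with a symmetric contrastive loss. The sketch below shows that generic pattern only; it is not the cited paper's architecture, which the summary does not specify.
```python
# Generic dual-encoder image-text matching (illustrative, not the cited
# paper's model). Each modality is embedded into a shared space and
# matching is scored by cosine similarity.
import torch
import torch.nn.functional as F

def match_scores(image_emb: torch.Tensor, text_emb: torch.Tensor):
    """image_emb: (B, D), text_emb: (B, D) -> (B, B) similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return image_emb @ text_emb.t()  # entry (i, j): image i vs. text j

def contrastive_loss(scores: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE loss; matched pairs sit on the diagonal."""
    logits = scores / temperature
    targets = torch.arange(scores.size(0), device=scores.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```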
- Language-Assisted 3D Scene Understanding [17.663583203177197]
We propose a language-assisted approach to point cloud feature learning (LAST-PCL).
It achieves redundancy removal and feature dimensionality reduction without compromising the textual priors.
The proposed method learns semantically meaningful point cloud features and achieves state-of-the-art or comparable performance in 3D semantic segmentation, 3D object detection, and 3D scene classification tasks.
arXiv Detail & Related papers (2023-12-18T18:54:56Z)
- Human as Points: Explicit Point-based 3D Human Reconstruction from Single-view RGB Images [71.91424164693422]
We introduce an explicit point-based human reconstruction framework called HaP.
Our approach features fully explicit point cloud estimation, manipulation, generation, and refinement in 3D geometric space.
Our results may indicate a paradigm shift back to fully explicit, geometry-centric algorithm design.
arXiv Detail & Related papers (2023-11-06T05:52:29Z)
- See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data [22.53879737713057]
Zero-shot point cloud segmentation aims to make deep models capable of recognizing novel objects in point clouds that are unseen during training.
We propose a novel multi-modal zero-shot learning method to better utilize the complementary information of point clouds and images for more accurate visual-semantic alignment.
arXiv Detail & Related papers (2023-07-20T11:32:51Z)
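For context, a common recipe for zero-shot point cloud segmentation classifies each point by its similarity to text embeddings of class names, so unseen classes only require new prompts. The sketch below illustrates that general idea; it is not necessarily the exact alignment method of the paper above.
```python
# Illustrative zero-shot point labeling via visual-semantic alignment.
# Per-point features (projected into the text-embedding space) are
# assigned to the nearest class-name embedding.
import torch
import torch.nn.functional as F

def zero_shot_point_labels(point_feats: torch.Tensor,
                           class_text_embs: torch.Tensor) -> torch.Tensor:
    """point_feats: (N, D); class_text_embs: (C, D), one per class name.
    Returns per-point class indices of shape (N,)."""
    p = F.normalize(point_feats, dim=-1)
    t = F.normalize(class_text_embs, dim=-1)
    return (p @ t.t()).argmax(dim=-1)  # nearest class in embedding space
```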
- Ponder: Point Cloud Pre-training via Neural Rendering [93.34522605321514]
We propose a novel approach to self-supervised learning of point cloud representations via differentiable neural rendering.
The learned point-cloud representations can be easily integrated into various downstream tasks, including not only high-level tasks like 3D detection and segmentation, but also low-level tasks like 3D reconstruction and image rendering.
arXiv Detail & Related papers (2022-12-31T08:58:39Z)
- Self-Supervised Feature Learning from Partial Point Clouds via Pose Disentanglement [35.404285596482175]
We propose a novel self-supervised framework to learn informative representations from partial point clouds.
We leverage partial point clouds scanned by LiDAR that contain both content and pose attributes.
Our method not only outperforms existing self-supervised methods, but also shows better generalizability across synthetic and real-world datasets.
arXiv Detail & Related papers (2022-01-09T14:12:50Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
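The pixel-text matching idea lends itself to a compact sketch: dense visual features are compared against class text embeddings at every spatial location, yielding per-class score maps. The shapes and temperature below are illustrative assumptions rather than DenseCLIP's exact configuration.
```python
# Pixel-text score maps in the spirit of DenseCLIP (shapes assumed).
import torch
import torch.nn.functional as F

def pixel_text_score_maps(pixel_feats: torch.Tensor,
                          text_embs: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """pixel_feats: (B, D, H, W), dense features from the image encoder
    (final pooling removed); text_embs: (C, D), one per class prompt.
    Returns score maps of shape (B, C, H, W)."""
    pixel_feats = F.normalize(pixel_feats, dim=1)   # normalize channels
    text_embs = F.normalize(text_embs, dim=-1)
    scores = torch.einsum('bdhw,cd->bchw', pixel_feats, text_embs)
    return scores / temperature  # used to guide dense prediction learning
```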
- Voxel-based Network for Shape Completion by Leveraging Edge Generation [76.23436070605348]
We develop a voxel-based network for point cloud completion by leveraging edge generation (VE-PCN).
We first embed point clouds into regular voxel grids, and then generate complete objects with the help of the hallucinated shape edges.
This decoupled architecture, together with multi-scale grid feature learning, is able to generate more realistic on-surface details.
arXiv Detail & Related papers (2021-08-23T05:10:29Z)
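As a small illustration of the first step described above, the sketch below embeds a normalized point cloud into a binary voxel occupancy grid. The resolution is an assumed hyperparameter, and the edge-generation and decoding stages of VE-PCN are omitted.
```python
# Minimal voxelization sketch: point cloud -> binary occupancy grid.
import torch

def voxelize(points: torch.Tensor, resolution: int = 32) -> torch.Tensor:
    """points: (N, 3) with coordinates in [-1, 1] -> (R, R, R) grid."""
    idx = ((points.clamp(-1, 1 - 1e-6) + 1) * 0.5 * resolution).long()
    grid = torch.zeros(resolution, resolution, resolution)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid
```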