Related papers: Diff-3DCap: Shape Captioning with Diffusion Models

Diff-3DCap: Shape Captioning with Diffusion Models

URL: http://arxiv.org/abs/2509.23718v1
Date: Sun, 28 Sep 2025 07:59:22 GMT
Title: Diff-3DCap: Shape Captioning with Diffusion Models
Authors: Zhenyu Shu, Jiawei Wen, Shiyang Li, Shiqing Xin, Ligang Liu,
Abstract summary: We introduce Diff-3DCap, which employs a sequence of projected views to represent a 3D object and a continuous diffusion model to facilitate the captioning process.<n>Results of our experiments indicate that Diff-3DCap can achieve performance comparable to that of the current state-of-the-art methods.
Score: 27.69808457236316
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The task of 3D shape captioning occupies a significant place within the domain of computer graphics and has garnered considerable interest in recent years. Traditional approaches to this challenge frequently depend on the utilization of costly voxel representations or object detection techniques, yet often fail to deliver satisfactory outcomes. To address the above challenges, in this paper, we introduce Diff-3DCap, which employs a sequence of projected views to represent a 3D object and a continuous diffusion model to facilitate the captioning process. More precisely, our approach utilizes the continuous diffusion model to perturb the embedded captions during the forward phase by introducing Gaussian noise and then predicts the reconstructed annotation during the reverse phase. Embedded within the diffusion framework is a commitment to leveraging a visual embedding obtained from a pre-trained visual-language model, which naturally allows the embedding to serve as a guiding signal, eliminating the need for an additional classifier. Extensive results of our experiments indicate that Diff-3DCap can achieve performance comparable to that of the current state-of-the-art methods.

Related papers

Generative Spatiotemporal Data Augmentation [12.849046721804797]
We use video foundation models to generate realistic 3D spatial and temporal variations from an image dataset.<n> Incorporating synthesized video clips as supplemental data yields consistent performance gains in low-data settings.
arXiv Detail & Related papers (2025-12-14T01:18:48Z)
Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs [28.381287866505637]
We propose a reconstruction by generation pipeline that leverages learned priors from video diffusion models to provide plausible interpretations for regions outside the field of view or occluded.<n>We introduce a novel scene-grounding guidance based on rendered sequences from an optimized 3DGS, which tames the diffusion model to generate consistent sequences.<n>Our method significantly improves upon the baseline and achieves state-of-the-art performance on challenging benchmarks.
arXiv Detail & Related papers (2025-03-07T01:59:05Z)
SGD: Street View Synthesis with Gaussian Splatting and Diffusion Prior [53.52396082006044]
Current methods struggle to maintain rendering quality at the viewpoint that deviates significantly from the training viewpoints. This issue stems from the sparse training views captured by a fixed camera on a moving vehicle. We propose a novel approach that enhances the capacity of 3DGS by leveraging prior from a Diffusion Model.
arXiv Detail & Related papers (2024-03-29T09:20:29Z)
CAD: Photorealistic 3D Generation via Adversarial Distillation [28.07049413820128]
We propose a novel learning paradigm for 3D synthesis that utilizes pre-trained diffusion models. Our method unlocks the generation of high-fidelity and photorealistic 3D content conditioned on a single image and prompt.
arXiv Detail & Related papers (2023-12-11T18:59:58Z)
Diffusion-based 3D Object Detection with Random Boxes [58.43022365393569]
Existing anchor-based 3D detection methods rely on empiricals setting of anchors, which makes the algorithms lack elegance. Our proposed Diff3Det migrates the diffusion model to proposal generation for 3D object detection by considering the detection boxes as generative targets. In the inference stage, the model progressively refines a set of random boxes to the prediction results.
arXiv Detail & Related papers (2023-09-05T08:49:53Z)
Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views [47.215089338101066]
We present Sparse3D, a novel 3D reconstruction method tailored for sparse view inputs. Our approach distills robust priors from a multiview-consistent diffusion model to refine a neural radiance field. By tapping into 2D priors from powerful image diffusion models, our integrated model consistently delivers high-quality results.
arXiv Detail & Related papers (2023-08-27T11:52:00Z)
IT3D: Improved Text-to-3D Generation with Explicit View Synthesis [71.68595192524843]
This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues. Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images. For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data.
arXiv Detail & Related papers (2023-08-22T14:39:17Z)
Deceptive-NeRF/3DGS: Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction [60.52716381465063]
We introduce Deceptive-NeRF/3DGS to enhance sparse-view reconstruction with only a limited set of input images. Specifically, we propose a deceptive diffusion model turning noisy images rendered from few-view reconstructions into high-quality pseudo-observations. Our system progressively incorporates diffusion-generated pseudo-observations into the training image sets, ultimately densifying the sparse input observations by 5 to 10 times.
arXiv Detail & Related papers (2023-05-24T14:00:32Z)
Generative Novel View Synthesis with 3D-Aware Diffusion Models [96.78397108732233]
We present a diffusion-based model for 3D-aware generative novel view synthesis from as few as a single input image. Our method makes use of existing 2D diffusion backbones but, crucially, incorporates geometry priors in the form of a 3D feature volume. In addition to generating novel views, our method has the ability to autoregressively synthesize 3D-consistent sequences.
arXiv Detail & Related papers (2023-04-05T17:15:47Z)
LiP-Flow: Learning Inference-time Priors for Codec Avatars via Normalizing Flows in Latent Space [90.74976459491303]
We introduce a prior model that is conditioned on the runtime inputs and tie this prior space to the 3D face model via a normalizing flow in the latent space. A normalizing flow bridges the two representation spaces and transforms latent samples from one domain to another, allowing us to define a latent likelihood objective. We show that our approach leads to an expressive and effective prior, capturing facial dynamics and subtle expressions better.
arXiv Detail & Related papers (2022-03-15T13:22:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.