From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos
- URL: http://arxiv.org/abs/2507.20331v2
- Date: Tue, 29 Jul 2025 03:14:12 GMT
- Title: From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos
- Authors: Chenjian Gao, Lihe Ding, Rui Han, Zhanpeng Huang, Zibin Wang, Tianfan Xue,
- Abstract summary: 2D diffusion models have shown promise for producing photorealistic edits.<n>Traditional 3D rendering methods excel in spatial and temporal consistency but fall short in achieving photorealistic lighting.<n>This is the first approach to synergize 3D rendering and 2D diffusion for video object insertion.
- Score: 8.444819892052958
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inserting 3D objects into videos is a longstanding challenge in computer graphics with applications in augmented reality, virtual try-on, and video composition. Achieving both temporal consistency, or realistic lighting remains difficult, particularly in dynamic scenarios with complex object motion, perspective changes, and varying illumination. While 2D diffusion models have shown promise for producing photorealistic edits, they often struggle with maintaining temporal coherence across frames. Conversely, traditional 3D rendering methods excel in spatial and temporal consistency but fall short in achieving photorealistic lighting. In this work, we propose a hybrid object insertion pipeline that combines the strengths of both paradigms. Specifically, we focus on inserting bracelets into dynamic wrist scenes, leveraging the high temporal consistency of 3D Gaussian Splatting (3DGS) for initial rendering and refining the results using a 2D diffusion-based enhancement model to ensure realistic lighting interactions. Our method introduces a shading-driven pipeline that separates intrinsic object properties (albedo, shading, reflectance) and refines both shading and sRGB images for photorealism. To maintain temporal coherence, we optimize the 3DGS model with multi-frame weighted adjustments. This is the first approach to synergize 3D rendering and 2D diffusion for video object insertion, offering a robust solution for realistic and consistent video editing. Project Page: https://cjeen.github.io/BraceletPaper/
Related papers
- Layered Motion Fusion: Lifting Motion Segmentation to 3D in Egocentric Videos [71.24593306228145]
We propose to improve dynamic segmentation in 3D by fusing motion segmentation predictions from a 2D-based model into layered radiance fields.<n>We address this issue through test-time refinement, which helps the model to focus on specific frames, thereby reducing the data complexity.<n>This demonstrates that 3D techniques can enhance 2D analysis even for dynamic phenomena in a challenging and realistic setting.
arXiv Detail & Related papers (2025-06-05T19:46:48Z) - MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation [19.46962637673285]
MV-CoLight is a framework for illumination-consistent object compositing in 2D and 3D scenes.<n>We employ a Hilbert curve-based mapping to align 2D image inputs with 3D Gaussian scene representations seamlessly.<n> Experiments demonstrate state-of-the-art harmonized results across standard benchmarks and our dataset.
arXiv Detail & Related papers (2025-05-27T17:53:02Z) - 3D Object Manipulation in a Single Image using Generative Models [30.241857090353864]
We introduce textbfOMG3D, a novel framework that integrates the precise geometric control with the generative power of diffusion models.<n>Our framework first converts 2D objects into 3D, enabling user-directed modifications and lifelike motions at the geometric level.<n>Remarkably, all these steps can be done using one NVIDIA 3090.
arXiv Detail & Related papers (2025-01-22T15:06:30Z) - Real-time 3D-aware Portrait Video Relighting [89.41078798641732]
We present the first real-time 3D-aware method for relighting in-the-wild videos of talking faces based on Neural Radiance Fields (NeRF)
We infer an albedo tri-plane, as well as a shading tri-plane based on a desired lighting condition for each video frame with fast dual-encoders.
Our method runs at 32.98 fps on consumer-level hardware and achieves state-of-the-art results in terms of reconstruction quality, lighting error, lighting instability, temporal consistency and inference speed.
arXiv Detail & Related papers (2024-10-24T01:34:11Z) - Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models [54.35214051961381]
3D meshes are widely used in computer vision and graphics for their efficiency in animation and minimal memory use in movies, games, AR, and VR.<n>However, creating temporal consistent and realistic textures for mesh remains labor-intensive for professional artists.<n>We present 3D Tex sequences that integrates inherent geometry from mesh sequences with video diffusion models to produce consistent textures.
arXiv Detail & Related papers (2024-10-14T17:59:59Z) - Enhancing Single Image to 3D Generation using Gaussian Splatting and Hybrid Diffusion Priors [17.544733016978928]
3D object generation from a single image involves estimating the full 3D geometry and texture of unseen views from an unposed RGB image captured in the wild.
Recent advancements in 3D object generation have introduced techniques that reconstruct an object's 3D shape and texture.
We propose bridging the gap between 2D and 3D diffusion models to address this limitation.
arXiv Detail & Related papers (2024-10-12T10:14:11Z) - ARTIC3D: Learning Robust Articulated 3D Shapes from Noisy Web Image
Collections [71.46546520120162]
Estimating 3D articulated shapes like animal bodies from monocular images is inherently challenging.
We propose ARTIC3D, a self-supervised framework to reconstruct per-instance 3D shapes from a sparse image collection in-the-wild.
We produce realistic animations by fine-tuning the rendered shape and texture under rigid part transformations.
arXiv Detail & Related papers (2023-06-07T17:47:50Z) - A Shading-Guided Generative Implicit Model for Shape-Accurate 3D-Aware
Image Synthesis [163.96778522283967]
We propose a shading-guided generative implicit model that is able to learn a starkly improved shape representation.
An accurate 3D shape should also yield a realistic rendering under different lighting conditions.
Our experiments on multiple datasets show that the proposed approach achieves photorealistic 3D-aware image synthesis.
arXiv Detail & Related papers (2021-10-29T10:53:12Z) - Learning Indoor Inverse Rendering with 3D Spatially-Varying Lighting [149.1673041605155]
We address the problem of jointly estimating albedo, normals, depth and 3D spatially-varying lighting from a single image.
Most existing methods formulate the task as image-to-image translation, ignoring the 3D properties of the scene.
We propose a unified, learning-based inverse framework that formulates 3D spatially-varying lighting.
arXiv Detail & Related papers (2021-09-13T15:29:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.