Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups
- URL: http://arxiv.org/abs/2603.05507v1
- Date: Thu, 05 Mar 2026 18:59:59 GMT
- Title: Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups
- Authors: Leif Van Holland, Domenic Zingsheim, Mana Takhsha, Hannah Dröge, Patrick Stotko, Markus Plack, Reinhard Klein
- Abstract summary: High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. Existing approaches typically rely on simple heuristics for hole filling, which can result in inconsistencies or visual artifacts. We propose a novel, application-targeted inpainting method, independent of the underlying representation, applied as an image-based post-processing step after novel view rendering. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed.
- Score: 5.442308724054687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views, often due to real-time constraints, leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures using a novel, application-targeted inpainting method, independent of the underlying representation, applied as an image-based post-processing step after novel view rendering. The method is designed as a standalone module compatible with any calibrated multi-camera system. To this end, we introduce a multi-view-aware, transformer-based network architecture that uses spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, enabling real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image- and video-based metrics.
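No implementation accompanies this listing, so the following PyTorch sketch is purely illustrative: it shows how an image-based inpainting post-process with patch tokens, spatio-temporal embeddings (normalized patch coordinates plus a learned frame embedding), and adaptive patch selection could be assembled. All names and hyperparameters (`MultiViewInpainter`, `patch`, `dim`, `frame_emb`, ...) are assumptions, not the authors' API, and the paper's multi-view conditioning is not reproduced here.

```python
import torch
import torch.nn as nn


class MultiViewInpainter(nn.Module):
    """Sketch, not the paper's model: patch-token transformer that fills
    rendering holes as a post-process. Batch size 1 for clarity."""

    def __init__(self, patch=16, dim=256, depth=4, heads=8, max_frames=64):
        super().__init__()
        self.p = patch
        self.embed = nn.Linear(4 * patch * patch, dim)       # RGB + hole mask
        self.frame_emb = nn.Embedding(max_frames, dim)       # temporal embedding
        # resolution-independent spatial embedding from normalized coordinates
        self.pos_mlp = nn.Sequential(nn.Linear(2, dim), nn.GELU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 3 * patch * patch)

    def forward(self, rgb, mask, frame_idx):
        # rgb: (3, H, W) rendered novel view; mask: (1, H, W), 1 = missing.
        # H and W are assumed to be multiples of the patch size.
        p = self.p
        _, H, W = rgb.shape
        gh, gw = H // p, W // p
        x = torch.cat([rgb, mask], 0)                        # (4, H, W)
        tok = x.unfold(1, p, p).unfold(2, p, p)              # (4, gh, gw, p, p)
        tok = tok.permute(1, 2, 0, 3, 4).reshape(gh * gw, -1)
        # adaptive patch selection: only patches containing holes enter the
        # transformer, so attention cost tracks the amount of missing surface
        sel = tok[:, 3 * p * p:].amax(1) > 0
        if not sel.any():
            return rgb
        ys, xs = torch.meshgrid(torch.arange(gh), torch.arange(gw), indexing="ij")
        pos = torch.stack([(ys.flatten() + 0.5) / gh,
                           (xs.flatten() + 0.5) / gw], 1).float()
        feats = (self.embed(tok) + self.pos_mlp(pos)
                 + self.frame_emb(torch.tensor(frame_idx)))
        pred = self.head(self.encoder(feats[sel][None])[0]).reshape(-1, 3, p, p)
        out = rgb.clone()
        for k, i in enumerate(sel.nonzero().squeeze(1).tolist()):
            r, c = divmod(i, gw)
            out[:, r * p:(r + 1) * p, c * p:(c + 1) * p] = pred[k]
        # composite: keep all rendered pixels, write predictions into holes only
        return rgb * (1 - mask) + out * mask


if __name__ == "__main__":
    net = MultiViewInpainter()
    rgb = torch.rand(3, 256, 256)
    holes = (torch.rand(1, 256, 256) > 0.95).float()
    print(net(rgb, holes, frame_idx=0).shape)  # torch.Size([3, 256, 256])
```

Restricting attention to hole-containing patches is what makes a real-time budget plausible: self-attention cost grows quadratically with token count, and rendered novel views are typically mostly complete.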
Related papers
- ViewMorpher3D: A 3D-aware Diffusion Framework for Multi-Camera Novel View Synthesis in Autonomous Driving [20.935790354765604]
We introduce ViewMorpher3D, a multi-view image enhancement framework based on image diffusion models.
Unlike single-view approaches, ViewMorpher3D jointly processes a set of rendered views conditioned on camera poses, 3D geometric priors, and temporally adjacent or spatially overlapping reference views.
Our framework accommodates variable numbers of cameras and flexible reference/target view configurations, making it adaptable to diverse sensor setups.
arXiv Detail & Related papers (2026-01-12T13:44:14Z)
- MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds [19.94963757122156]
Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints.
We introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallic, roughness, diffuse shading, and surface normals from sequences of RGB images.
Our method achieves state-of-the-art performance in terms of multi-view consistency, material and normal estimation quality, and generalization to real-world imagery.
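As a rough, hypothetical illustration of such a feed-forward predictor (this is not the MVInverse architecture; all names are invented), a shared backbone with one lightweight head per output map might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InverseRenderingNet(nn.Module):
    # Shared convolutional backbone, one 1x1-conv head per predicted map.
    def __init__(self, feat=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        self.heads = nn.ModuleDict({
            "albedo": nn.Conv2d(feat, 3, 1),
            "metallic": nn.Conv2d(feat, 1, 1),
            "roughness": nn.Conv2d(feat, 1, 1),
            "diffuse_shading": nn.Conv2d(feat, 3, 1),  # unbounded: may exceed 1
            "normal": nn.Conv2d(feat, 3, 1),
        })

    def forward(self, views):
        # views: (V, 3, H, W) sequence of RGB images. Predictions here are
        # per view; the paper additionally enforces cross-view consistency,
        # which this sketch does not model.
        f = self.backbone(views)
        out = {name: head(f) for name, head in self.heads.items()}
        out["normal"] = F.normalize(out["normal"], dim=1)  # unit-length normals
        for k in ("albedo", "metallic", "roughness"):
            out[k] = out[k].sigmoid()                      # physical [0, 1] range
        return out


net = InverseRenderingNet()
maps = net(torch.rand(4, 3, 64, 64))
print({k: tuple(v.shape) for k, v in maps.items()})
```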
arXiv Detail & Related papers (2025-12-24T06:59:29Z)
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts.
Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion.
We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
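A minimal sketch of the idea, with hypothetical names and omitting the step that splats updated features back onto the frame grid: features are sampled along each track with `grid_sample`, then self-attention runs over time independently per track.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrackAttention(nn.Module):
    """Sketch: attend along point tracks so features follow object motion."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats, tracks):
        # feats: (T, C, H, W) per-frame feature maps, C must equal `dim`
        # tracks: (N, T, 2) point locations in [-1, 1] (grid_sample convention)
        T = feats.shape[0]
        N = tracks.shape[0]
        grid = tracks.permute(1, 0, 2).reshape(T, N, 1, 2)
        sampled = F.grid_sample(feats, grid, align_corners=False)  # (T, C, N, 1)
        tokens = sampled.squeeze(-1).permute(2, 0, 1)              # (N, T, C)
        # temporal self-attention, run independently per track: each track is
        # a batch element, so features mix only along its motion path
        out, _ = self.attn(tokens, tokens, tokens)
        return out                                                 # (N, T, C)


layer = TrackAttention(dim=64)
out = layer(torch.rand(8, 64, 32, 32), torch.rand(5, 8, 2) * 2 - 1)
print(out.shape)  # torch.Size([5, 8, 64])
```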
arXiv Detail & Related papers (2025-03-25T17:58:48Z)
- Portrait Video Editing Empowered by Multimodal Generative Priors [39.747581584889495]
We introduce PortraitGen, a powerful portrait video editing method that achieves consistent and expressive stylization with multimodal prompts.
Our approach incorporates multimodal inputs through knowledge distilled from large-scale 2D generative models.
Our system also incorporates expression similarity guidance and a face-aware portrait editing module, effectively mitigating degradation issues associated with iterative dataset updates.
arXiv Detail & Related papers (2024-09-20T15:45:13Z)
- View-consistent Object Removal in Radiance Fields [14.195400035176815]
Radiance Fields (RFs) have emerged as a crucial technology for 3D scene representation.
Current methods rely on per-frame 2D image inpainting, which often fails to maintain consistency across views.
We introduce a novel RF editing pipeline that significantly enhances consistency by requiring the inpainting of only a single reference image.
arXiv Detail & Related papers (2024-08-04T17:57:23Z)
- MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- DeepMultiCap: Performance Capture of Multiple Characters Using Sparse Multiview Cameras [63.186486240525554]
DeepMultiCap is a novel method for multi-person performance capture using sparse multi-view cameras.
Our method can capture time-varying surface details without the need for pre-scanned template models.
arXiv Detail & Related papers (2021-05-01T14:32:13Z)
- Is Space-Time Attention All You Need for Video Understanding? [50.78676438502343]
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
"TimeSformer" adapts the standard Transformer architecture to video by enabling feature learning from a sequence of frame-level patches.
TimeSformer achieves state-of-the-art results on several major action recognition benchmarks.
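For intuition, a divided space-time attention block in the spirit of TimeSformer can be written as follows (a sketch, not the official implementation; classification-token handling is omitted and all names are assumptions):

```python
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """Temporal attention across frames, then spatial attention within frames."""

    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.t_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.s_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, T, S):
        # x: (B, T*S, C) frame-level patch tokens, ordered frame by frame
        B, _, C = x.shape
        # temporal attention: each spatial position attends across the T frames
        xt = self.norm1(x).reshape(B, T, S, C).transpose(1, 2).reshape(B * S, T, C)
        xt = self.t_attn(xt, xt, xt)[0]
        x = x + xt.reshape(B, S, T, C).transpose(1, 2).reshape(B, T * S, C)
        # spatial attention: patches attend to each other within their own frame
        xs = self.norm2(x).reshape(B * T, S, C)
        xs = self.s_attn(xs, xs, xs)[0]
        x = x + xs.reshape(B, T * S, C)
        return x + self.mlp(self.norm3(x))


# smoke test: 2 clips, 4 frames, 49 patches per frame
block = DividedSpaceTimeBlock()
print(block(torch.rand(2, 4 * 49, 192), T=4, S=49).shape)  # torch.Size([2, 196, 192])
```

Splitting the attention this way reduces per-block cost from O((T·S)²) for joint space-time attention to O(T²·S + S²·T).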
arXiv Detail & Related papers (2021-02-09T19:49:33Z)
- Task-agnostic Temporally Consistent Facial Video Editing [84.62351915301795]
We propose a task-agnostic, temporally consistent facial video editing framework.
Based on a 3D reconstruction model, our framework is designed to handle several editing tasks in a more unified and disentangled manner.
Compared with the state-of-the-art facial image editing methods, our framework generates video portraits that are more photo-realistic and temporally smooth.
arXiv Detail & Related papers (2020-07-03T02:49:20Z)