Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization
- URL: http://arxiv.org/abs/2508.14811v1
- Date: Wed, 20 Aug 2025 16:02:59 GMT
- Title: Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization
- Authors: Canyu Zhao, Xiaoman Li, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen
- Abstract summary: We introduce Tinker, a versatile framework for high-fidelity 3D editing. Tinker delivers robust, multi-view consistent edits from as few as one or two images. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing.
- Score: 42.00640307135371
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker
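The abstract outlines a two-stage, feed-forward design: a referring multi-view editor followed by an any-view-to-video synthesizer. The Python sketch below shows how such a pipeline might compose. Every class, function name, and signature here is a hypothetical illustration inferred from the abstract, not the authors' actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class View:
    image: object        # H x W x 3 pixel array (placeholder type)
    camera_pose: object  # 4 x 4 camera extrinsics (placeholder type)

def tinker_edit(
    input_views: List[View],
    reference: View,
    prompt: str,
    target_poses: List[object],
    multi_view_editor: Callable,  # stage 1: referring multi-view editor
    view_to_video: Callable,      # stage 2: any-view-to-video synthesizer
) -> List[View]:
    """Feed-forward 3D edit: no per-scene finetuning anywhere."""
    # Stage 1: reference-driven edit applied coherently to the sparse
    # inputs (one or two views suffice, per the abstract).
    edited_views = multi_view_editor(input_views, reference, prompt)
    # Stage 2: video-diffusion priors complete the scene and synthesize
    # the requested novel viewpoints from the sparse edited inputs.
    return view_to_video(edited_views, target_poses)
```

The property the sketch captures is that both stages are pretrained and frozen: editing a new scene amounts to two forward passes, with no per-scene optimization loop.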
Related papers
- VIRGi: View-dependent Instant Recoloring of 3D Gaussians Splats [53.602701067430075]
We introduce VIRGi, a novel approach for rapidly editing the color of scenes modeled by 3DGS. By fine-tuning the weights from a single user edit, the color edits are seamlessly propagated to the entire scene in just two seconds. An exhaustive validation on diverse datasets demonstrates significant quantitative and qualitative advancements over competitors.
arXiv Detail & Related papers (2026-03-03T13:41:17Z)
- Fast Multi-view Consistent 3D Editing with Video Priors [19.790628738739354]
We propose generative Video Prior based 3D Editing (ViP3DE). Our key insight is to condition the video generation model on a single edited view to generate other consistent edited views for 3D updating directly. Our proposed ViP3DE can achieve high-quality 3D editing results even within a single forward pass, significantly outperforming existing methods in both editing quality and speed.
arXiv Detail & Related papers (2025-11-28T13:31:10Z)
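ViP3DE's stated insight, editing one view and letting the video prior propagate it, fits in a few lines of control flow. The sketch below is a hypothetical paraphrase; `edit_2d`, `video_model`, and `update_3d` are placeholder callables, not the paper's code.

```python
def vip3de_edit(views, prompt, edit_2d, video_model, update_3d):
    """Single-forward-pass 3D edit via a video prior (illustrative only)."""
    anchor = edit_2d(views[0], prompt)            # edit exactly one view
    # Condition the video model on the edited frame so the remaining
    # views come out consistent with it.
    propagated = video_model(condition=anchor, num_frames=len(views) - 1)
    return update_3d([anchor, *propagated])       # direct 3D update
```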
- DisCo3D: Distilling Multi-View Consistency for 3D Scene Editing [12.383291424229448]
We propose DisCo3D, a novel framework that distills 3D consistency priors into a 2D editor. Our method first fine-tunes a 3D generator using multi-view inputs for scene adaptation, then trains a 2D editor through consistency distillation. Experimental results show DisCo3D achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality.
arXiv Detail & Related papers (2025-08-03T09:27:41Z)
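DisCo3D's second stage trains the 2D editor against the scene-adapted 3D generator. A minimal sketch of what such a consistency-distillation objective could look like, assuming a teacher that renders 3D-consistent views and a student 2D editor (both stand-ins, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def consistency_distillation_loss(student_edit, teacher_render,
                                  views, poses, prompt):
    """Pull the 2D editor's per-view outputs toward the adapted 3D
    generator's renderings, which are consistent by construction."""
    losses = []
    for view, pose in zip(views, poses):
        target = teacher_render(pose).detach()   # 3D-consistent teacher view
        pred = student_edit(view, prompt)        # student's 2D edit
        losses.append(F.mse_loss(pred, target))
    return torch.stack(losses).mean()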
- Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy [36.08715662927022]
We present Shape-for-Motion, a novel framework that incorporates a 3D proxy for precise and consistent video editing. Our framework supports various precise and physically-consistent manipulations across the video frames, including pose editing, rotation, scaling, translation, texture modification, and object composition.
arXiv Detail & Related papers (2025-06-27T17:59:01Z)
- Pro3D-Editor: A Progressive-Views Perspective for Consistent and Precise 3D Editing [25.237699330731395]
Text-guided 3D editing aims to precisely edit semantically relevant local 3D regions. Existing methods typically edit 2D views indiscriminately and project them back into 3D space. We argue that ideal consistent 3D editing can be achieved through a progressive-views paradigm.
arXiv Detail & Related papers (2025-05-31T11:11:55Z)
- Portrait Video Editing Empowered by Multimodal Generative Priors [39.747581584889495]
We introduce PortraitGen, a powerful portrait video editing method that achieves consistent and expressive stylization with multimodal prompts.
Our approach incorporates multimodal inputs through knowledge distilled from large-scale 2D generative models.
Our system also incorporates expression similarity guidance and a face-aware portrait editing module, effectively mitigating degradation issues associated with iterative dataset updates.
arXiv Detail & Related papers (2024-09-20T15:45:13Z)
- 3DEgo: 3D Editing on the Go! [6.072473323242202]
We introduce 3DEgo to address a novel problem of directly synthesizing 3D scenes from monocular videos guided by textual prompts.
Our framework streamlines the conventional multi-stage 3D editing process into a single-stage workflow.
3DEgo demonstrates remarkable editing precision, speed, and adaptability across a variety of video sources.
arXiv Detail & Related papers (2024-07-14T07:03:50Z)
- SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing [58.22339174221563]
We propose SyncNoise, a novel geometry-guided multi-view consistent noise editing approach for high-fidelity 3D scene editing.
SyncNoise synchronously edits multiple views with 2D diffusion models while enforcing multi-view noise predictions to be geometrically consistent.
Our method achieves high-quality 3D editing results respecting the textual instructions, especially in scenes with complex textures.
arXiv Detail & Related papers (2024-06-25T09:17:35Z)
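SyncNoise's core mechanism is to make the per-view noise predictions of the 2D diffusion model agree wherever geometry says pixels correspond. Below is a hedged sketch of one way such synchronization could work, with warping callables standing in for the paper's geometry guidance; it is not the authors' implementation.

```python
def synchronize_noise(noise_preds, warp_to_ref, warp_from_ref):
    """noise_preds: per-view noise maps (H x W x C) from the 2D diffusion
    model; warp_to_ref / warp_from_ref: hypothetical warping callables
    built from multi-view pixel correspondences."""
    # Aggregate every view's noise prediction in a shared reference frame.
    consensus = sum(warp(eps) for warp, eps in zip(warp_to_ref, noise_preds))
    consensus = consensus / len(noise_preds)
    # Push the agreed-upon noise back out to each view, so all views
    # denoise toward the same geometry-respecting edit.
    return [unwarp(consensus) for unwarp in warp_from_ref]
```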
- View-Consistent 3D Editing with Gaussian Splatting [50.6460814430094]
View-consistent Editing (VcEdit) is a novel framework that seamlessly incorporates 3DGS into image editing processes. By incorporating consistency modules into an iterative pattern, VcEdit proficiently resolves the issue of multi-view inconsistency.
arXiv Detail & Related papers (2024-03-18T15:22:09Z)
- Efficient-NeRF2NeRF: Streamlining Text-Driven 3D Editing with Multiview Correspondence-Enhanced Diffusion Models [83.97844535389073]
A major obstacle hindering the widespread adoption of 3D content editing is its time-intensive processing.
We propose that by incorporating correspondence regularization into diffusion models, the process of 3D editing can be significantly accelerated.
In most scenarios, our proposed technique brings a 10× speed-up compared to the baseline method and completes the editing of a 3D scene in 2 minutes with comparable quality.
arXiv Detail & Related papers (2023-12-13T23:27:17Z)
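The correspondence-regularization idea can be illustrated with a small loss term that penalizes color disagreement between matched pixels of two edited views. This is a sketch under assumed tensor shapes and a hypothetical `matches` format, not the paper's actual regularizer.

```python
import torch

def correspondence_regularizer(view_a, view_b, matches):
    """view_a, view_b: edited images as (H, W, 3) tensors; matches: (N, 4)
    integer tensor of (ya, xa, yb, xb) corresponding pixel pairs."""
    colors_a = view_a[matches[:, 0], matches[:, 1]]   # sampled from view A
    colors_b = view_b[matches[:, 2], matches[:, 3]]   # their matches in B
    # Penalizing disagreement steers both views toward the same edit,
    # which is what lets the diffusion editing converge faster.
    return (colors_a - colors_b).pow(2).mean()
```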