Uni-Animator: Towards Unified Visual Colorization
- URL: http://arxiv.org/abs/2602.23191v2
- Date: Tue, 03 Mar 2026 07:49:53 GMT
- Title: Uni-Animator: Towards Unified Visual Colorization
- Authors: Xinyuan Chen, Yao Xu, Shaowen Wang, Pengjie Song, Bowen Deng,
- Abstract summary: We propose Uni-Animator, a novel framework for unified image and video sketch colorization. Existing sketch colorization methods struggle to unify image and video tasks. We introduce visual reference enhancement via instance patch embedding. We design physical detail reinforcement using physical features that effectively capture and retain high-frequency textures.
- Score: 23.467435361820392
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose Uni-Animator, a novel Diffusion Transformer (DiT)-based framework for unified image and video sketch colorization. Existing sketch colorization methods struggle to unify image and video tasks, suffering from imprecise color transfer with single or multiple references, inadequate preservation of high-frequency physical details, and compromised temporal coherence with motion artifacts in large-motion scenes. To tackle imprecise color transfer, we introduce visual reference enhancement via instance patch embedding, enabling precise alignment and fusion of reference color information. To resolve insufficient physical detail preservation, we design physical detail reinforcement using physical features that effectively capture and retain high-frequency textures. To mitigate motion-induced temporal inconsistency, we propose sketch-based dynamic RoPE encoding that adaptively models motion-aware spatial-temporal dependencies. Extensive experimental results demonstrate that Uni-Animator achieves competitive performance on both image and video sketch colorization, matching that of task-specific methods while unlocking unified cross-domain capabilities with high detail fidelity and robust temporal consistency.
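As a concrete illustration of the third component, the sketch below shows one way a motion-aware ("dynamic") RoPE could work: a per-frame motion magnitude estimated from the sketch sequence warps the temporal positions before the standard rotary rotation is applied. The function names, the cumulative-sum warping rule, and the motion input are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a motion-aware ("dynamic") RoPE; all names and the
# warping rule are assumptions for illustration only.
import torch

def rope_angles(positions, dim, base=10000.0):
    """Standard rotary angles: one frequency per channel pair."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions[:, None] * freqs[None, :]                # (T, dim/2)

def apply_rope(x, angles):
    """Rotate channel pairs of x (T, dim) by angles (T, dim/2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack((x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos), dim=-1).flatten(-2)

def dynamic_rope(q, sketch_motion, dim):
    # sketch_motion: per-frame motion magnitude, e.g. mean |sketch_t - sketch_{t-1}|.
    # Larger motion stretches the effective temporal distance between frames,
    # so attention decays faster across fast-moving segments.
    t = torch.cumsum(1.0 + sketch_motion, dim=0)              # motion-warped timeline
    return apply_rope(q, rope_angles(t, dim))

T, dim = 8, 64
q = torch.randn(T, dim)                    # one query row per frame
motion = torch.rand(T)                     # hypothetical sketch-difference magnitudes
print(dynamic_rope(q, motion, dim).shape)  # torch.Size([8, 64])
```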
Related papers
- VIRGi: View-dependent Instant Recoloring of 3D Gaussian Splats [53.602701067430075]
We introduce VIRGi, a novel approach for rapidly editing the color of scenes modeled by 3DGS. By fine-tuning weights from a single user edit, color changes are seamlessly propagated to the entire scene in just two seconds. An exhaustive validation on diverse datasets demonstrates significant quantitative and qualitative improvements over competing methods.
arXiv Detail & Related papers (2026-03-03T13:41:17Z) - IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation [58.297199313494]
Implicit methods capture motion semantics directly from the driving video, but suffer from identity leakage and entanglement between motion and appearance. We propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. Our method employs a three-stage training strategy to improve training efficiency and ensure high fidelity.
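A minimal sketch of the token-compression idea, assuming cross-attention pooling with learned queries; the token count, feature width, and single attention layer are illustrative choices, not IM-Animation's actual architecture:

```python
# Illustrative sketch: compress per-frame motion features into compact 1D
# tokens via cross-attention with learned queries. Sizes are assumptions.
import torch, torch.nn as nn

class MotionTokenizer(nn.Module):
    def __init__(self, feat_dim=256, num_tokens=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, frame_feats):                  # (B, N_patches, feat_dim)
        B = frame_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        tokens, _ = self.attn(q, frame_feats, frame_feats)
        return tokens                                 # (B, num_tokens, feat_dim)

feats = torch.randn(2, 196, 256)                      # one driving frame's features
print(MotionTokenizer()(feats).shape)                 # torch.Size([2, 16, 256])
```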
arXiv Detail & Related papers (2026-02-07T11:17:20Z) - Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence [81.82643953694485]
We present FRESCO, which integrates intra-frame correspondence with inter-frame correspondence to formulate a more robust spatial-temporal constraint. Our method goes beyond attention guidance to explicitly optimize features, achieving high spatial-temporal consistency with the input video. We validate FRESCO on two zero-shot tasks: video-to-video translation and text-guided video editing.
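The combined constraint can be pictured as two loss terms, sketched below under the assumption of flow-based warping for the inter-frame part and feature self-similarity for the intra-frame part; the weights and shapes are illustrative, not FRESCO's exact formulation.

```python
# Hedged sketch of a combined spatial-temporal consistency loss:
# inter-frame (flow-warped) agreement plus intra-frame self-similarity
# preservation. Loss weight and warping setup are assumptions.
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Backward-warp feat (B,C,H,W) with flow (B,2,H,W) in pixel units."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()               # (H, W, 2)
    grid = grid + flow.permute(0, 2, 3, 1)                     # add displacement
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1              # normalize to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
    return F.grid_sample(feat, grid, align_corners=True)

def st_consistency(feat_t, feat_prev, flow, w_spatial=0.1):
    temporal = F.l1_loss(feat_t, warp(feat_prev, flow))        # inter-frame term
    f, g = feat_t.flatten(2), feat_prev.flatten(2)             # (B, C, HW)
    sim_t = torch.einsum("bci,bcj->bij", f, f)                 # self-similarity maps
    sim_p = torch.einsum("bci,bcj->bij", g, g)
    spatial = F.l1_loss(sim_t, sim_p)                          # intra-frame term
    return temporal + w_spatial * spatial

feat_t, feat_prev = torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16)
flow = torch.zeros(1, 2, 16, 16)
print(st_consistency(feat_t, feat_prev, flow).item())
```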
arXiv Detail & Related papers (2025-12-03T15:51:11Z) - Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion. We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
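A toy version of attention along point tracks is sketched below: features are gathered at tracked locations, attended over the time axis per track, and written back. Integer track coordinates and the single attention layer are simplifying assumptions, not the paper's exact layer.

```python
# Minimal sketch of a "Tracktention"-style layer: temporal self-attention
# over features sampled along point tracks. Simplified to integer tracks.
import torch, torch.nn as nn

def track_attention(feats, tracks, attn):
    # feats: (T, C, H, W); tracks: (P, T, 2) integer (x, y) per frame.
    T, C, H, W = feats.shape
    x, y = tracks[..., 0], tracks[..., 1]                      # (P, T) each
    gathered = feats[torch.arange(T), :, y, x]                 # (P, T, C)
    out, _ = attn(gathered, gathered, gathered)                # attend over time
    feats = feats.clone()
    feats[torch.arange(T), :, y, x] = out                      # write back
    return feats

T, C, H, W, P = 4, 32, 16, 16, 10
feats = torch.randn(T, C, H, W)
tracks = torch.randint(0, 16, (P, T, 2))                       # toy point tracks
attn = nn.MultiheadAttention(C, num_heads=4, batch_first=True)
print(track_attention(feats, tracks, attn).shape)   # torch.Size([4, 32, 16, 16])
```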
arXiv Detail & Related papers (2025-03-25T17:58:48Z) - FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors [64.54220123913154]
We introduce FramePainter as an efficient instantiation of the image-to-video generation problem. It uses only a lightweight sparse control encoder to inject editing signals. It markedly outperforms previous state-of-the-art methods with far less training data.
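One plausible reading of a lightweight sparse control encoder is sketched below, in the ControlNet spirit, with a zero-initialized output projection so that training starts from the unmodified prior; layer sizes and the additive injection are assumptions, not FramePainter's actual design.

```python
# Hedged sketch of a sparse control encoder: a few convs over the sparse
# editing signal, added through a zero-initialized projection. Sizes assumed.
import torch, torch.nn as nn

class SparseControlEncoder(nn.Module):
    def __init__(self, in_ch=3, feat_ch=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, feat_ch, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.zero_proj = nn.Conv2d(feat_ch, feat_ch, 1)
        nn.init.zeros_(self.zero_proj.weight)       # no-op at initialization
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, backbone_feat, edit_signal):
        return backbone_feat + self.zero_proj(self.body(edit_signal))

enc = SparseControlEncoder()
feat = torch.randn(1, 128, 16, 16)                  # backbone feature map
edit = torch.zeros(1, 3, 64, 64)                    # e.g. sparse drag strokes
print(enc(feat, edit).shape)                        # torch.Size([1, 128, 16, 16])
```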
arXiv Detail & Related papers (2025-01-14T16:09:16Z) - DreamColour: Controllable Video Colour Editing without Training [80.90808879991182]
We present a training-free framework that makes precise video colour editing accessible through an intuitive interface. By decoupling the spatial and temporal aspects of colour editing, we can better align with users' natural workflow. Our approach matches or exceeds state-of-the-art methods while eliminating the need for training or specialized hardware.
arXiv Detail & Related papers (2024-12-06T16:57:54Z) - LatentColorization: Latent Diffusion-Based Speaker Video Colorization [1.2641141743223379]
We introduce a novel solution for achieving temporal consistency in video colorization.
We demonstrate strong improvements on established image quality metrics compared to other existing methods.
Our dataset encompasses a combination of conventional datasets and videos from television/movies.
arXiv Detail & Related papers (2024-05-09T12:06:06Z) - Histogram-guided Video Colorization Structure with Spatial-Temporal Connection [10.059070138875038]
We propose a Histogram-guided Video Colorization structure with Spatial-Temporal Connection (ST-HVC).
To fully exploit chroma and motion information, a joint flow-and-histogram module is tailored to integrate the histogram and flow features.
We show that the developed method achieves excellent performance, both quantitatively and qualitatively, on two video datasets.
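A hedged sketch of how histogram and flow features might be fused: a global colour histogram of the reference is embedded and broadcast over flow-warped features before a fusing convolution. The bin count, channel widths, and fusion layer are illustrative assumptions, not ST-HVC's module.

```python
# Illustrative fusion of a global color-histogram embedding with
# flow-warped features; all sizes and layers are assumptions.
import torch, torch.nn as nn

def color_histogram(img, bins=8):
    """Per-channel hard-binned histogram of img (B, 3, H, W), values in [0, 1]."""
    B = img.size(0)
    idx = (img.clamp(0, 1) * (bins - 1)).round().long().flatten(2)    # (B, 3, HW)
    hist = torch.zeros(B, 3, bins).scatter_add_(
        2, idx, torch.ones_like(idx, dtype=torch.float32))
    return (hist / idx.size(2)).flatten(1)                            # (B, 3*bins)

class FlowHistFusion(nn.Module):
    def __init__(self, feat_ch=64, bins=8):
        super().__init__()
        self.embed = nn.Linear(3 * bins, feat_ch)
        self.fuse = nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1)

    def forward(self, warped_feat, ref_img):         # warped_feat: (B, C, H, W)
        h = self.embed(color_histogram(ref_img))     # global color statistics
        h = h[:, :, None, None].expand_as(warped_feat)
        return self.fuse(torch.cat((warped_feat, h), dim=1))

warped = torch.randn(2, 64, 32, 32)                  # flow-warped previous frame
ref = torch.rand(2, 3, 64, 64)
print(FlowHistFusion()(warped, ref).shape)           # torch.Size([2, 64, 32, 32])
```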
arXiv Detail & Related papers (2023-08-09T11:59:18Z) - Temporally Consistent Video Colorization with Deep Feature Propagation and Self-regularization Learning [90.38674162878496]
We propose a novel temporally consistent video colorization framework (TCVC).
TCVC effectively propagates frame-level deep features in a bidirectional way to enhance the temporal consistency of colorization.
Experiments demonstrate that our method not only obtains visually pleasing colorized videos, but also achieves clearly better temporal consistency than state-of-the-art methods.
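Bidirectional propagation can be sketched as a forward and a backward blending pass over per-frame features, as below; the fixed blend weight and the pluggable warp function are assumptions, not TCVC's exact scheme.

```python
# Hedged sketch of bidirectional deep-feature propagation for temporal
# consistency; the blend weight alpha is an illustrative assumption.
import torch

def propagate(feats, warp_fn, alpha=0.5):
    """feats: list of (C, H, W) tensors, one per frame."""
    fwd = [feats[0]]
    for t in range(1, len(feats)):                   # forward pass
        fwd.append(alpha * feats[t] + (1 - alpha) * warp_fn(fwd[-1], t - 1, t))
    out = [None] * len(feats)
    out[-1] = fwd[-1]
    for t in range(len(feats) - 2, -1, -1):          # backward pass
        out[t] = alpha * fwd[t] + (1 - alpha) * warp_fn(out[t + 1], t + 1, t)
    return out

identity_warp = lambda f, src, dst: f                # stand-in for flow warping
frames = [torch.randn(8, 16, 16) for _ in range(5)]
print(len(propagate(frames, identity_warp)))         # 5
```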
arXiv Detail & Related papers (2021-10-09T13:00:14Z) - Line Art Correlation Matching Feature Transfer Network for Automatic Animation Colorization [0.0]
We propose a correlation matching feature transfer model (CMFT) to align colored reference features in a learnable way. This enables the generator to progressively transfer layer-wise synchronized features from the deep semantic code to the content.
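The core matching step can be sketched as a softmax over cosine similarities that transfers reference colour features onto target sketch positions; the temperature and shapes below are illustrative, and CMFT's layer-wise progressive scheme is more elaborate than this single-step version.

```python
# Hedged sketch of correlation-matching feature transfer: softmax over
# cosine similarities warps reference color features to the target sketch.
import torch
import torch.nn.functional as F

def correlation_transfer(sketch_feat, ref_feat, ref_color, tau=0.07):
    # sketch_feat, ref_feat: (N, C) feature tokens; ref_color: (N, C')
    s = F.normalize(sketch_feat, dim=-1)
    r = F.normalize(ref_feat, dim=-1)
    attn = torch.softmax(s @ r.t() / tau, dim=-1)    # (N, N) matching weights
    return attn @ ref_color                           # transferred color features

N, C, Cc = 256, 64, 32
out = correlation_transfer(torch.randn(N, C), torch.randn(N, C), torch.randn(N, Cc))
print(out.shape)                                      # torch.Size([256, 32])
```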
arXiv Detail & Related papers (2020-04-14T06:50:08Z)