TransPixeler: Advancing Text-to-Video Generation with Transparency
- URL: http://arxiv.org/abs/2501.03006v2
- Date: Mon, 20 Jan 2025 12:46:06 GMT
- Title: TransPixeler: Advancing Text-to-Video Generation with Transparency
- Authors: Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen
- Abstract summary: We introduce TransPixeler, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
- Score: 43.6546902960154
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixeler, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixeler leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixeler preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
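As a rough illustration of the mechanism the abstract describes (alpha-specific tokens appended to the DiT sequence, with only LoRA adapters trained so the pretrained RGB weights stay frozen), here is a minimal PyTorch sketch; module names, dimensions, and the attention layout are assumptions, not the authors' implementation.

```python
# Minimal sketch (not the TransPixeler code): a DiT-style attention layer over
# [text | RGB | alpha] tokens where only LoRA adapters are trainable.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained RGB weights intact
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # zero-init so the update starts as a no-op

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

class JointRGBAAttention(nn.Module):
    """Self-attention over concatenated text, RGB, and alpha tokens."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.qkv = LoRALinear(nn.Linear(dim, 3 * dim))
        self.proj = LoRALinear(nn.Linear(dim, dim))
        self.heads = heads

    def forward(self, text, rgb, alpha):
        x = torch.cat([text, rgb, alpha], dim=1)          # (B, L, D)
        b, l, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, l, self.heads, -1).transpose(1, 2)
        k = k.view(b, l, self.heads, -1).transpose(1, 2)
        v = v.view(b, l, self.heads, -1).transpose(1, 2)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        out = self.proj(out.transpose(1, 2).reshape(b, l, d))
        n_text, n_rgb = text.shape[1], rgb.shape[1]
        return out[:, :n_text], out[:, n_text:n_text + n_rgb], out[:, n_text + n_rgb:]

# toy usage: 8 text tokens, 16 RGB video tokens, 16 matching alpha tokens
attn = JointRGBAAttention()
t, r, a = torch.randn(1, 8, 64), torch.randn(1, 16, 64), torch.randn(1, 16, 64)
_, rgb_out, alpha_out = attn(t, r, a)
```

Because the alpha tokens share the frozen attention weights with the RGB tokens, the two streams stay aligned while only the small LoRA matrices are updated during fine-tuning.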
Related papers
- Trans-Adapter: A Plug-and-Play Framework for Transparent Image Inpainting [60.062438188868306]
Existing image inpainting methods are designed exclusively for RGB images.
Trans-Adapter is a plug-and-play adapter that enables diffusion-based inpainting models to process transparent images directly.
arXiv Detail & Related papers (2025-08-01T22:27:21Z)
- AlphaVAE: Unified End-to-End RGBA Image Reconstruction and Generation with Alpha-Aware Representation Learning [32.798523698352916]
We propose ALPHA, the first comprehensive RGBA benchmark that adapts standard RGB metrics to four-channel images via alpha blending over canonical backgrounds.
We further introduce ALPHAVAE, a unified end-to-end RGBA VAE that extends a pretrained RGB VAE by incorporating a dedicated alpha channel.
Our RGBA VAE, trained on only 8K images in contrast to 1M used by prior methods, achieves a +4.9 dB improvement in PSNR and a +3.2% increase in SSIM over LayerDiffuse in reconstruction.
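Adapting RGB metrics to RGBA images via alpha blending over canonical backgrounds, as this summary describes, can be sketched as follows; the choice of black and white backgrounds and the PSNR helper are illustrative assumptions rather than the paper's evaluation code.

```python
# Generic sketch: composite an RGBA image over canonical backgrounds so that
# standard three-channel metrics such as PSNR can be applied.
import numpy as np

def composite(rgba: np.ndarray, background: np.ndarray) -> np.ndarray:
    """Alpha-blend an HxWx4 float image in [0, 1] over an HxWx3 background."""
    rgb, alpha = rgba[..., :3], rgba[..., 3:4]
    return alpha * rgb + (1.0 - alpha) * background

def psnr(a: np.ndarray, b: np.ndarray) -> float:
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(1.0 / mse)

def rgba_psnr(pred: np.ndarray, target: np.ndarray) -> float:
    """Average PSNR over black and white canonical backgrounds (assumed choices)."""
    scores = []
    for value in (0.0, 1.0):
        bg = np.full(pred.shape[:2] + (3,), value)
        scores.append(psnr(composite(pred, bg), composite(target, bg)))
    return sum(scores) / len(scores)
```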
arXiv Detail & Related papers (2025-07-12T14:53:42Z)
- TransAnimate: Taming Layer Diffusion to Generate RGBA Video [3.7031943280491997]
TransAnimate is an innovative framework that integrates RGBA image generation techniques with video generation modules.
We introduce an interactive motion-guided control mechanism, where directional arrows define movement and colors adjust scaling.
We have developed a pipeline for creating an RGBA video dataset, incorporating high-quality game effect videos, extracted foreground objects, and synthetic transparent videos.
arXiv Detail & Related papers (2025-03-23T04:27:46Z)
- VELoRA: A Low-Rank Adaptation Approach for Efficient RGB-Event based Recognition [54.27379947727035]
This paper proposes a novel PEFT strategy to adapt pre-trained foundation vision models for RGB-Event-based classification.
The frame difference of the dual modalities is also considered to capture motion cues via a frame-difference backbone network.
The source code and pre-trained models will be released at https://github.com/Event-AHU/VELoRA.
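The frame-difference motion cue mentioned above reduces to a simple tensor operation; the snippet below is a generic sketch (the tensor layout is an assumption), not the released VELoRA code.

```python
import torch

def frame_difference(frames: torch.Tensor) -> torch.Tensor:
    """frames: (B, T, C, H, W); returns |frame_t - frame_{t-1}| as a motion cue."""
    return (frames[:, 1:] - frames[:, :-1]).abs()

# applied independently to RGB frames and event frames before a motion branch
rgb_motion = frame_difference(torch.rand(1, 8, 3, 224, 224))
```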
arXiv Detail & Related papers (2024-12-28T07:38:23Z)
- World-consistent Video Diffusion with Explicit 3D Modeling [67.39618291644673]
World-consistent Video Diffusion (WVD) is a novel framework that incorporates explicit 3D supervision using XYZ images.
We train a diffusion transformer to learn the joint distribution of RGB and XYZ frames.
WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation.
arXiv Detail & Related papers (2024-12-02T18:58:23Z)
- SeeClear: Semantic Distillation Enhances Pixel Condensation for Video Super-Resolution [35.894647722880805]
Diffusion-based Video Super-Resolution (VSR) is renowned for generating perceptually realistic videos.
We introduce SeeClear--a novel VSR framework leveraging conditional video generation, orchestrated by instance-centric and channel-wise semantic controls.
Our framework integrates a Semantic Distiller and a Pixel Condenser, which synergize to extract and upscale semantic details from low-resolution frames.
arXiv Detail & Related papers (2024-10-08T08:33:47Z)
- TVG: A Training-free Transition Video Generation Method with Diffusion Models [12.037716102326993]
Transition videos play a crucial role in media production, enhancing the flow and coherence of visual narratives.
Recent advances in diffusion model-based video generation offer new possibilities for creating transitions but face challenges such as poor inter-frame relationship modeling and abrupt content changes.
We propose a novel training-free Transition Video Generation (TVG) approach using video-level diffusion models that addresses these limitations without additional training.
arXiv Detail & Related papers (2024-08-24T00:33:14Z)
- SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval [82.51117533271517]
Previous works typically only encode RGB videos to obtain high-level semantic features.
Existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training.
We propose a novel sign language representation framework called Semantically Enhanced Dual-Stream.
arXiv Detail & Related papers (2024-07-23T11:31:11Z)
- ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection [51.16181295385818]
We first collect an annotated RGB-D video salient object detection dataset (ViDSOD-100), which contains 100 videos with a total of 9,362 frames.
All the frames in each video are manually annotated with high-quality saliency annotations.
We propose a new baseline model, named attentive triple-fusion network (ATF-Net) for RGB-D salient object detection.
arXiv Detail & Related papers (2024-06-18T12:09:43Z)
- Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion [12.982885009492389]
We show how to improve the performance of a visual Transformer by using direct fusion of cross-modal channel and spatial features.
CSTNet achieves state-of-the-art performance on three public RGB-T tracking benchmarks.
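Channel and spatial feature fusion of the kind this summary describes is commonly built from a squeeze-and-excitation style channel gate followed by a spatial attention map; the PyTorch module below is a generic sketch of that pattern with assumed layer sizes, not CSTNet itself.

```python
import torch
import torch.nn as nn

class ChannelSpatialFusion(nn.Module):
    """Fuse two modality feature maps with channel then spatial attention."""
    def __init__(self, channels: int = 64, reduction: int = 4):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, (2 * channels) // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d((2 * channels) // reduction, 2 * channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.reduce = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, thermal], dim=1)              # (B, 2C, H, W)
        x = x * self.channel_gate(x)                      # channel attention
        spatial = torch.cat([x.mean(1, keepdim=True),
                             x.amax(1, keepdim=True)], dim=1)
        x = x * self.spatial_gate(spatial)                # spatial attention
        return self.reduce(x)                             # back to C channels

fused = ChannelSpatialFusion()(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
```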
arXiv Detail & Related papers (2024-05-06T05:58:49Z)
- UniRGB-IR: A Unified Framework for RGB-Infrared Semantic Tasks via Adapter Tuning [17.36726475620881]
We propose a general and efficient framework called UniRGB-IR to unify RGB-IR semantic tasks.
A novel adapter is developed to efficiently introduce richer RGB-IR features into the pre-trained foundation model.
Experimental results on various RGB-IR downstream tasks demonstrate that our method can achieve state-of-the-art performance.
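Adapter tuning as described here usually means inserting small residual bottleneck modules into a frozen pretrained backbone; the sketch below shows that general pattern with assumed dimensions, not the UniRGB-IR adapter design.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck inserted alongside a frozen pretrained block."""
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # zero-init so the frozen model's
        nn.init.zeros_(self.up.bias)     # behaviour is preserved at the start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# toy usage: inject extra (e.g., RGB-IR) features while the backbone stays frozen
tokens = torch.rand(1, 196, 768)
tokens = BottleneckAdapter()(tokens)
```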
arXiv Detail & Related papers (2024-04-26T12:21:57Z)
- EvPlug: Learn a Plug-and-Play Module for Event and Image Fusion [55.367269556557645]
EvPlug learns a plug-and-play event and image fusion module from the supervision of the existing RGB-based model.
We demonstrate the superiority of EvPlug in several vision tasks such as object detection, semantic segmentation, and 3D hand pose estimation.
arXiv Detail & Related papers (2023-12-28T10:05:13Z)
- Recaptured Raw Screen Image and Video Demoiréing via Channel and Spatial Modulations [16.122531943812465]
We propose an image and video demoiréing network tailored for raw inputs.
We introduce a color-separated feature branch, which is fused with the traditional feature-mixed branch via channel and spatial modulations.
Experiments demonstrate that our method achieves state-of-the-art performance for both image and video demoiréing.
arXiv Detail & Related papers (2023-10-31T10:19:28Z)
- Learning Modal-Invariant and Temporal-Memory for Video-based Visible-Infrared Person Re-Identification [46.49866514866999]
We primarily study the video-based cross-modal person Re-ID method.
We show that performance improves as the number of frames in a tracklet increases.
A novel method is proposed, which projects two modalities to a modal-invariant subspace.
arXiv Detail & Related papers (2022-08-04T04:43:52Z)
- Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z)
- SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection [12.126413875108993]
We propose a cross-modality fusion model SwinNet for RGB-D and RGB-T salient object detection.
The proposed model outperforms the state-of-the-art models on RGB-D and RGB-T datasets.
arXiv Detail & Related papers (2022-04-12T07:37:39Z)
- Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient Crossmodality Guided Encoder to not only effectively recalibrate RGB feature responses, but also to distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.