Related papers: Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel

URL: http://arxiv.org/abs/2509.24979v2
Date: Tue, 30 Sep 2025 06:18:05 GMT
Title: Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel
Authors: Haotian Dong, Wenjing Wang, Chen Li, Di Lin
Abstract summary: Wan-Alpha is a new framework that generates transparent videos by learning both RGB and alpha channels jointly. Compared with state-of-the-art methods, our model demonstrates superior performance in visual quality, motion realism, and transparency rendering.
Score: 14.361698701397545
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: RGBA video generation, which includes an alpha channel to represent transparency, is gaining increasing attention across a wide range of applications. However, existing methods often neglect visual quality, limiting their practical usability. In this paper, we propose Wan-Alpha, a new framework that generates transparent videos by learning both RGB and alpha channels jointly. We design an effective variational autoencoder (VAE) that encodes the alpha channel into the RGB latent space. Then, to support the training of our diffusion transformer, we construct a high-quality and diverse RGBA video dataset. Compared with state-of-the-art methods, our model demonstrates superior performance in visual quality, motion realism, and transparency rendering. Notably, our model can generate a wide variety of semi-transparent objects, glowing effects, and fine-grained details such as hair strands. The released model is available on our website: https://donghaotian123.github.io/Wan-Alpha/.
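To make concrete what the alpha channel of an RGBA frame encodes, here is a minimal sketch of the standard "over" compositing rule, C = alpha * F + (1 - alpha) * B, applied to a single pixel. This is not code from the Wan-Alpha release; the function name `composite_over` and the sample values are hypothetical and chosen only for illustration.

```python
def composite_over(fg_rgb, alpha, bg_rgb):
    """Blend one foreground pixel over a background pixel.

    Channel values and alpha are floats in [0, 1]; alpha = 1 is fully
    opaque and alpha = 0 is fully transparent.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg_rgb, bg_rgb))


# Hypothetical example: a half-transparent red pixel over a white background.
pixel = composite_over((1.0, 0.0, 0.0), 0.5, (1.0, 1.0, 1.0))
print(pixel)  # (1.0, 0.5, 0.5)
```

Semi-transparent objects and glowing effects, as mentioned in the abstract, correspond to alpha values strictly between 0 and 1, which is why the same foreground looks different over different backgrounds.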

Related papers

OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation [43.93970229518124]
We propose OmniAlpha, the first unified, multi-task generative framework for sequence-to-sequence RGBA image generation and editing. Our work proves that a unified, multi-task model can learn a superior shared representation for RGBA, paving the way for more powerful, layer-aware generative systems.
arXiv Detail & Related papers (2025-11-25T11:34:51Z)
AlphaVAE: Unified End-to-End RGBA Image Reconstruction and Generation with Alpha-Aware Representation Learning [32.798523698352916]
We propose ALPHA, the first comprehensive RGBA benchmark that adapts standard RGB metrics to four-channel images via alpha blending over canonical backgrounds. We further introduce ALPHAVAE, a unified end-to-end RGBA VAE that extends a pretrained RGB VAE by incorporating a dedicated alpha channel. Our RGBA VAE, trained on only 8K images in contrast to 1M used by prior methods, achieves a +4.9 dB improvement in PSNR and a +3.2% increase in SSIM over LayerDiffuse in reconstruction.
arXiv Detail & Related papers (2025-07-12T14:53:42Z)
TransAnimate: Taming Layer Diffusion to Generate RGBA Video [3.7031943280491997]
TransAnimate is an innovative framework that integrates RGBA image generation techniques with video generation modules. We introduce an interactive motion-guided control mechanism, where directional arrows define movement and colors adjust scaling. We have developed a pipeline for creating an RGBA video dataset, incorporating high-quality game effect videos, extracted foreground objects, and synthetic transparent videos.
arXiv Detail & Related papers (2025-03-23T04:27:46Z)
Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT [98.56372305225271]
Lumina-Next achieves exceptional performance in the generation of images with Next-DiT. Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications. We propose Lumina-V2A, a video-to-audio model based on Next-DiT, to create synchronized sounds for generated videos.
arXiv Detail & Related papers (2025-02-10T18:58:11Z)
TransPixeler: Advancing Text-to-Video Generation with Transparency [43.6546902960154]
We introduce TransPixeler, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
arXiv Detail & Related papers (2025-01-06T13:32:16Z)
WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos. Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions. We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion. Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection [51.16181295385818]
We first collect an annotated RGB-D video SOD (ViDSOD-100) dataset, which contains 100 videos with a total of 9,362 frames. All the frames in each video are manually annotated with high-quality saliency annotations. We propose a new baseline model, named attentive triple-fusion network (ATF-Net), for RGB-D salient object detection.
arXiv Detail & Related papers (2024-06-18T12:09:43Z)
DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose a high-fidelity image-to-video generation method, named DreamVideo, by devising a frame retention branch based on a pre-trained video diffusion model. Our model has a powerful image retention ability and, to the best of our knowledge, delivers the best results on UCF101 among image-to-video models.
arXiv Detail & Related papers (2023-12-05T03:16:31Z)
LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8 to 16 videos on a single GPU. Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation. To capture the features of the temporal dimension, we expand the pretrained 2D convolution layers of the T2I model into our novel temporal-spatial motion learning layers.
arXiv Detail & Related papers (2023-10-16T19:03:19Z)
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods. Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
A Good Image Generator Is What You Need for High-Resolution Video Synthesis [73.82857768949651]
We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled.
arXiv Detail & Related papers (2021-04-30T15:38:41Z)
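The AlphaVAE entry above describes adapting standard RGB metrics to four-channel images via alpha blending over canonical backgrounds. The sketch below illustrates that general idea with PSNR in pure Python. It is an assumption-laden illustration, not the ALPHA benchmark's actual protocol: the choice of black and white as backgrounds, the averaging of per-background scores, and the function names `composite`, `psnr`, and `rgba_psnr` are all hypothetical.

```python
import math


def composite(rgba, bg):
    """Alpha-blend one RGBA pixel (floats in [0, 1]) over an RGB background."""
    r, g, b, a = rgba
    return tuple(a * c + (1.0 - a) * k for c, k in zip((r, g, b), bg))


def psnr(img_a, img_b, peak=1.0):
    """PSNR in dB between two equal-length lists of RGB pixels."""
    sq = [(x - y) ** 2 for pa, pb in zip(img_a, img_b) for x, y in zip(pa, pb)]
    mse = sum(sq) / len(sq)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def rgba_psnr(ref, test, backgrounds=((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))):
    """Average PSNR after compositing both images over each background.

    Assumption: black and white backgrounds with a plain mean; the real
    benchmark may use a different background set or aggregation.
    """
    scores = [
        psnr([composite(p, bg) for p in ref], [composite(p, bg) for p in test])
        for bg in backgrounds
    ]
    return sum(scores) / len(scores)


# Hypothetical one-pixel images: opaque red versus fully transparent red.
reference = [(1.0, 0.0, 0.0, 1.0)]
reconstruction = [(1.0, 0.0, 0.0, 0.0)]
print(rgba_psnr(reference, reconstruction))
```

Compositing before scoring means an error in the alpha channel shows up as a colour error against at least one background, so a plain RGB metric can still penalise it.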

This list is automatically generated from the titles and abstracts of the papers on this site.