Related papers: OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation

URL: http://arxiv.org/abs/2511.20211v1
Date: Tue, 25 Nov 2025 11:34:51 GMT
Title: OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation
Authors: Hao Yu, Jiabo Zhan, Zile Wang, Jinglin Wang, Huaisong Zhang, Hongyu Li, Xinrui Chen, Yongxian Wei, Chun Yuan
Abstract summary: We propose OmniAlpha, the first unified, multi-task generative framework for sequence-to-sequence RGBA image generation and editing. Our work proves that a unified, multi-task model can learn a superior shared representation for RGBA, paving the way for more powerful, layer-aware generative systems.
Score: 43.93970229518124
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generative models have excelled in RGB synthesis, but real-world applications require RGBA manipulation. This has led to a fragmented landscape: specialized, single-task models handle alpha but lack versatility, while unified multi-task frameworks are confined to the RGB domain. To bridge this critical gap, we propose OmniAlpha, the first unified, multi-task generative framework for sequence-to-sequence RGBA image generation and editing. Its architecture features MSRoPE-BiL, a novel RoPE method with a bi-directionally extendable layer axis for its Diffusion Transformer (DiT) backbone, enabling the concurrent processing of multiple input and target RGBA layers. To power this framework, we introduce AlphaLayers, a new dataset of 1,000 high-quality, multi-layer triplets, built via a novel automated synthesis and filter pipeline. Jointly training OmniAlpha on this dataset across a comprehensive suite of 21 diverse tasks, extensive experiments demonstrate that our unified approach consistently outperforms strong, specialized baselines. Most notably, OmniAlpha achieves a dramatic 84.8% relative reduction in SAD for mask-free matting on AIM-500 and wins over 90% of human preferences in layer-conditioned completion. Our work proves that a unified, multi-task model can learn a superior shared representation for RGBA, paving the way for more powerful, layer-aware generative systems.

Related papers

Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition [73.43121650616804]
We propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers. Our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing.
arXiv Detail & Related papers (2025-12-17T17:12:42Z)
UniLayDiff: A Unified Diffusion Transformer for Content-Aware Layout Generation [54.38636515750502]
We propose UniLayDiff: a Unified Diffusion Transformer for content-aware layout generation tasks. We employ a Multi-Modal Diffusion Transformer framework to capture the complex interplay between the background image, layout elements, and diverse constraints. Experiments demonstrate that UniLayDiff achieves state-of-the-art performance across tasks ranging from unconditional to various conditional generation.
arXiv Detail & Related papers (2025-12-09T18:38:44Z)
UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment [22.51114099598294]
Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. UniFit is a universal VTON framework driven by a Multimodal Large Language Model (MLLM). UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-11-19T19:38:44Z)
HyPSAM: Hybrid Prompt-driven Segment Anything Model for RGB-Thermal Salient Object Detection [75.406055413928]
We propose a novel prompt-driven segment anything model (HyPSAM) for RGB-T SOD. DFNet employs dynamic convolution and multi-branch decoding to facilitate adaptive cross-modality interaction. Experiments on three public datasets demonstrate that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-09-23T07:32:11Z)
AlphaVAE: Unified End-to-End RGBA Image Reconstruction and Generation with Alpha-Aware Representation Learning [32.798523698352916]
We propose ALPHA, the first comprehensive RGBA benchmark that adapts standard RGB metrics to four-channel images via alpha blending over canonical backgrounds. We further introduce ALPHAVAE, a unified end-to-end RGBA VAE that extends a pretrained RGB VAE by incorporating a dedicated alpha channel. Our RGBA VAE, trained on only 8K images in contrast to 1M used by prior methods, achieves a +4.9 dB improvement in PSNR and a +3.2% increase in SSIM over LayerDiffuse in reconstruction.
arXiv Detail & Related papers (2025-07-12T14:53:42Z)
PSDiffusion: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment [23.67447416568964]
Transparent image layer generation plays a significant role in digital art and design. Existing methods typically decompose transparent layers from a single RGB image using a set of tools or generate multiple transparent layers sequentially. We propose PSDiffusion, a unified diffusion framework that leverages image composition priors from a pre-trained image diffusion model for simultaneous multi-layer text-to-image generation.
arXiv Detail & Related papers (2025-05-16T17:23:35Z)
Efficient Multi-Instance Generation with Janus-Pro-Dirven Prompt Parsing [53.295515505026096]
Janus-Pro-driven Prompt Parsing is a prompt-parsing module that bridges text understanding and layout generation. MIGLoRA is a parameter-efficient plug-in integrating Low-Rank Adaptation into UNet (SD1.5) and DiT (SD3) backbones. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency.
arXiv Detail & Related papers (2025-03-27T00:59:14Z)
MMGen: Unified Multi-modal Image Generation and Understanding in One Go [60.97155790727879]
We introduce MMGen, a unified framework that integrates multiple generative tasks into a single diffusion model. Our approach develops a novel diffusion transformer that flexibly supports multi-modal output, along with a simple modality-decoupling strategy.
arXiv Detail & Related papers (2025-03-26T15:37:17Z)
TransPixeler: Advancing Text-to-Video Generation with Transparency [43.6546902960154]
We introduce TransPixeler, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
arXiv Detail & Related papers (2025-01-06T13:32:16Z)
SSFam: Scribble Supervised Salient Object Detection Family [13.369217449092524]
Scribble supervised salient object detection (SSSOD) learns to segment salient objects from their surroundings under the supervision of sparse scribble labels. For better segmentation, depth and thermal infrared modalities supplement RGB images in complex scenes. Our model demonstrates remarkable performance across combinations of different modalities and sets a new state of the art among scribble supervised methods.
arXiv Detail & Related papers (2024-09-07T13:07:59Z)
MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation [54.64194935409982]
We introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer-wise RGBA decompositions. MuLAn is the first photorealistic resource providing instance decomposition and spatial information for high quality images. We aim to encourage the development of novel generation and editing technology, in particular layer-wise solutions.
arXiv Detail & Related papers (2024-04-03T14:58:00Z)
Unifying Voxel-based Representation with Transformer for 3D Object Detection [143.91910747605107]
We present a unified framework for multi-modality 3D object detection, named UVTR. The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection. UVTR achieves leading performance in the nuScenes test set with 69.7%, 55.1%, and 71.1% NDS for LiDAR, camera, and multi-modality inputs, respectively.
arXiv Detail & Related papers (2022-06-01T17:02:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.