UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
- URL: http://arxiv.org/abs/2511.18050v1
- Date: Sat, 22 Nov 2025 13:07:21 GMT
- Title: UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
- Authors: Tian Ye, Song Fei, Lei Zhu
- Abstract summary: We introduce UltraFlux, a Flux-based DiT trained at 4K on MultiAspect-4K-1M. On the model side, UltraFlux couples Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics.
- Score: 11.829523789114377
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.
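The abstract names an SNR-Aware Huber Wavelet objective but gives no formula here, so the following is a minimal sketch of what such a loss could look like, assuming a one-level Haar decomposition, per-band Huber terms, and a min-SNR-style timestep weight; the function names, band weights, and clipping constant `gamma` are illustrative assumptions, not UltraFlux's published implementation.

```python
# Illustrative sketch of an SNR-aware Huber wavelet objective.
# The Haar transform, band weights, and min-SNR-style timestep weight
# are assumptions for exposition; UltraFlux's exact formulation may differ.
import torch
import torch.nn.functional as F

def haar_dwt2d(x: torch.Tensor):
    """One-level 2D Haar split of (B, C, H, W) into four subbands."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    return [(a + b + c + d) / 2,   # LL: coarse approximation
            (a - b + c - d) / 2,   # detail subband
            (a + b - c - d) / 2,   # detail subband
            (a - b - c + d) / 2]   # detail subband

def snr_huber_wavelet_loss(pred, target, snr,
                           band_weights=(1.0, 1.0, 1.0, 1.0),
                           delta=0.5, gamma=5.0):
    """Per-band Huber loss reweighted by a clipped-SNR timestep factor.
    snr: per-sample SNR of the noised latent, shape (B,), assumed > 0."""
    # min-SNR-style weight (epsilon-prediction form): down-weights
    # very low-noise (high-SNR) timesteps.
    w_t = torch.clamp(snr, max=gamma) / snr
    loss = pred.new_zeros(())
    for w_b, p, t in zip(band_weights, haar_dwt2d(pred), haar_dwt2d(target)):
        per_sample = F.huber_loss(p, t, delta=delta,
                                  reduction="none").mean(dim=(1, 2, 3))
        loss = loss + w_b * (w_t * per_sample).mean()
    return loss
```

The intent mirrors the abstract's description: the Huber term tempers outlier gradients, the wavelet split lets low- and high-frequency bands be weighted separately, and the clipped-SNR factor rebalances the contribution of different timesteps.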
Related papers
- Scale-DiT: Ultra-High-Resolution Image Generation with Hierarchical Local Attention [50.391914489898774]
Scale-DiT is a new diffusion framework that introduces hierarchical local attention with low-resolution global guidance. A lightweight LoRA adaptation bridges global and local pathways during denoising, ensuring consistency across structure and detail. Experiments demonstrate that Scale-DiT achieves more than $2\times$ faster inference with lower memory usage (a hedged sketch of the local-plus-global attention idea follows this entry).
arXiv Detail & Related papers (2025-10-18T03:15:26Z)
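As a rough illustration of the hierarchical-local-attention idea above, the sketch below lets each fixed-size window attend to its own tokens plus an average-pooled low-resolution copy of the whole grid; the window size, pooling factor, and single-head form are simplifications, not Scale-DiT's architecture, and its LoRA bridging between the pathways is omitted.

```python
# Hedged sketch: windowed self-attention with coarse global guidance.
# Assumes H and W are divisible by both `window` and `pool`.
import torch
import torch.nn.functional as F

def local_attn_with_global_guidance(x, window=8, pool=4):
    """x: (B, H, W, D) token grid. Each window attends to itself plus
    a pooled low-resolution copy of the whole grid."""
    B, H, W, D = x.shape
    # Coarse global tokens via average pooling (the "low-res guidance").
    g = F.avg_pool2d(x.permute(0, 3, 1, 2), pool).flatten(2).transpose(1, 2)
    # Partition the grid into non-overlapping windows.
    xw = x.view(B, H // window, window, W // window, window, D)
    xw = xw.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, D)
    n_win = xw.shape[0] // B
    gw = g.repeat_interleave(n_win, dim=0)  # global tokens per window
    kv = torch.cat([xw, gw], dim=1)         # local + global keys/values
    out = F.scaled_dot_product_attention(xw, kv, kv)
    out = out.reshape(B, H // window, W // window, window, window, D)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, D)
```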
- 4DGCPro: Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming [52.76837132019501]
We introduce 4DGCPro, a novel hierarchical 4D compression framework. 4DGCPro facilitates real-time mobile decoding and high-quality rendering via progressive volumetric video streaming. We present an end-to-end entropy-optimized training scheme.
arXiv Detail & Related papers (2025-09-22T08:38:17Z)
- HiMat: DiT-based Ultra-High Resolution SVBRDF Generation [26.081964370337943]
HiMat is a diffusion-based framework tailored for efficient and diverse 4K SVBRDF generation. CrossStitch is a lightweight convolutional module that enforces cross-map consistency without incurring the cost of global attention.
arXiv Detail & Related papers (2025-08-09T15:16:58Z)
- 4KAgent: Agentic Any Image to 4K Super-Resolution [62.99433518118836]
We present 4KAgent, a super-resolution generalist system designed to upscale any image to 4K resolution (and even higher, if applied iteratively). 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) a Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) a Restoration Agent, which executes the plan, following a quality-driven mixture-of-expert policy to select the optimal output for each step (a hypothetical skeleton of this loop follows this entry). We rigorously evaluate 4KAgent.
arXiv Detail & Related papers (2025-07-09T17:59:19Z)
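The three-component design described above maps naturally onto a small control loop. The skeleton below is a hypothetical rendering of that flow: every class, method, and the `iqa_score` stub are invented names for exposition, not 4KAgent's API.

```python
# Hypothetical profile -> perceive -> restore control loop in the spirit
# of the 4KAgent abstract; all names and stubs here are invented.
from dataclasses import dataclass, field
from typing import Any, Callable

def iqa_score(image: Any) -> float:
    """Stub for a no-reference IQA scorer (the paper uses expert models)."""
    return 0.0

@dataclass
class RestorationPlan:
    steps: list[str] = field(default_factory=list)  # e.g. ["denoise", "sr_x4"]

class PerceptionAgent:
    """Would query a VLM plus IQA experts to diagnose the input; stubbed."""
    def analyze(self, image: Any) -> RestorationPlan:
        return RestorationPlan(steps=["denoise", "sr_x4"])

class RestorationAgent:
    """Executes the plan with a quality-driven mixture-of-experts policy."""
    def __init__(self, experts: dict[str, list[Callable]]):
        self.experts = experts  # candidate restoration models per step

    def execute(self, image: Any, plan: RestorationPlan) -> Any:
        for step in plan.steps:
            # Run every candidate expert, keep the best-scoring output.
            candidates = [run(image) for run in self.experts[step]]
            image = max(candidates, key=iqa_score)
        return image
```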
- Ultra-High-Resolution Image Synthesis: Data, Method and Evaluation [21.46605047406198]
The Aesthetic-4K dataset is curated for comprehensive research on ultra-high-resolution image synthesis. Diffusion-4K is an innovative framework for the direct generation of ultra-high-resolution images.
arXiv Detail & Related papers (2025-06-02T05:19:40Z)
- Scaling Vision Pre-Training to 4K Resolution [120.32767371797578]
We introduce PS3, which scales vision pre-training to 4K resolution at near-constant cost. Instead of contrastive learning on global representations, PS3 is pre-trained by selectively processing local regions. PS3 can both encode the global image at low resolution and selectively process local high-resolution regions (a sketch of this selection follows this entry).
arXiv Detail & Related papers (2025-03-25T17:58:37Z)
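Reading the PS3 entry above, the core mechanism (a cheap global view plus a handful of high-resolution patches) can be sketched as follows; the gradient-energy saliency proxy, patch size, and top-k rule are stand-ins for the paper's learned region selection, not its actual method.

```python
# Hedged sketch: low-res global encoding plus top-k high-res patches.
import torch
import torch.nn.functional as F

def select_hires_patches(image, patch=256, k=8, global_size=512):
    """image: (C, H, W). Returns a low-res global view and the k patches
    whose mean local gradient magnitude is highest."""
    lowres = F.interpolate(image[None], size=(global_size, global_size),
                           mode="bilinear", align_corners=False)[0]
    # Saliency proxy: gradient energy, averaged per patch.
    gray = image.mean(dim=0)
    gx = gray[:, 1:] - gray[:, :-1]
    gy = gray[1:, :] - gray[:-1, :]
    energy = gx[:-1, :].abs() + gy[:, :-1].abs()         # (H-1, W-1)
    per_patch = F.avg_pool2d(energy[None, None], patch)[0, 0]
    flat = per_patch.flatten()
    top = flat.topk(min(k, flat.numel())).indices
    rows, cols = top // per_patch.shape[1], top % per_patch.shape[1]
    patches = [image[:, r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
               for r, c in zip(rows.tolist(), cols.tolist())]
    return lowres, patches
```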
- Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models [21.46605047406198]
Diffusion-4K is a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. We construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We propose a wavelet-based fine-tuning approach for direct training with 4K images, applicable to various latent diffusion models.
arXiv Detail & Related papers (2025-03-24T05:25:07Z)
- Highly Efficient No-reference 4K Video Quality Assessment with Full-Pixel Covering Sampling and Training Strategy [23.61467796740852]
No-reference (NR) VQA methods play a vital role when obtaining reference videos is restricted or infeasible.
As more streaming videos are created in ultra-high definition (e.g., 4K) to enrich viewers' experiences, current deep VQA methods face unacceptable computational costs.
In this paper, we propose a novel and highly efficient NR 4K VQA technology.
arXiv Detail & Related papers (2024-07-30T12:10:33Z)
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD [129.9919468062788]
InternLM-XComposer2-4KHD is a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 x 1600) and beyond.
This research advances the patch division paradigm by introducing a novel extension: dynamic resolution with automatic patch configuration (a sketch of one such configuration rule follows this entry).
Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements.
arXiv Detail & Related papers (2024-04-09T17:59:32Z)
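Dynamic resolution with automatic patch configuration, as described in the InternLM-XComposer2-4KHD entry above, is commonly implemented by choosing the grid of fixed-size tiles that best matches the input aspect ratio under a tile budget. The rule below (336-pixel tiles as in the paper's title, a 25-tile budget, log-AR mismatch) is one plausible recipe, not necessarily the paper's exact algorithm.

```python
# A common recipe for automatic patch configuration: pick the tile grid
# whose aspect ratio best matches the image, under a tile budget.
import math

def best_patch_grid(width: int, height: int, tile: int = 336, max_tiles: int = 25):
    """Return (cols, rows) minimizing aspect-ratio mismatch, preferring
    grids with more tiles on ties, subject to cols * rows <= max_tiles."""
    target_ar = width / height
    best, best_key = (1, 1), (float("inf"), 0)
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            mismatch = abs(math.log((cols / rows) / target_ar))
            key = (mismatch, -(cols * rows))
            if key < best_key:
                best, best_key = (cols, rows), key
    return best  # image is then resized/padded to (cols*tile, rows*tile)

# e.g. a 3840x1600 input (AR 2.4) with a 25-tile budget
print(best_patch_grid(3840, 1600))
```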
- Probabilistic-based Feature Embedding of 4-D Light Fields for Compressive Imaging and Denoising [62.347491141163225]
The 4-D light field (LF) poses great challenges in achieving efficient and effective feature embedding.
We propose a probabilistic-based feature embedding (PFE), which learns a feature embedding architecture by assembling various low-dimensional convolution patterns.
Our experiments demonstrate the significant superiority of our methods on both real-world and synthetic 4-D LF images.
arXiv Detail & Related papers (2023-06-15T03:46:40Z)
- 4K-HAZE: A Dehazing Benchmark with 4K Resolution Hazy and Haze-Free Images [12.402054374952485]
We develop a novel method to simulate 4K hazy images from clear images: it first estimates the scene depth, simulates the light rays and object reflectance, and then migrates the synthetic images to the real domain using a GAN (the underlying scattering model is sketched after this entry).
We wrap these synthesized images into a benchmark called the 4K-HAZE dataset.
The most appealing aspect of our approach is its ability to process a 4K image in real time (33 fps) on a single GPU with 24 GB of memory.
arXiv Detail & Related papers (2023-03-28T09:39:29Z)
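The 4K-HAZE entry above synthesizes haze from estimated depth. The standard atmospheric-scattering model commonly used for this kind of simulation is shown below, with illustrative `beta` and `airlight` values; the paper's full pipeline also adds GAN-based domain transfer, which is omitted here.

```python
# Standard atmospheric-scattering haze model:
#   I = J * t + A * (1 - t),  with transmission t = exp(-beta * depth).
import torch

def synthesize_haze(clear: torch.Tensor, depth: torch.Tensor,
                    beta: float = 1.2, airlight: float = 0.9) -> torch.Tensor:
    """clear: (3, H, W) in [0, 1]; depth: (H, W) scene depth.
    beta controls haze density, airlight the atmospheric light A."""
    t = torch.exp(-beta * depth)[None]        # (1, H, W), broadcast over RGB
    hazy = clear * t + airlight * (1.0 - t)
    return hazy.clamp(0.0, 1.0)
```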