Related papers: LSSGen: Leveraging Latent Space Scaling in Flow and Diffusion for Efficient Text to Image Generation

ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models [33.09645476860831]
We propose ImageRAGTurbo, a novel approach to efficiently finetune few-step diffusion models via retrieval augmentation. Given a text prompt, we retrieve relevant text-image pairs from a database and use them to condition the generation process. Experiments show that our approach produces high-fidelity images without compromising latency compared to existing methods.
arXiv Detail & Related papers (2026-02-13T05:59:57Z)
Multi-Scale Local Speculative Decoding for Image Generation [10.239314110594249]
We introduce Multi-Scale Local Speculative Decoding (MuLo-SD). MuLo-SD combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. We demonstrate that MuLo-SD achieves substantial speedups up to $\mathbf{1.7\times}$.
arXiv Detail & Related papers (2026-01-08T17:39:35Z)
Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation [87.00172597953228]
Speculative decoding has shown promise in accelerating text generation without compromising quality. We introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models.
arXiv Detail & Related papers (2025-10-29T17:43:31Z)
InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis [51.81849724354083]
Current diffusion models increase computational demand quadratically with resolution, causing 4K image generation delays over 100 seconds. We propose to decode arbitrary resolution images with a compact generated latent using a one-step generator. InfGen is capable of improving many models into the arbitrary high-resolution era while cutting 4K image generation time to under 10 seconds.
arXiv Detail & Related papers (2025-09-12T17:48:57Z)
SODiff: Semantic-Oriented Diffusion Model for JPEG Compression Artifacts Removal [50.90827365790281]
SODiff is a semantic-oriented one-step diffusion model for JPEG artifacts removal. Our core idea is that effective restoration hinges on providing semantic-oriented guidance to the pre-trained diffusion model. SAIPE extracts rich features from low-quality (LQ) images and projects them into an embedding space semantically aligned with that of the text encoder.
arXiv Detail & Related papers (2025-08-10T13:48:07Z)
HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation [91.08481618973111]
Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR) to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor.
arXiv Detail & Related papers (2025-06-04T20:08:07Z)
Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer. Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis. On GenAI-benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation [62.77721499671665]
We introduce GigaTok, the first approach to improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
arXiv Detail & Related papers (2025-04-11T17:59:58Z)
Multi-Scale Invertible Neural Network for Wide-Range Variable-Rate Learned Image Compression [90.59962443790593]
In this paper, we present a variable-rate image compression model based on invertible transform to overcome limitations. Specifically, we design a lightweight multi-scale invertible neural network, which maps the input image into multi-scale latent representations. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods.
arXiv Detail & Related papers (2025-03-27T09:08:39Z)
NFIG: Autoregressive Image Generation with Next-Frequency Prediction [50.69346038028673]
We present Next-Frequency Image Generation (NFIG), a novel framework that decomposes the image generation process into multiple frequency-guided stages. Our approach first generates low-frequency components to establish global structure with fewer tokens, then progressively adds higher-frequency details, following the natural spectral hierarchy of images.
arXiv Detail & Related papers (2025-03-10T08:59:10Z)
Improving the Diffusability of Autoencoders [54.920783089085035]
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos. We perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality.
arXiv Detail & Related papers (2025-02-20T18:45:44Z)
Fast constrained sampling in pre-trained diffusion models [80.99262780028015]
We propose an algorithm that enables fast, high-quality generation under arbitrary constraints. Our approach produces results that rival or surpass the state-of-the-art training-free inference methods.
arXiv Detail & Related papers (2024-10-24T14:52:38Z)
Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.57727062920458]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image to a level comparable with state-of-the-art diffusion models like SDXL. We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers. Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z)
Timestep-Aware Diffusion Model for Extreme Image Rescaling [47.89362819768323]
We propose a novel framework called Timestep-Aware Diffusion Model (TADM) for extreme image rescaling. TADM performs rescaling operations in the latent space of a pre-trained autoencoder. It effectively leverages powerful natural image priors learned by a pre-trained text-to-image diffusion model.
arXiv Detail & Related papers (2024-08-17T09:51:42Z)
HyperSpace: Hypernetworks for spacing-adaptive image segmentation [0.05958478403940788]
We propose to condition segmentation models on the voxel spacing using hypernetworks. Our approach allows processing images at their native resolutions or at resolutions adjusted to the hardware and time constraints at inference time.
arXiv Detail & Related papers (2024-07-04T07:09:23Z)
Image-GS: Content-Adaptive Image Representation via 2D Gaussians [52.598772767324036]
We introduce Image-GS, a content-adaptive image representation based on 2D Gaussians radiance. It supports hardware-friendly rapid access for real-time usage, requiring only 0.3K MACs to decode a pixel. We demonstrate its versatility with several applications, including texture compression, semantics-aware compression, and joint image compression and restoration.
arXiv Detail & Related papers (2024-07-02T00:45:21Z)
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization [26.926712014346432]
This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization. Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, setting new state-of-the-art FID scores of 1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512.
arXiv Detail & Related papers (2024-06-13T17:59:58Z)
Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder [29.924160271522354]
Super-resolution (SR) and image generation are important tasks in computer vision and are widely adopted in real-world applications. Most existing methods, however, generate images only at fixed-scale magnification and suffer from over-smoothing and artifacts. Most relevant work applied Implicit Neural Representation (INR) to the denoising diffusion model to obtain continuous-resolution yet diverse and high-quality SR results. We propose a novel pipeline that can super-resolve an input image or generate from a random noise a novel image at arbitrary scales.
arXiv Detail & Related papers (2024-03-15T12:45:40Z)
Efficient texture-aware multi-GAN for image inpainting [5.33024001730262]
Recent inpainting methods based on generative adversarial networks (GANs) show remarkable improvements. We propose a multi-GAN architecture improving both the performance and rendering efficiency.
arXiv Detail & Related papers (2020-09-30T14:58:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.