Next Visual Granularity Generation
- URL: http://arxiv.org/abs/2508.12811v1
- Date: Mon, 18 Aug 2025 10:47:37 GMT
- Title: Next Visual Granularity Generation
- Authors: Yikai Wang, Zhouxia Wang, Zhonghua Wu, Qingyi Tao, Kang Liao, Chen Change Loy
- Abstract summary: We propose a novel approach to image generation by decomposing an image into a structured sequence. The Next Visual Granularity (NVG) generation framework generates a visual granularity sequence. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior.
- Score: 58.22272282205028
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different levels of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. Compared to the VAR series, NVG consistently outperforms it in terms of FID scores (3.30 -> 3.03, 2.57 -> 2.44, 2.09 -> 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.
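The abstract's key idea, a sequence whose elements all share the image's spatial resolution but are limited to progressively more unique tokens, can be illustrated with a minimal sketch. This is an assumption-laden toy (the function name, the level schedule, and the uniform-bin quantization are all placeholders, not the paper's actual tokenizer or training procedure); it only demonstrates the "same resolution, growing token budget" structure:

```python
import numpy as np

def granularity_sequence(image_tokens, levels=(1, 4, 16, 64)):
    """Toy coarse-to-fine sequence: every element keeps the full spatial
    resolution, but element i may use at most levels[i] unique values.
    Fewer unique values ~ coarser layout; more ~ finer detail."""
    seq = []
    lo, hi = image_tokens.min(), image_tokens.max()
    for k in levels:
        # Quantize values into k uniform bins, then replace each value
        # with its bin center -> at most k unique values, same shape.
        bins = np.linspace(lo, hi, k + 1)
        idx = np.clip(np.digitize(image_tokens, bins) - 1, 0, k - 1)
        centers = (bins[:-1] + bins[1:]) / 2
        seq.append(centers[idx])
    return seq

# A random 16x16 token map stands in for an encoded image.
tokens = np.random.default_rng(0).integers(0, 256, size=(16, 16)).astype(float)
seq = granularity_sequence(tokens)
for k, level_map in zip((1, 4, 16, 64), seq):
    assert level_map.shape == tokens.shape       # spatial resolution preserved
    assert len(np.unique(level_map)) <= k        # token budget respected
```

A generator in this spirit would predict each element of the sequence conditioned on the coarser ones, refining from global layout toward fine detail.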
Related papers
- UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing [33.64590153603506]
We present UniRef-Image-Edit, a high-performance multi-modal generation system. It unifies single-image editing and multi-image composition within a single framework.
arXiv Detail & Related papers (2026-02-15T15:24:03Z) - NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation [66.92488610008519]
NextFlow is a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow activates multimodal understanding and generation capabilities. NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
arXiv Detail & Related papers (2026-01-05T15:27:04Z) - GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation [77.13582457917418]
We train a generative model solely on grid images comprising subsampled frames. We learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. Our method consistently outperforms SoTA in quality and inference speed (at least twice as fast) across datasets.
arXiv Detail & Related papers (2025-12-24T16:46:04Z) - TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement [87.82338951215131]
TokenAR is a simple but effective token-level enhancement mechanism to address the reference identity confusion problem. Instruct Token Injection serves as an extra visual feature container to inject detailed and complementary priors for reference tokens. The identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each identity.
arXiv Detail & Related papers (2025-10-18T03:36:26Z) - MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization [43.12251414524675]
Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. We propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning.
arXiv Detail & Related papers (2025-04-01T17:39:19Z) - NovelGS: Consistent Novel-view Denoising via Large Gaussian Reconstruction Model [57.92709692193132]
NovelGS is a diffusion model for Gaussian Splatting given sparse-view images.
We leverage novel-view denoising through a transformer-based network to generate 3D Gaussians.
arXiv Detail & Related papers (2024-11-25T07:57:17Z) - AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity [85.44800864697464]
We introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction. Extensive experiments and analysis show that AVG-LLaVA achieves superior performance across 11 benchmarks.
arXiv Detail & Related papers (2024-09-20T10:50:21Z) - Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks [53.67497327319569]
We introduce a novel neural rendering technique to solve image-to-3D from a single view.
Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks.
Our experiments show the advantages of our proposed approach with consistent results and rapid generation.
arXiv Detail & Related papers (2023-12-24T08:42:37Z) - Kosmos-G: Generating Images in Context with Multimodal Large Language Models [117.0259361818715]
Current subject-driven image generation methods require test-time tuning and cannot accept interleaved multi-image and text input.
This paper presents Kosmos-G, a model that leverages the advanced multimodal perception capabilities of Multimodal Large Language Models.
Kosmos-G demonstrates an impressive capability of zero-shot subject-driven generation with interleaved multi-image and text input.
arXiv Detail & Related papers (2023-10-04T17:28:44Z) - Transformer-based Image Generation from Scene Graphs [11.443097632746763]
Graph-structured scene descriptions can be efficiently used in generative models to control the composition of the generated image.
Previous approaches are based on the combination of graph convolutional networks and adversarial methods for layout prediction and image generation.
We show how employing multi-head attention to encode the graph information can improve the quality of the sampled data.
arXiv Detail & Related papers (2023-03-08T14:54:51Z) - Progressive Text-to-Image Generation [40.09326229583334]
We present a progressive model for high-fidelity text-to-image generation.
The proposed method takes effect by creating new image tokens from coarse to fine based on the existing context.
The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable.
arXiv Detail & Related papers (2022-10-05T14:27:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.