PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
- URL: http://arxiv.org/abs/2411.15867v3
- Date: Fri, 01 Aug 2025 16:25:54 GMT
- Title: PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
- Authors: Teng Zhou, Xiaoyu Zhang, Yongchuan Tang
- Abstract summary: Panoramic Image Generation (PIG) aims to create coherent images of arbitrary lengths. We propose PanoLlama, a novel framework that achieves endless and coherent panorama generation with the autoregressive paradigm.
- Score: 10.970010947605289
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Panoramic Image Generation (PIG) aims to create coherent images of arbitrary lengths. Most existing methods fall under the joint diffusion paradigm, but their complex and heuristic crop-connection designs often limit their ability to achieve multilevel coherence. By deconstructing this challenge into its core components, we find it naturally aligns with next-token prediction, leading us to adopt an autoregressive (AR) paradigm for PIG modeling. However, existing visual AR (VAR) models are limited to fixed-size generation, lacking the capability to produce panoramic images. In this paper, we propose PanoLlama, a novel framework that achieves endless and coherent panorama generation with the autoregressive paradigm. Our approach develops a training-free strategy that uses token redirection to overcome the size limitations of existing VAR models, enabling next-crop prediction in both horizontal and vertical directions. This refreshes the PIG pipeline while achieving SOTA performance in coherence (47.50%), fidelity (28.16%), and aesthetics (15%). Additionally, PanoLlama supports applications that other PIG methods cannot achieve, including mask-free layout control and multi-scale, multi-guidance synthesis. To facilitate standardized evaluation, we also establish a dataset with 1,000 prompts spanning 100+ themes, providing a new testing benchmark for PIG research. The code is available at https://github.com/0606zt/PanoLlama.
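To make next-crop prediction concrete, here is a minimal, hypothetical sketch of a token-redirection loop in the spirit of the abstract: a fixed-size VAR model is reused, training-free, by feeding the rightmost token columns back in as context so the panorama grows one crop at a time. `sample_crop`, the grid sizes, and the overlap width are illustrative stand-ins, not the authors' implementation.

```python
import torch

H, W_CROP, OVERLAP, VOCAB = 16, 16, 8, 4096  # token-grid sizes (assumed)

def sample_crop(context_tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for autoregressive sampling of an H x W_CROP token grid
    conditioned on the redirected context; a real VAR model (e.g. LlamaGen)
    would go here. Random ids keep the sketch runnable."""
    return torch.randint(VOCAB, (H, W_CROP))

def generate_panorama(num_crops: int) -> torch.Tensor:
    panorama = sample_crop(torch.empty(0, dtype=torch.long))  # initial crop
    for _ in range(num_crops - 1):
        # Token redirection: the rightmost OVERLAP columns become the prefix,
        # so the model continues the image instead of starting a new one.
        context = panorama[:, -OVERLAP:]
        new_crop = sample_crop(context.flatten())
        # Keep only the newly generated columns and stitch them on.
        panorama = torch.cat([panorama, new_crop[:, OVERLAP:]], dim=1)
    return panorama

print(generate_panorama(4).shape)  # torch.Size([16, 40]) with these sizes
```

The same loop applied along dim 0 would extend the panorama vertically.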
Related papers
- ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation [64.84095852784714]
Residual Tokenizer (ResTok) is a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens. We show that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps.
arXiv Detail & Related papers (2026-01-07T14:09:18Z) - HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning [66.99487505369254]
HiCoGen is built upon a novel Chain of Synthesis paradigm. It decomposes complex prompts into minimal semantic units, then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
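A rough sketch of what such a chain-of-synthesis loop could look like; the naive comma-based decomposition and the stand-in `generate` function are assumptions for illustration, not HiCoGen's actual components.

```python
from typing import Optional
import numpy as np

def generate(prompt: str, context_image: Optional[np.ndarray]) -> np.ndarray:
    """Stand-in for a text-to-image model conditioned on the prior image;
    it just perturbs the context so the sketch runs end to end."""
    base = context_image if context_image is not None else np.zeros((64, 64, 3))
    return base + 0.1 * np.random.rand(64, 64, 3)

def chain_of_synthesis(complex_prompt: str) -> np.ndarray:
    units = [u.strip() for u in complex_prompt.split(",")]  # naive decomposition
    image = None
    for unit in units:  # each step sees the previous step's output as context
        image = generate(unit, image)
    return image

img = chain_of_synthesis("a red barn, next to a windmill, under a stormy sky")
```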
arXiv Detail & Related papers (2025-11-25T06:24:25Z) - ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View [11.346049532150127]
ARSS is a framework that generates novel views from a single image conditioned on a camera trajectory. Our method performs comparably to, or better than, state-of-the-art view synthesis approaches based on diffusion models.
arXiv Detail & Related papers (2025-09-27T00:03:09Z) - Omnidirectional Spatial Modeling from Correlated Panoramas [4.75637997496421]
Existing omnidirectional methods achieve scene understanding within a single frame while neglecting cross-frame correlated panoramas. We introduce CFpano, the first benchmark dataset dedicated to visual question answering over cross-frame correlated panoramas. We also present a multi-modal large language model (MLLM) fine-tuned with Group Relative Policy Optimization (GRPO) and a set of tailored reward functions for robust and consistent reasoning over cross-frame correlated panoramas.
arXiv Detail & Related papers (2025-09-02T10:14:55Z) - PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter [54.33433051500349]
We propose Point Mamba Adapter (PMA), which constructs an ordered feature sequence from all layers of the pre-trained model. We also propose a geometry-constrained gate prompt generator (G2PG) shared across different layers.
arXiv Detail & Related papers (2025-05-27T09:27:16Z) - Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots [103.48424042986271]
We introduce a new autoregressive design to model a hierarchy from a few low-resolution image tokens to the typical dense image tokens. We present Hierarchical Masked Autoregressive models (Hi-MAR) that pivot on low-resolution image tokens to trigger hierarchical autoregressive modeling.
arXiv Detail & Related papers (2025-05-26T17:59:07Z) - Conditional Panoramic Image Generation via Masked Autoregressive Modeling [35.624070746282186]
We propose a unified framework, Panoramic AutoRegressive model (PAR), which leverages masked autoregressive modeling to address these challenges. To address the inherent discontinuity in existing generative models, we introduce circular padding to enhance spatial coherence. Experiments demonstrate competitive performance in text-to-image generation and panorama outpainting tasks.
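Circular padding is a concrete, reproducible piece of this summary: wrapping the feature map horizontally before a convolution keeps the panorama's left and right edges consistent. A minimal PyTorch sketch of the mechanism (not the authors' code):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 32, 64)               # (batch, channels, H, W) features
# Pad only the width dimension circularly: columns from the right edge wrap
# to the left and vice versa, matching a 3x3 conv's receptive field.
x_wrapped = F.pad(x, (1, 1, 0, 0), mode="circular")
x_wrapped = F.pad(x_wrapped, (0, 0, 1, 1))   # ordinary zero pad for height
conv = torch.nn.Conv2d(8, 8, kernel_size=3, padding=0)
y = conv(x_wrapped)                          # same 32x64 shape, seam-aware
assert y.shape == x.shape
```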
arXiv Detail & Related papers (2025-05-22T16:20:12Z) - Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer.
Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.
On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
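The token-count reduction can be pictured as a pixel-(un)shuffle-style reshape on the token grid: each local window of tokens is folded into one wider token before the Transformer and unfolded afterwards. A rough sketch under that assumption (window size and dimensions are illustrative):

```python
import torch

B, H, W, D, S = 1, 32, 32, 256, 2            # token grid, dim, window size
tokens = torch.randn(B, H, W, D)

# token-shuffle: each SxS neighborhood becomes one token with S*S*D channels,
# shrinking the sequence length from H*W to (H/S)*(W/S).
folded = tokens.reshape(B, H // S, S, W // S, S, D)
folded = folded.permute(0, 1, 3, 2, 4, 5).reshape(B, H // S, W // S, S * S * D)

# token-unshuffle: invert the reshape to restore the full-resolution grid.
restored = folded.reshape(B, H // S, W // S, S, S, D)
restored = restored.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, D)
assert torch.equal(tokens, restored)
```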
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - Frequency Autoregressive Image Generation with Continuous Tokens [31.833852108014312]
We introduce the frequency progressive autoregressive (FAR) paradigm and instantiate FAR with a continuous tokenizer.
We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset.
arXiv Detail & Related papers (2025-03-07T10:34:04Z) - Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [77.86514804787622]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks.
We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation.
We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z) - FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching [34.112157859384645]
We introduce FlowAR, a next scale prediction method featuring a streamlined scale design.
This eliminates the need for VAR's intricate multi-scale residual tokenizer.
We validate the effectiveness of FlowAR on the challenging ImageNet-256 benchmark.
arXiv Detail & Related papers (2024-12-19T18:59:31Z) - M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation [39.97174784206976]
We show that this scale-wise autoregressive framework can be effectively decoupled into intra-scale modeling and inter-scale modeling.
We apply linear-complexity mechanisms like Mamba to substantially reduce computational overhead.
Experiments demonstrate that our method outperforms existing models in both image quality and generation speed.
arXiv Detail & Related papers (2024-11-15T18:54:42Z) - Multi-Scale Diffusion: Enhancing Spatial Layout in High-Resolution Panoramic Image Generation [12.588962705218103]
We introduce the Multi-Scale Diffusion (MSD) framework, a plug-and-play module that extends the existing panoramic image generation framework to multiple resolution levels.
By utilizing gradient descent techniques, our method effectively incorporates structural information from low-resolution images into high-resolution outputs.
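A generic sketch of that kind of gradient-based structural guidance (a simplification, not the authors' exact objective): nudge the high-resolution image so its downsampled version agrees with the low-resolution reference.

```python
import torch
import torch.nn.functional as F

def structure_guidance_step(x_hr: torch.Tensor, x_lr: torch.Tensor,
                            step_size: float = 0.1) -> torch.Tensor:
    x_hr = x_hr.detach().requires_grad_(True)
    down = F.interpolate(x_hr, size=x_lr.shape[-2:], mode="bilinear",
                         align_corners=False)
    loss = F.mse_loss(down, x_lr)             # structural consistency loss
    (grad,) = torch.autograd.grad(loss, x_hr)
    return (x_hr - step_size * grad).detach() # one gradient-descent update

x_lr = torch.rand(1, 3, 64, 256)              # low-res panorama (reference)
x_hr = torch.rand(1, 3, 256, 1024)            # high-res sample being refined
x_hr = structure_guidance_step(x_hr, x_lr)
```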
arXiv Detail & Related papers (2024-10-24T15:18:51Z) - Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective [52.778766190479374]
Latent-based image generative models have achieved notable success in image generation tasks.
Despite sharing the same latent space, autoregressive models significantly lag behind LDMs and MIMs in image generation.
We propose a simple but effective discrete image tokenizer to stabilize the latent space for image generative modeling.
arXiv Detail & Related papers (2024-10-16T12:13:17Z) - ImageFolder: Autoregressive Image Generation with Folded Tokens [51.815319504939396]
Increasing token length is a common approach to improve the image reconstruction quality.
There exists a trade-off between reconstruction and generation quality regarding token length.
We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling.
arXiv Detail & Related papers (2024-10-02T17:06:39Z) - Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z) - MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z) - Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining [48.98105914356609]
Lumina-mGPT is a family of multimodal autoregressive models capable of various vision and language tasks.
We introduce Ominiponent Supervised Finetuning, transforming Lumina-mGPT into a foundation model that seamlessly achieves omnipotent task unification.
arXiv Detail & Related papers (2024-08-05T17:46:53Z) - LSReGen: Large-Scale Regional Generator via Backward Guidance Framework [12.408195812609042]
Controllable image generation remains a challenge.
Current methods, such as training, forward guidance, and backward guidance, have notable limitations.
We propose a novel controllable generation framework that offers a generalized interpretation of backward guidance.
We introduce LSReGen, a large-scale layout-to-image method designed to generate high-quality, layout-compliant images.
arXiv Detail & Related papers (2024-07-21T05:44:46Z) - Obtaining Favorable Layouts for Multiple Object Generation [50.616875565173274]
Large-scale text-to-image models can generate high-quality and diverse images based on textual prompts.
However, the existing state-of-the-art diffusion models face difficulty when generating images that involve multiple subjects.
We propose a novel approach based on a guiding principle: we allow the diffusion model to initially propose a layout, and then rearrange the layout grid.
This is achieved by enforcing cross-attention maps (XAMs) to adhere to proposed masks and by migrating pixels from latent maps to new locations that we determine.
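A hedged sketch of the mask-adherence idea: score how much of each subject's cross-attention map (XAM) falls inside its proposed mask, yielding a loss one could descend on. The loss form and all names are assumptions, not the paper's exact formulation.

```python
import torch

def mask_adherence_loss(xams: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """xams, masks: (num_subjects, H, W); each XAM assumed to sum to 1."""
    inside = (xams * masks).sum(dim=(1, 2))   # attention mass inside each mask
    return ((1.0 - inside) ** 2).mean()       # penalize mass leaking outside

xams = torch.softmax(torch.randn(2, 16 * 16), dim=-1).reshape(2, 16, 16)
masks = torch.zeros(2, 16, 16)
masks[0, :, :8] = 1.0                          # subject 0: left half
masks[1, :, 8:] = 1.0                          # subject 1: right half
print(mask_adherence_loss(xams, masks))
```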
arXiv Detail & Related papers (2024-05-01T18:07:48Z) - Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images.
We present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images.
We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
arXiv Detail & Related papers (2024-04-03T23:20:40Z) - MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation [34.61940502872307]
MultiDiffusion is a unified framework that enables versatile and controllable image generation.
We show that MultiDiffusion can be readily applied to generate high quality and diverse images.
arXiv Detail & Related papers (2023-02-16T06:28:29Z) - Progressive Text-to-Image Generation [40.09326229583334]
We present a progressive model for high-fidelity text-to-image generation.
The proposed method takes effect by creating new image tokens from coarse to fine based on the existing context.
The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable.
arXiv Detail & Related papers (2022-10-05T14:27:20Z) - Multiscale Latent-Guided Entropy Model for LiDAR Point Cloud Compression [18.897023700334458]
The non-uniform distribution and extremely sparse nature of the LiDAR point cloud (LPC) bring significant challenges to its high-efficient compression.
This paper proposes a novel end-to-end, fully-factorized deep framework that encodes the original LPC into an octree structure and hierarchically decomposes the octree entropy model in layers.
arXiv Detail & Related papers (2022-09-26T08:36:11Z) - GLEAN: Generative Latent Bank for Image Super-Resolution and Beyond [99.6233044915999]
We show that pre-trained Generative Adversarial Networks (GANs) such as StyleGAN and BigGAN can be used as a latent bank to improve the performance of image super-resolution.
Our method, Generative LatEnt bANk (GLEAN), goes beyond existing practices by directly leveraging rich and diverse priors encapsulated in a pre-trained GAN.
We extend our method to different tasks including image colorization and blind image restoration, and extensive experiments show that our proposed models perform favorably in comparison to existing methods.
arXiv Detail & Related papers (2022-07-29T17:59:01Z) - A new perspective on probabilistic image modeling [92.89846887298852]
We present Deep Convolutional Gaussian Mixture Models (DCGMMs), a new probabilistic approach for image modeling capable of density estimation, sampling and tractable inference.
DCGMMs can be trained end-to-end by SGD from random initial conditions, much like CNNs.
We show that DCGMMs compare favorably to several recent PC and SPN models in terms of inference, classification and sampling.
arXiv Detail & Related papers (2022-03-21T14:53:57Z) - A Generic Approach for Enhancing GANs by Regularized Latent Optimization [79.00740660219256]
We introduce a generic framework called generative-model inference that is capable of enhancing pre-trained GANs effectively and seamlessly.
Our basic idea is to efficiently infer the optimal latent distribution for the given requirements using Wasserstein gradient flow techniques.
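As a loose, particle-based sketch of such latent inference (plain gradient steps are a common discretization of a gradient flow; the energy terms and the stand-in generator below are assumptions):

```python
import torch

gen = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 64))   # stand-in "generator"
target = torch.randn(64)                             # the given requirement
z = torch.randn(128, 16, requires_grad=True)         # latent particles
opt = torch.optim.SGD([z], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    task = ((gen(z) - target) ** 2).mean()    # fit the requirement
    prior = 0.01 * (z ** 2).mean()            # stay near the N(0, I) prior
    (task + prior).backward()
    opt.step()                                # one step of the discretized flow
```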
arXiv Detail & Related papers (2021-12-07T05:22:50Z) - Unsupervised Cycle-consistent Generative Adversarial Networks for Pan-sharpening [41.68141846006704]
We propose an unsupervised generative adversarial framework that learns from the full-scale images without the ground truths to alleviate this problem.
We extract the modality-specific features from the PAN and MS images with a two-stream generator, perform fusion in the feature domain, and then reconstruct the pan-sharpened images.
Results demonstrate that the proposed method can greatly improve the pan-sharpening performance on the full-scale images.
arXiv Detail & Related papers (2021-09-20T09:43:24Z) - Locally Masked Convolution for Autoregressive Models [107.4635841204146]
LMConv is a simple modification to the standard 2D convolution that allows arbitrary masks to be applied to the weights at each location in the image.
We learn an ensemble of distribution estimators that share parameters but differ in generation order, achieving improved performance on whole-image density estimation.
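A compact re-implementation of the locally masked convolution idea via im2col, under stated simplifications (not the paper's code): extract patches, zero out masked taps per location, then apply the shared weights.

```python
import torch
import torch.nn.functional as F

def locally_masked_conv2d(x, weight, masks):
    """x: (B, Cin, H, W); weight: (Cout, Cin, k, k);
    masks: (B, Cin*k*k, H*W), one 0/1 weight mask per output location."""
    B, _, H, W = x.shape
    k = weight.shape[-1]
    patches = F.unfold(x, k, padding=k // 2)             # (B, Cin*k*k, H*W)
    patches = patches * masks                            # per-location masking
    out = weight.reshape(weight.shape[0], -1) @ patches  # shared weights
    return out.reshape(B, -1, H, W)

x = torch.randn(2, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
m = (torch.rand(2, 3 * 9, 64) > 0.5).float()             # arbitrary masks
print(locally_masked_conv2d(x, w, m).shape)              # torch.Size([2, 4, 8, 8])
```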
arXiv Detail & Related papers (2020-06-22T17:59:07Z)