PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
- URL: http://arxiv.org/abs/2411.15867v1
- Date: Sun, 24 Nov 2024 15:06:57 GMT
- Title: PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
- Authors: Teng Zhou, Xiaoyu Zhang, Yongchuan Tang
- Abstract summary: We introduce PanoLlama, a novel framework that redefines panoramic image generation as a next-token prediction task.
Building on the pre-trained LlamaGen architecture, we generate images in an autoregressive manner and develop an expansion strategy to handle size limitations.
This method aligns with the image token structure in a crop-wise and training-free manner, resulting in high-quality panoramas with minimal seams and maximum scalability.
- Score: 10.970010947605289
- Abstract: Panoramic Image Generation has emerged as an important task in image generation, driven by growing demands for large-scale visuals in creative and technical applications. While diffusion models have dominated this field, they face inherent limitations, including the multilevel-coherence challenge and implementation complexity, leading to suboptimal outcomes. In this paper, we introduce PanoLlama, a novel framework that redefines panoramic image generation as a next-token prediction task. Building on the pre-trained LlamaGen architecture, we generate images in an autoregressive manner and develop an expansion strategy to handle size limitations. This method aligns with the image token structure in a crop-wise and training-free manner, resulting in high-quality panoramas with minimal seams and maximum scalability. PanoLlama demonstrates its effectiveness and versatility in our experiments, achieving the best overall performance while offering flexibility for multi-scale, multi-layout, and multi-guidance generation. It overcomes the challenges that diffusion-based methods fail to address, setting a new paradigm for panoramic image generation tasks. Code is available at https://github.com/0606zt/PanoLlama.
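To make the core idea concrete, the following is a minimal, hypothetical sketch of crop-wise, training-free panorama expansion via next-token prediction. `TokenModel`, `extend_right`, and all shapes are illustrative stand-ins for a pre-trained tokenizer plus autoregressive transformer such as LlamaGen; the actual implementation is in the linked repository.

```python
# Minimal sketch of crop-wise, training-free panorama expansion via
# next-token prediction. TokenModel is a hypothetical stand-in for a
# pre-trained tokenizer + autoregressive transformer (e.g. LlamaGen);
# the real PanoLlama code lives at the repository linked above.
import torch

class TokenModel:
    """Stand-in: predicts logits over a codebook for the next token."""
    vocab_size = 16384

    def next_token_logits(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: flattened raster-order token ids seen so far.
        return torch.randn(self.vocab_size)  # placeholder logits

def extend_right(grid: torch.Tensor, model: TokenModel,
                 overlap: int, new_cols: int) -> torch.Tensor:
    """Extend an (H, W) token grid rightward by new_cols columns,
    reusing the last `overlap` columns as context so the seam between
    old and new content stays coherent."""
    H, W = grid.shape
    out = torch.zeros(H, overlap + new_cols, dtype=grid.dtype)
    out[:, :overlap] = grid[:, W - overlap:]      # context crop
    for r in range(H):                            # raster order
        for c in range(overlap, overlap + new_cols):
            prefix = out[: r + 1].flatten()[: r * (overlap + new_cols) + c]
            logits = model.next_token_logits(prefix)
            out[r, c] = torch.distributions.Categorical(logits=logits).sample()
    return torch.cat([grid, out[:, overlap:]], dim=1)

pano = torch.randint(0, TokenModel.vocab_size, (16, 16))  # seed token grid
pano = extend_right(pano, TokenModel(), overlap=8, new_cols=16)
print(pano.shape)  # torch.Size([16, 32]); repeat for an endless panorama
```

Because each batch of new tokens is conditioned on an overlapping crop of existing tokens, the expansion can be repeated indefinitely without retraining, which is what enables the "endless" generation the abstract claims.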
Related papers
- Multi-Scale Diffusion: Enhancing Spatial Layout in High-Resolution Panoramic Image Generation [12.588962705218103]
We introduce the Multi-Scale Diffusion (MSD) framework, a plug-and-play module that extends the existing panoramic image generation framework to multiple resolution levels.
By utilizing gradient descent techniques, our method effectively incorporates structural information from low-resolution images into high-resolution outputs.
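A hedged sketch of this gradient-based mechanism, under our own assumptions (the plain L2 loss and step size are illustrative, not MSD's exact formulation): nudge a high-resolution latent so that its downsampled version agrees with a low-resolution structural reference.

```python
# Hedged sketch of gradient-based structure guidance: pull a
# high-resolution latent toward a low-resolution structural reference.
# The L2 loss and step size are illustrative assumptions.
import torch
import torch.nn.functional as F

def structure_guidance_step(hi_latent: torch.Tensor,
                            lo_latent: torch.Tensor,
                            step_size: float = 0.1) -> torch.Tensor:
    """One gradient-descent step on a (B, C, H, W) latent."""
    hi = hi_latent.detach().requires_grad_(True)
    down = F.interpolate(hi, size=lo_latent.shape[-2:],
                         mode="bilinear", align_corners=False)
    loss = F.mse_loss(down, lo_latent)   # match low-res structure
    loss.backward()
    return (hi - step_size * hi.grad).detach()

hi = torch.randn(1, 4, 128, 128)   # high-resolution diffusion latent
lo = torch.randn(1, 4, 32, 32)     # low-resolution structural reference
hi = structure_guidance_step(hi, lo)
```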
arXiv Detail & Related papers (2024-10-24T15:18:51Z)
- ImageFolder: Autoregressive Image Generation with Folded Tokens [51.815319504939396]
Increasing token length is a common approach to improving image reconstruction quality.
However, there is a trade-off between reconstruction and generation quality with respect to token length.
We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling.
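One plausible reading of "folding", sketched under our own assumptions (the `fold`/`unfold` helpers are hypothetical, not ImageFolder's API): two spatially aligned token streams are stacked per position, so the autoregressive sequence length stays at H*W instead of doubling.

```python
# Illustrative sketch of folding two spatially aligned token streams so
# the autoregressive sequence stays short; not ImageFolder's actual API.
import torch

def fold(sem: torch.Tensor, det: torch.Tensor) -> torch.Tensor:
    """Pair a semantic and a detail token map (each (H*W,)) into one
    sequence of length H*W whose entries are token pairs."""
    return torch.stack([sem, det], dim=-1)   # (H*W, 2)

def unfold(folded: torch.Tensor):
    return folded[:, 0], folded[:, 1]

sem = torch.randint(0, 4096, (256,))   # 16x16 semantic tokens
det = torch.randint(0, 4096, (256,))   # 16x16 detail tokens
pairs = fold(sem, det)                 # folded sequence for AR modeling
s2, d2 = unfold(pairs)                 # unfolded again for reconstruction
assert torch.equal(s2, sem) and torch.equal(d2, det)
```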
arXiv Detail & Related papers (2024-10-02T17:06:39Z)
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z)
- Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining [48.98105914356609]
Lumina-mGPT is a family of multimodal autoregressive models capable of various vision and language tasks.
We introduce Omnipotent Supervised Finetuning, transforming Lumina-mGPT into a foundation model that seamlessly achieves omnipotent task unification.
arXiv Detail & Related papers (2024-08-05T17:46:53Z)
- LSReGen: Large-Scale Regional Generator via Backward Guidance Framework [12.408195812609042]
Controllable image generation remains a challenge.
Current methods, whether based on training, forward guidance, or backward guidance, have notable limitations.
We propose a novel controllable generation framework that offers a generalized interpretation of backward guidance.
We introduce LSReGen, a large-scale layout-to-image method designed to generate high-quality, layout-compliant images.
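A hedged sketch of the backward-guidance pattern named above: differentiate a layout loss with respect to the latent at each denoising step and move against the gradient. The box-mask energy here is a toy stand-in for LSReGen's actual objective.

```python
# Hedged sketch of backward guidance: differentiate a layout loss with
# respect to the latent and step against the gradient. The box-mask
# energy is a toy stand-in for LSReGen's actual objective.
import torch

def layout_loss(latent: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Encourage activation inside the layout mask, suppress it outside.
    inside = (latent.abs() * mask).mean()
    outside = (latent.abs() * (1 - mask)).mean()
    return outside - inside

def backward_guidance(latent: torch.Tensor, mask: torch.Tensor,
                      scale: float = 0.5) -> torch.Tensor:
    z = latent.detach().requires_grad_(True)
    grad, = torch.autograd.grad(layout_loss(z, mask), z)
    return (z - scale * grad).detach()

z = torch.randn(1, 4, 64, 64)       # noisy latent at some denoising step
box = torch.zeros(1, 1, 64, 64)
box[..., 16:48, 16:48] = 1.0        # target region for one object
z = backward_guidance(z, box)       # applied once per denoising step
```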
arXiv Detail & Related papers (2024-07-21T05:44:46Z)
- Obtaining Favorable Layouts for Multiple Object Generation [50.616875565173274]
Large-scale text-to-image models can generate high-quality and diverse images based on textual prompts.
However, the existing state-of-the-art diffusion models face difficulty when generating images that involve multiple subjects.
We propose a novel approach based on a guiding principle: the diffusion model first proposes a layout, which we then rearrange into a more favorable grid.
This is achieved by enforcing cross-attention maps (XAMs) to adhere to the proposed masks and by migrating pixels in the latent maps to the newly determined locations.
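A toy sketch of the XAM-enforcement step as we read it (the blending rule and the `enforce_xam` helper are our assumptions, not the paper's exact rule): blend each subject token's attention map toward its layout mask, then renormalize.

```python
# Toy sketch of enforcing cross-attention maps (XAMs) to follow layout
# masks; the blending rule is our assumption, not the paper's.
import torch

def enforce_xam(attn: torch.Tensor, masks: torch.Tensor,
                strength: float = 0.8) -> torch.Tensor:
    """attn, masks: (tokens, H, W); masks are binary, one per subject."""
    guided = (1 - strength) * attn + strength * attn * masks
    return guided / guided.sum(dim=(1, 2), keepdim=True).clamp(min=1e-8)

attn = torch.rand(2, 32, 32)
attn = attn / attn.sum(dim=(1, 2), keepdim=True)   # normalized XAMs
masks = torch.zeros(2, 32, 32)
masks[0, :, :16] = 1.0   # subject 0 confined to the left half
masks[1, :, 16:] = 1.0   # subject 1 confined to the right half
attn = enforce_xam(attn, masks)
```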
arXiv Detail & Related papers (2024-05-01T18:07:48Z)
- Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images.
We present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images.
We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
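In outline, the autoregressive-over-images pattern looks like the following hedged sketch, where `denoise` is a crude placeholder for a full conditional diffusion sampler rather than M2M's actual model.

```python
# Hedged outline of autoregressive generation over an image series:
# each new image is produced by a sampler conditioned on the images
# generated so far. `denoise` is a placeholder, not M2M itself.
import torch

def denoise(noise: torch.Tensor, context: list[torch.Tensor]) -> torch.Tensor:
    # Placeholder: a real model runs a full conditional diffusion
    # process; here we only mix in the mean of the previous images.
    ctx = torch.stack(context).mean(0) if context else torch.zeros_like(noise)
    return 0.9 * noise + 0.1 * ctx

series: list[torch.Tensor] = []
for _ in range(4):                     # four interrelated images
    series.append(denoise(torch.randn(3, 64, 64), series))
print(len(series), series[0].shape)    # 4 torch.Size([3, 64, 64])
```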
arXiv Detail & Related papers (2024-04-03T23:20:40Z)
- MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation [34.61940502872307]
MultiDiffusion is a unified framework that enables versatile and controllable image generation.
We show that MultiDiffusion can be readily applied to generate high quality and diverse images.
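The fusion mechanism MultiDiffusion is known for can be sketched in a few lines: denoise overlapping crops of a wide latent and average the predictions where crops overlap. The window size, stride, and stub denoiser below are illustrative assumptions.

```python
# Miniature version of fusing diffusion paths: denoise overlapping
# crops of a wide latent and average the results where crops overlap.
import torch

def fused_step(latent: torch.Tensor, denoiser,
               win: int = 64, stride: int = 32) -> torch.Tensor:
    """One fused denoising step over a (C, H, W) latent."""
    C, H, W = latent.shape
    acc = torch.zeros_like(latent)
    cnt = torch.zeros(1, H, W)
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            acc[:, y:y + win, x:x + win] += denoiser(latent[:, y:y + win, x:x + win])
            cnt[:, y:y + win, x:x + win] += 1
    return acc / cnt.clamp(min=1)      # average overlapping predictions

latent = torch.randn(4, 64, 192)                 # wide panorama latent
latent = fused_step(latent, lambda c: 0.98 * c)  # stub denoiser
```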
arXiv Detail & Related papers (2023-02-16T06:28:29Z)
- A Generic Approach for Enhancing GANs by Regularized Latent Optimization [79.00740660219256]
We introduce a generic framework called generative-model inference that can effectively and seamlessly enhance pre-trained GANs.
Our basic idea is to efficiently infer the optimal latent distribution for the given requirements using Wasserstein gradient flow techniques.
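Setting aside the Wasserstein-gradient-flow machinery, the underlying pattern is regularized optimization of generator latents; the stub generator `G` and the weighting below are purely illustrative.

```python
# Regularized latent optimization in miniature: descend on a
# requirement loss plus a prior-keeping regularizer. The stub
# generator and weights are illustrative, not the paper's method.
import torch

W = torch.randn(8, 16)                  # frozen "pre-trained" weights
G = lambda z: torch.tanh(z @ W)         # stub generator

target = torch.zeros(16)                # the requirement to satisfy
z = torch.randn(4, 8, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(100):
    opt.zero_grad()
    fit = ((G(z) - target) ** 2).mean() # match the requirement
    reg = 0.01 * (z ** 2).mean()        # keep latents near the prior
    (fit + reg).backward()
    opt.step()
```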
arXiv Detail & Related papers (2021-12-07T05:22:50Z)
- Unsupervised Cycle-consistent Generative Adversarial Networks for Pan-sharpening [41.68141846006704]
We propose an unsupervised generative adversarial framework that learns from full-scale images without ground truths, alleviating the need for supervised pan-sharpening labels.
We extract the modality-specific features from the PAN and MS images with a two-stream generator, perform fusion in the feature domain, and then reconstruct the pan-sharpened images.
Results demonstrate that the proposed method can greatly improve the pan-sharpening performance on the full-scale images.
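A compact sketch of the two-stream design described above, with illustrative channel sizes (not the paper's architecture): separate encoders for the PAN and MS inputs, fusion in feature space, and a decoder that reconstructs the pan-sharpened image.

```python
# Sketch of a two-stream pan-sharpening generator; layer and channel
# choices are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStreamGenerator(nn.Module):
    def __init__(self, ms_bands: int = 4, feat: int = 32):
        super().__init__()
        self.pan_enc = nn.Conv2d(1, feat, 3, padding=1)        # PAN stream
        self.ms_enc = nn.Conv2d(ms_bands, feat, 3, padding=1)  # MS stream
        self.decoder = nn.Conv2d(2 * feat, ms_bands, 3, padding=1)

    def forward(self, pan, ms_up):
        # ms_up: the MS image upsampled to the PAN resolution.
        fused = torch.cat([self.pan_enc(pan), self.ms_enc(ms_up)], dim=1)
        return self.decoder(torch.relu(fused))

g = TwoStreamGenerator()
out = g(torch.randn(1, 1, 64, 64), torch.randn(1, 4, 64, 64))
print(out.shape)  # torch.Size([1, 4, 64, 64])
```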
arXiv Detail & Related papers (2021-09-20T09:43:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.