Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation
- URL: http://arxiv.org/abs/2309.03467v2
- Date: Mon, 8 Apr 2024 07:49:47 GMT
- Title: Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation
- Authors: Zhuqiang Lu, Kun Hu, Chaoyue Wang, Lei Bai, Zhiyong Wang
- Abstract summary: AOG-Net is proposed for 360-degree image generation by progressively outpainting an incomplete 360-degree image with NFoV and text guidance, jointly or individually.
A global-local conditioning mechanism is devised to formulate the outpainting guidance in each autoregressive step.
Comprehensive experiments on two commonly used 360-degree image datasets for both indoor and outdoor settings demonstrate the state-of-the-art performance of our proposed method.
- Score: 36.45222068699805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A 360-degree (omni-directional) image provides an all-encompassing spherical view of a scene. Recently, there has been increasing interest in synthesising 360-degree images from conventional narrow field of view (NFoV) images captured by digital cameras and smartphones, for providing immersive experiences in various scenarios such as virtual reality. Yet, existing methods typically fall short in synthesizing intricate visual details or ensuring that the generated images align consistently with user-provided prompts. In this study, an autoregressive omni-aware generative network (AOG-Net) is proposed for 360-degree image generation by progressively outpainting an incomplete 360-degree image with NFoV and text guidance, jointly or individually. This autoregressive scheme not only allows finer-grained, text-consistent patterns to be derived by dynamically generating and adjusting the process, but also offers users greater flexibility to edit their conditions throughout the generation process. A global-local conditioning mechanism is devised to comprehensively formulate the outpainting guidance in each autoregressive step. Text guidance, omni-visual cues, NFoV inputs and omni-geometry are encoded and further formulated with cross-attention-based transformers into a global stream and a local stream that condition a generative backbone model. As AOG-Net is compatible with large-scale models for the conditional encoder and the generative prior, it enables generation with extensive open-vocabulary text guidance. Comprehensive experiments on two commonly used 360-degree image datasets for both indoor and outdoor settings demonstrate the state-of-the-art performance of our proposed method. Our code will be made publicly available.
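To make the autoregressive scheme and the global-local conditioning concrete, below is a minimal sketch of the generation loop as described in the abstract: an NFoV image is placed on an equirectangular canvas, and each step synthesizes the next region conditioned on a global stream (whole canvas, text prompt, omni-geometry) and a local stream (the current window and its geometry). The encoders, the backbone and the fixed horizontal window schedule are hypothetical placeholders, not the authors' implementation.

```python
# Hedged sketch of an autoregressive 360-degree outpainting loop with
# global/local conditioning. Only the control flow follows the abstract;
# every component below is a stand-in, not the AOG-Net implementation.
import numpy as np

H, W = 256, 512            # equirectangular (360-degree) canvas: height x width
STEPS = 4                  # number of autoregressive outpainting steps (assumption)
WIN = W // STEPS           # horizontal window revealed per step (assumption)

def omni_geometry(h, w):
    """Per-pixel spherical coordinates (latitude, longitude) of the canvas."""
    lat = np.linspace(np.pi / 2, -np.pi / 2, h)[:, None].repeat(w, axis=1)
    lon = np.linspace(-np.pi, np.pi, w)[None, :].repeat(h, axis=0)
    return np.stack([lat, lon], axis=-1)            # shape (H, W, 2)

def encode_global(canvas, text_prompt, geometry):
    """Placeholder global stream: whole-canvas cues + text guidance + geometry."""
    return {"canvas": canvas, "text": text_prompt, "geometry": geometry}

def encode_local(window, geometry_window):
    """Placeholder local stream: the current outpainting window and its geometry."""
    return {"window": window, "geometry": geometry_window}

def generative_backbone(local_cond, global_cond):
    """Placeholder generator; a real system would run a conditioned generative model."""
    rng = np.random.default_rng(0)
    return rng.random(local_cond["window"].shape)   # fake synthesized content

# Start from an NFoV image pasted into an otherwise empty 360-degree canvas.
canvas = np.zeros((H, W, 3))
mask = np.zeros((H, W), dtype=bool)                 # True where content exists
canvas[:, :WIN] = 0.5                               # stand-in for the NFoV input
mask[:, :WIN] = True
geometry = omni_geometry(H, W)
prompt = "a sunny beach with palm trees"            # open-vocabulary text guidance

for step in range(1, STEPS):                        # progressively complete the canvas
    lo, hi = step * WIN, (step + 1) * WIN
    global_cond = encode_global(canvas, prompt, geometry)            # global stream
    local_cond = encode_local(canvas[:, lo:hi], geometry[:, lo:hi])  # local stream
    canvas[:, lo:hi] = generative_backbone(local_cond, global_cond)
    mask[:, lo:hi] = True                           # the prompt could be edited here

assert mask.all()                                   # full 360-degree image completed
```

In a real system the placeholder backbone would be a conditioned generative model and the conditioning would use the paper's cross-attention-based transformers; the sketch only illustrates the step-by-step loop and the split between global and local guidance.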
Related papers
- OPa-Ma: Text Guided Mamba for 360-degree Image Out-painting [9.870063736691556]
We tackle the recently popular topic of generating 360-degree images from conventional narrow field of view (NFoV) images.
This task aims to predict the reasonable and consistent surroundings from the NFoV images.
We propose a novel text-guided out-painting framework equipped with a State-Space Model called Mamba.
arXiv Detail & Related papers (2024-07-15T17:23:00Z)
- Taming Stable Diffusion for Text to 360° Panorama Image Generation [74.69314801406763]
We introduce a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt.
We propose a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process.
arXiv Detail & Related papers (2024-04-11T17:46:14Z)
- Dream360: Diverse and Immersive Outdoor Virtual Scene Creation via Transformer-Based 360 Image Outpainting [33.95741744421632]
We propose a transformer-based 360 image outpainting framework called Dream360.
It can generate diverse, high-fidelity, and high-resolution panoramas from user-selected viewports.
Our Dream360 achieves significantly lower Frechet Inception Distance (FID) scores and better visual fidelity than existing methods.
arXiv Detail & Related papers (2024-01-19T09:01:20Z)
- Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks [53.67497327319569]
We introduce a novel neural rendering technique to solve image-to-3D from a single view.
Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks.
Our experiments show the advantages of our proposed approach with consistent results and rapid generation.
arXiv Detail & Related papers (2023-12-24T08:42:37Z)
- UpFusion: Novel View Diffusion from Unposed Sparse View Observations [66.36092764694502]
UpFusion can perform novel view synthesis and infer 3D representations for an object given a sparse set of reference images.
We show that this mechanism allows generating high-fidelity novel views while improving the synthesis quality given additional (unposed) images.
arXiv Detail & Related papers (2023-12-11T18:59:55Z)
- HawkI: Homography & Mutual Information Guidance for 3D-free Single Image to Aerial View [67.8213192993001]
We present HawkI for synthesizing aerial-view images from text and an exemplar image.
HawkI blends the visual features from the input image within a pretrained text-to-2D-image stable diffusion model.
At inference, HawkI employs a unique mutual information guidance formulation to steer the generated image towards faithfully replicating the semantic details of the input image.
arXiv Detail & Related papers (2023-11-27T01:41:25Z)
- Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering.
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
- Capturing Omni-Range Context for Omnidirectional Segmentation [29.738065412097598]
We introduce Efficient Concurrent Attention Networks (ECANets) to bridge the gap in FoV and structural distribution between the imaging domains.
We upgrade model training by leveraging multi-source and omni-supervised learning, taking advantage of both densely labeled and unlabeled data.
Our novel model, training regimen and multi-source prediction fusion elevate the performance (mIoU) to new state-of-the-art results.
arXiv Detail & Related papers (2021-03-09T19:46:09Z)