OPa-Ma: Text Guided Mamba for 360-degree Image Out-painting
- URL: http://arxiv.org/abs/2407.10923v1
- Date: Mon, 15 Jul 2024 17:23:00 GMT
- Title: OPa-Ma: Text Guided Mamba for 360-degree Image Out-painting
- Authors: Penglei Gao, Kai Yao, Tiandi Ye, Steven Wang, Yuan Yao, Xiaofeng Wang,
- Abstract summary: We tackle the recently popular topic of generating 360-degree images given the conventional narrow field of view (NFoV) images.
This task aims to predict the reasonable and consistent surroundings from the NFoV images.
We propose a novel text-guided out-painting framework equipped with a State-Space Model called Mamba.
- Score: 9.870063736691556
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we tackle the recently popular topic of generating 360-degree images given the conventional narrow field of view (NFoV) images that could be taken from a single camera or cellphone. This task aims to predict the reasonable and consistent surroundings from the NFoV images. Existing methods for feature extraction and fusion, often built with transformer-based architectures, incur substantial memory usage and computational expense. They also have limitations in maintaining visual continuity across the entire 360-degree images, which could cause inconsistent texture and style generation. To solve the aforementioned issues, we propose a novel text-guided out-painting framework equipped with a State-Space Model called Mamba to utilize its long-sequence modelling and spatial continuity. Furthermore, incorporating textual information is an effective strategy for guiding image generation, enriching the process with detailed context and increasing diversity. Efficiently extracting textual features and integrating them with image attributes presents a significant challenge for 360-degree image out-painting. To address this, we develop two modules, Visual-textual Consistency Refiner (VCR) and Global-local Mamba Adapter (GMA). VCR enhances contextual richness by fusing the modified text features with the image features, while GMA provides adaptive state-selective conditions by capturing the information flow from global to local representations. Our proposed method achieves state-of-the-art performance with extensive experiments on two broadly used 360-degree image datasets, including indoor and outdoor settings.
Related papers
- Visual Text Generation in the Wild [67.37458807253064]
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild.
The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability.
The generated images provide superior utility for tasks involving text detection and text recognition.
arXiv Detail & Related papers (2024-07-19T09:08:20Z) - VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model [76.02314305164595]
This work presents a novel image outpainting framework that is capable of customizing the results according to the requirement of users.
We take advantage of a Multimodal Large Language Model (MLLM) that automatically extracts and organizes the corresponding textual descriptions of the masked and unmasked part of a given image.
In addition, a special Cross-Attention module, namely Center-Total-Surrounding (CTS), is elaborately designed to enhance further the the interaction between specific space regions of the image and corresponding parts of the text prompts.
arXiv Detail & Related papers (2024-06-03T07:14:19Z) - TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z) - A Novel State Space Model with Local Enhancement and State Sharing for Image Fusion [14.293042131263924]
In image fusion tasks, images from different sources possess distinct characteristics.
Mamba, as a state space model, has emerged in the field of natural language processing.
Motivated by these challenges, we customize and improve the vision Mamba network designed for the image fusion task.
arXiv Detail & Related papers (2024-04-14T16:09:33Z) - Taming Stable Diffusion for Text to 360° Panorama Image Generation [74.69314801406763]
We introduce a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt.
We propose a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process.
arXiv Detail & Related papers (2024-04-11T17:46:14Z) - Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance [17.251982243534144]
LAR-Gen is a novel approach for image inpainting that enables seamless inpainting of masked scene images.
Our approach adopts a coarse-to-fine manner to ensure subject identity preservation and local semantic coherence.
Experiments and varied application scenarios demonstrate the superiority of LAR-Gen in terms of both identity preservation and text semantic consistency.
arXiv Detail & Related papers (2024-03-28T16:07:55Z) - 3D-aware Image Generation and Editing with Multi-modal Conditions [6.444512435220748]
3D-consistent image generation from a single 2D semantic label is an important and challenging research topic in computer graphics and computer vision.
We propose a novel end-to-end 3D-aware image generation and editing model incorporating multiple types of conditional inputs.
Our method can generate diverse images with distinct noises, edit the attribute through a text description and conduct style transfer by giving a reference RGB image.
arXiv Detail & Related papers (2024-03-11T07:10:37Z) - Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation [36.45222068699805]
AOG-Net is proposed for 360-degree image generation by out-painting an incomplete image progressively with NFoV and text guidances joinly or individually.
A global-local conditioning mechanism is devised to formulate the outpainting guidance in each autoregressive step.
Comprehensive experiments on two commonly used 360-degree image datasets for both indoor and outdoor settings demonstrate the state-of-the-art performance of our proposed method.
arXiv Detail & Related papers (2023-09-07T03:22:59Z) - Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation.
The proposed method significantly outperforms the state of the arts in various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z) - Guided Co-Modulated GAN for 360{\deg} Field of View Extrapolation [15.850166450573756]
We propose a method to extrapolate a 360deg field of view from a single image.
Our method obtains state-of-the-art results and outperforms previous methods on standard image quality metrics.
arXiv Detail & Related papers (2022-04-15T01:48:35Z) - Bridging Composite and Real: Towards End-to-end Deep Image Matting [88.79857806542006]
We study the roles of semantics and details for image matting.
We propose a novel Glance and Focus Matting network (GFM), which employs a shared encoder and two separate decoders.
Comprehensive empirical studies have demonstrated that GFM outperforms state-of-the-art methods.
arXiv Detail & Related papers (2020-10-30T10:57:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.