OPa-Ma: Text Guided Mamba for 360-degree Image Out-painting
- URL: http://arxiv.org/abs/2407.10923v1
- Date: Mon, 15 Jul 2024 17:23:00 GMT
- Title: OPa-Ma: Text Guided Mamba for 360-degree Image Out-painting
- Authors: Penglei Gao, Kai Yao, Tiandi Ye, Steven Wang, Yuan Yao, Xiaofeng Wang,
- Abstract summary: We tackle the recently popular topic of generating 360-degree images given the conventional narrow field of view (NFoV) images.
This task aims to predict the reasonable and consistent surroundings from the NFoV images.
We propose a novel text-guided out-painting framework equipped with a State-Space Model called Mamba.
- Score: 9.870063736691556
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we tackle the recently popular topic of generating 360-degree images given the conventional narrow field of view (NFoV) images that could be taken from a single camera or cellphone. This task aims to predict the reasonable and consistent surroundings from the NFoV images. Existing methods for feature extraction and fusion, often built with transformer-based architectures, incur substantial memory usage and computational expense. They also have limitations in maintaining visual continuity across the entire 360-degree images, which could cause inconsistent texture and style generation. To solve the aforementioned issues, we propose a novel text-guided out-painting framework equipped with a State-Space Model called Mamba to utilize its long-sequence modelling and spatial continuity. Furthermore, incorporating textual information is an effective strategy for guiding image generation, enriching the process with detailed context and increasing diversity. Efficiently extracting textual features and integrating them with image attributes presents a significant challenge for 360-degree image out-painting. To address this, we develop two modules, Visual-textual Consistency Refiner (VCR) and Global-local Mamba Adapter (GMA). VCR enhances contextual richness by fusing the modified text features with the image features, while GMA provides adaptive state-selective conditions by capturing the information flow from global to local representations. Our proposed method achieves state-of-the-art performance with extensive experiments on two broadly used 360-degree image datasets, including indoor and outdoor settings.
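The abstract's two key mechanisms, a selective state-space scan over image features and text-conditioned fusion, can be illustrated in a few lines. Below is a minimal, self-contained sketch of a Mamba-style selective scan over flattened image patches, with a simple text-conditioned gate standing in for the paper's visual-textual fusion. All class names, shapes, and the gating design are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions throughout): a Mamba-style selective state-space
# scan over a flattened patch sequence, plus a toy text-conditioned gate that
# loosely mirrors the paper's visual-textual fusion. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Input-dependent (selective) SSM recurrence, unrolled for clarity."""
    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(dim, state))  # negative -> stable decay
        self.to_B = nn.Linear(dim, state)               # input-dependent B_t
        self.to_C = nn.Linear(dim, state)               # input-dependent C_t
        self.to_dt = nn.Linear(dim, dim)                # input-dependent step size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim), e.g. image patches in raster-scan order
        b, l, d = x.shape
        h = x.new_zeros(b, d, self.A.shape[1])          # hidden state (b, d, n)
        ys = []
        for t in range(l):
            xt = x[:, t]                                # (b, d)
            dt = F.softplus(self.to_dt(xt))             # (b, d), positive step
            dA = torch.exp(dt.unsqueeze(-1) * self.A)   # discretized decay (b, d, n)
            Bt = self.to_B(xt).unsqueeze(1)             # (b, 1, n)
            Ct = self.to_C(xt).unsqueeze(1)             # (b, 1, n)
            h = dA * h + dt.unsqueeze(-1) * Bt * xt.unsqueeze(-1)
            ys.append((h * Ct).sum(-1))                 # read-out (b, d)
        return torch.stack(ys, dim=1)                   # (b, l, d)

class TextGatedFusion(nn.Module):
    """Hypothetical stand-in for text-image fusion: gate patches by text."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # img: (b, l, d) patch features; txt: (b, d) pooled text embedding
        txt = txt.unsqueeze(1).expand_as(img)
        return img * torch.sigmoid(self.gate(torch.cat([img, txt], dim=-1)))

# Usage: condition patch features on text, then run the long-range scan.
ssm, fuse = SelectiveSSM(64), TextGatedFusion(64)
img = torch.randn(2, 256, 64)   # 2 images, 16x16 patches, 64-dim features
txt = torch.randn(2, 64)        # pooled text embedding (e.g. from CLIP)
out = ssm(fuse(img, txt))       # (2, 256, 64)
```

The sequential loop makes the linear-time recurrence explicit, in contrast to the quadratic attention cost the abstract cites as a drawback of transformer-based methods; practical Mamba implementations replace the loop with a fused parallel-scan kernel.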
Related papers
- A Survey on Text-Driven 360-Degree Panorama Generation [31.86065545952698]
Text-driven 360-degree panorama generation is a transformative advancement in immersive visual content creation.
Recent progress in text-to-image diffusion models has driven rapid development in this emerging field.
This survey offers an in-depth analysis of state-of-the-art algorithms and their expanding applications in 360-degree 3D scene generation.
arXiv Detail & Related papers (2025-02-20T18:19:57Z)
- Visual Text Generation in the Wild [67.37458807253064]
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild.
The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability.
The generated images provide superior utility for tasks involving text detection and text recognition.
arXiv Detail & Related papers (2024-07-19T09:08:20Z)
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- A Novel State Space Model with Local Enhancement and State Sharing for Image Fusion [14.293042131263924]
In image fusion tasks, images from different sources possess distinct characteristics.
Mamba, as a state space model, has emerged in the field of natural language processing.
Motivated by these challenges, we customize and improve a vision Mamba network tailored to the image fusion task.
arXiv Detail & Related papers (2024-04-14T16:09:33Z)
- Taming Stable Diffusion for Text to 360° Panorama Image Generation [74.69314801406763]
We introduce a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt.
We propose a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process (see the generic cross-attention sketch after this list).
arXiv Detail & Related papers (2024-04-11T17:46:14Z)
- MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation [54.64194935409982]
We introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer-wise RGBA decompositions.
MuLAn is the first photorealistic resource providing instance decomposition and spatial information for high quality images.
We aim to encourage the development of novel generation and editing technology, in particular layer-wise solutions.
arXiv Detail & Related papers (2024-04-03T14:58:00Z)
- Locate, Assign, Refine: Taming Customized Promptable Image Inpainting [22.163855501668206]
We introduce the multimodal promptable image inpainting project: a new task, model, and data for taming customized image inpainting.
We propose LAR-Gen, a novel approach for image inpainting that enables seamless inpainting of specific regions in an image corresponding to the mask prompt.
Our LAR-Gen adopts a coarse-to-fine manner to ensure context consistency with the source image, subject identity consistency, local semantic consistency with the text description, and smoothness consistency.
arXiv Detail & Related papers (2024-03-28T16:07:55Z)
- Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation [36.45222068699805]
AOG-Net is proposed for 360-degree image generation, progressively out-painting an incomplete image under NFoV and text guidance, jointly or individually.
A global-local conditioning mechanism is devised to formulate the outpainting guidance in each autoregressive step.
Comprehensive experiments on two commonly used 360-degree image datasets for both indoor and outdoor settings demonstrate the state-of-the-art performance of our proposed method.
arXiv Detail & Related papers (2023-09-07T03:22:59Z)
- Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with online text augmentation.
The proposed method significantly outperforms the state of the art on various metrics, including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z)
- Guided Co-Modulated GAN for 360° Field of View Extrapolation [15.850166450573756]
We propose a method to extrapolate a 360° field of view from a single image.
Our method obtains state-of-the-art results and outperforms previous methods on standard image quality metrics.
arXiv Detail & Related papers (2022-04-15T01:48:35Z)
- Bridging Composite and Real: Towards End-to-end Deep Image Matting [88.79857806542006]
We study the roles of semantics and details for image matting.
We propose a novel Glance and Focus Matting network (GFM), which employs a shared encoder and two separate decoders.
Comprehensive empirical studies have demonstrated that GFM outperforms state-of-the-art methods.
arXiv Detail & Related papers (2020-10-30T10:57:13Z)
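As referenced in the PanFusion entry above: the projection-aware details are specific to that paper, but the underlying conditioning primitive is standard cross-attention between two token streams. The sketch below uses assumed names and shapes; the optional boolean mask stands in, very loosely, for restricting attention to geometrically corresponding regions.

```python
# Generic cross-attention between two token streams (e.g. a panorama branch
# attending to a perspective branch). Names, shapes, and the mask's role as a
# "projection awareness" stand-in are assumptions, not PanFusion's actual code.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, ctx, mask=None):
        # x: (b, n, d) query tokens; ctx: (b, m, d) context tokens
        # mask: optional (n, m) bool, True where attention is allowed
        b, n, d = x.shape
        h, dh = self.heads, d // self.heads
        q = self.to_q(x).view(b, n, h, dh).transpose(1, 2)    # (b, h, n, dh)
        k, v = self.to_kv(ctx).chunk(2, dim=-1)
        k = k.view(b, -1, h, dh).transpose(1, 2)              # (b, h, m, dh)
        v = v.view(b, -1, h, dh).transpose(1, 2)              # (b, h, m, dh)
        attn = (q @ k.transpose(-2, -1)) * self.scale         # (b, h, n, m)
        if mask is not None:
            attn = attn.masked_fill(~mask, float("-inf"))
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)

# Usage: panorama tokens query perspective-view tokens.
xattn = CrossAttention(dim=64)
pano = torch.randn(1, 128, 64)
persp = torch.randn(1, 64, 64)
fused = xattn(pano, persp)      # (1, 128, 64)
```

In a projection-aware variant, the mask is where one would encode which perspective tokens geometrically correspond to each panorama token; PanFusion's actual mechanism is described in the paper itself.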