BATINet: Background-Aware Text to Image Synthesis and Manipulation Network
- URL: http://arxiv.org/abs/2308.05921v1
- Date: Fri, 11 Aug 2023 03:22:33 GMT
- Title: BATINet: Background-Aware Text to Image Synthesis and Manipulation Network
- Authors: Ryugo Morita, Zhiqiang Zhang, Jinjia Zhou
- Abstract summary: We analyzed a novel Background-Aware Text2Image (BAT2I) task in which the generated content matches the input background.
We proposed a Background-Aware Text to Image synthesis and manipulation Network (BATINet), which contains two key components.
We demonstrated through qualitative and quantitative evaluations on the CUB dataset that the proposed model outperforms other state-of-the-art methods.
- Score: 12.924990882126105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background-Induced Text2Image (BIT2I) aims to generate foreground content according to the text on a given background image. Most studies focus on generating high-quality foreground content but ignore the relationship between the foreground and the background. In this study, we analyzed a novel Background-Aware Text2Image (BAT2I) task in which the generated content matches the input background. We proposed a Background-Aware Text to Image synthesis and manipulation Network (BATINet), which contains two key components: a Position Detect Network (PDN) and a Harmonize Network (HN). The PDN detects the most plausible position of the text-relevant object in the background image. The HN harmonizes the generated content with the background by referring to its style information. Finally, we redesigned the generation network, which combines a multi-GAN with an attention module, to better match user preferences. Moreover, BATINet can be applied to text-guided image manipulation, where it addresses the most challenging case: manipulating the shape of an object. We demonstrated through qualitative and quantitative evaluations on the CUB dataset that the proposed model outperforms other state-of-the-art methods.
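As a rough illustration of how the two components could fit together, the sketch below is a minimal PyTorch skeleton of a position-detection module (predicting a normalized bounding box from the background image and a text embedding) and a harmonization module (refining a composited image toward the background style). All class names, layer choices, and tensor shapes are assumptions made for illustration only; the abstract does not specify this implementation.

```python
import torch
import torch.nn as nn

class PositionDetectNetwork(nn.Module):
    """Illustrative PDN sketch: predicts a normalized box (x, y, w, h)
    for the text-relevant object from background features and a text embedding."""
    def __init__(self, text_dim=256, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(
            nn.Linear(feat_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, 4), nn.Sigmoid(),  # normalized (x, y, w, h)
        )

    def forward(self, background, text_emb):
        bg_feat = self.backbone(background).flatten(1)
        return self.head(torch.cat([bg_feat, text_emb], dim=1))

class HarmonizeNetwork(nn.Module):
    """Illustrative HN sketch: residual CNN that refines a composited image
    so the generated foreground blends with the background style."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, 3, 3, padding=1),
        )

    def forward(self, composite):
        return torch.tanh(composite + self.body(composite))

# Toy forward pass: detect a position, then harmonize a composite image.
background = torch.randn(2, 3, 128, 128)   # dummy background batch
text_emb = torch.randn(2, 256)             # dummy text embedding
pdn, hn = PositionDetectNetwork(), HarmonizeNetwork()
bbox = pdn(background, text_emb)           # where to place the generated object
composite = background                     # placeholder for the generator output
output = hn(composite)
print(bbox.shape, output.shape)            # torch.Size([2, 4]) torch.Size([2, 3, 128, 128])
```

In the actual method, the composite passed to the harmonization step would come from the text-conditioned generation network; the placeholder above only demonstrates the data flow between the two components.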
Related papers
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - Image-text Retrieval via Preserving Main Semantics of Vision [5.376441473801597]
This paper presents a semantic optimization approach, implemented as a Visual Semantic Loss (VSL).
We leverage the annotated texts corresponding to an image to assist the model in capturing the main content of the image.
Experiments on two benchmark datasets demonstrate the superior performance of our method.
arXiv Detail & Related papers (2023-04-20T12:23:29Z) - Weakly Supervised Realtime Dynamic Background Subtraction [8.75682288556859]
We propose a weakly supervised framework that can perform background subtraction without requiring per-pixel ground-truth labels.
Our framework is trained on a moving object-free sequence of images and comprises two networks.
Our proposed method is online, real-time, efficient, and requires minimal frame-level annotation.
arXiv Detail & Related papers (2023-03-06T03:17:48Z) - Fully Context-Aware Image Inpainting with a Learned Semantic Pyramid [102.24539566851809]
Restoring reasonable and realistic content for arbitrary missing regions in images is an important yet challenging task.
Recent image inpainting models have made significant progress in generating vivid visual details, but they can still lead to texture blurring or structural distortions.
We propose the Semantic Pyramid Network (SPN) motivated by the idea that learning multi-scale semantic priors can greatly benefit the recovery of locally missing content in images.
arXiv Detail & Related papers (2021-12-08T04:33:33Z) - Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z) - BachGAN: High-Resolution Image Synthesis from Salient Object Layout [78.51640906030244]
We propose a new task towards more practical application for image generation - high-quality image synthesis from salient object layout.
Two main challenges spring from this new task: (i) how to generate fine-grained details and realistic textures without segmentation map input; and (ii) how to create a background and weave it seamlessly into standalone objects.
By generating the hallucinated background representation dynamically, our model can synthesize high-resolution images with both photo-realistic foreground and integral background.
arXiv Detail & Related papers (2020-03-26T00:54:44Z) - SwapText: Image Based Texts Transfer in Scenes [13.475726959175057]
We present SwapText, a framework to transfer texts across scene images.
A novel text swapping network is proposed to replace text labels only in the foreground image.
The generated foreground and background images are then combined by a fusion network to produce the final word image.
arXiv Detail & Related papers (2020-03-18T11:02:17Z) - Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z) - Scene Text Synthesis for Efficient and Effective Deep Network Training [62.631176120557136]
We develop an innovative image synthesis technique that composes annotated training images by embedding foreground objects of interest into background images.
The proposed technique consists of two key components that in principle boost the usefulness of the synthesized images in deep network training.
Experiments over a number of public datasets demonstrate the effectiveness of our proposed image synthesis technique.
arXiv Detail & Related papers (2019-01-26T10:15:24Z)