Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation
- URL: http://arxiv.org/abs/2010.12136v1
- Date: Fri, 23 Oct 2020 02:43:02 GMT
- Title: Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation
- Authors: Bowen Li, Xiaojuan Qi, Philip H. S. Torr, Thomas Lukasiewicz
- Abstract summary: We propose a novel lightweight generative adversarial network for efficient image manipulation using natural language descriptions.
A new word-level discriminator is proposed, which provides the generator with fine-grained training feedback at the word level.
- Score: 139.41321867508722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel lightweight generative adversarial network for efficient
image manipulation using natural language descriptions. To achieve this, a new
word-level discriminator is proposed, which provides the generator with
fine-grained training feedback at the word level. This facilitates training a
lightweight generator that has a small number of parameters but can still
correctly focus on specific visual attributes of an image and edit them
without affecting other contents that are not described in the text.
Furthermore, thanks to the explicit training signal related to each word, the
discriminator can also be simplified to a lightweight structure. Compared
with the state of the art, our method has a much smaller number of parameters
but still achieves competitive manipulation performance. Extensive
experimental results demonstrate that our method can better disentangle
different visual attributes and correctly map them to the corresponding semantic
words, thus achieving more accurate image modification using natural
language descriptions.
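
To make the core idea concrete, here is a minimal PyTorch sketch of how a word-level discriminator of this kind might be structured: each word of the description attends over image regions and receives its own real/fake logit, so the generator gets per-word rather than sentence-level feedback. The module name, dimensions, and attention scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class WordLevelDiscriminator(nn.Module):
    """Hypothetical sketch: scores an image against each word of the
    description, so the generator receives per-word (fine-grained)
    real/fake feedback rather than a single sentence-level score."""

    def __init__(self, img_channels=256, word_dim=256):
        super().__init__()
        # Project the image feature map into the word-embedding space.
        self.img_proj = nn.Conv2d(img_channels, word_dim, kernel_size=1)
        # One shared linear head produces a real/fake logit per word.
        self.score = nn.Linear(word_dim, 1)

    def forward(self, img_feats, word_embs):
        # img_feats: (B, C, H, W) convolutional features of the image
        # word_embs: (B, T, D) embeddings of the T words in the caption
        regions = self.img_proj(img_feats).flatten(2)      # (B, D, H*W)
        # Attention of each word over image regions.
        attn = torch.softmax(word_embs @ regions, dim=-1)  # (B, T, H*W)
        # Word-specific visual context vectors.
        context = attn @ regions.transpose(1, 2)           # (B, T, D)
        # One real/fake logit per word.
        word_logits = self.score(context).squeeze(-1)      # (B, T)
        return word_logits
```

Averaging the per-word logits would recover a conventional image-level score, so a head like this can coexist with a standard adversarial loss while still exposing the fine-grained, per-word training signal the abstract describes.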
Related papers
- Text as Image: Learning Transferable Adapter for Multi-Label Classification [13.11583340598517]
We introduce an effective approach to employ large language models for multi-label instruction-following text generation.
In this way, a fully automated pipeline for visual label recognition is developed without relying on any manual data.
arXiv Detail & Related papers (2023-12-07T09:22:20Z)
- CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing [22.40686064568406]
We present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes.
Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds.
arXiv Detail & Related papers (2023-07-17T11:29:48Z)
- WordStylist: Styled Verbatim Handwritten Text Generation with Latent Diffusion Models [8.334487584550185]
We present a latent diffusion-based method for word-level generation of styled handwritten text images with specified textual content.
Our proposed method is able to generate realistic word image samples in different writer styles.
We show that the proposed model produces samples that are aesthetically pleasing, help boost text recognition performance, and achieve writer retrieval scores similar to real data.
arXiv Detail & Related papers (2023-03-29T10:19:26Z)
- CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation [4.078926358349661]
Contrastive Language-Image Pre-Training (CLIP) bridges images and text by embedding them into a joint latent space.
Due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images.
We introduce CLIP Projection-Augmentation Embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation.
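
The following is a minimal sketch of what such a projection-based optimization target could look like, assuming a small set of attribute-related CLIP embeddings serves as a basis for the relevant subspace; the function name, the QR-based projection, and the `alpha` blending factor are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def projection_augmentation_embedding(text_emb, basis, alpha=1.0):
    """Hypothetical sketch of a projection-augmentation target.

    text_emb: (D,) CLIP embedding of the edit description.
    basis:    (K, D) embeddings spanning an attribute-relevant subspace
              (e.g., CLIP embeddings of related attribute prompts).
    """
    # Orthonormalize the basis rows (QR on the transpose).
    q, _ = torch.linalg.qr(basis.T)       # q: (D, K), orthonormal columns
    projected = q @ (q.T @ text_emb)      # projection onto the subspace
    # Blend between the raw text embedding and its projection.
    target = text_emb + alpha * (projected - text_emb)
    return target / target.norm()         # re-normalize for a cosine loss
```

The intent, as the summary describes, is that discarding directions outside the attribute subspace leaves an optimization target less prone to the artifacts that raw text embeddings introduce.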
arXiv Detail & Related papers (2022-10-08T05:12:25Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to better vision-language models besides prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
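
A minimal sketch of a feature adapter in this spirit: a small bottleneck MLP applied on top of frozen CLIP features and blended with the original feature through a residual ratio. The dimensions and the ratio value here are assumptions for illustration.

```python
import torch.nn as nn

class CLIPAdapter(nn.Module):
    """Sketch of a feature adapter: a bottleneck MLP over frozen CLIP
    features, mixed back with the original feature via a residual ratio."""

    def __init__(self, dim=512, reduction=4, ratio=0.2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )
        self.ratio = ratio  # how much of the adapted feature to mix in

    def forward(self, clip_feat):
        adapted = self.fc(clip_feat)
        return self.ratio * adapted + (1 - self.ratio) * clip_feat
```

Because only the adapter's small set of parameters is trained while the CLIP backbone stays frozen, fine-tuning remains cheap, which is the appeal of the adapter route over full fine-tuning.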
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
- StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [71.1862388442953]
We develop a text-based interface for StyleGAN image manipulation.
We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt.
Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation.
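
A condensed sketch of such a CLIP-guided latent optimization loop is shown below, assuming pretrained, frozen `generator` and `clip_model` objects (e.g., a StyleGAN synthesis network and a CLIP encoder pair); the hyperparameters and the latent L2 regularizer weight are illustrative guesses rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def clip_guided_latent_optimization(w, generator, clip_model, text_tokens,
                                    steps=100, lr=0.05, lambda_l2=0.008):
    """Sketch: adjust a latent code so the generated image matches a text
    prompt under a CLIP similarity loss, while an L2 term keeps the latent
    close to its starting point. generator/clip_model assumed frozen."""
    w0 = w.detach().clone()
    w = w.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    text_feat = clip_model.encode_text(text_tokens).detach()
    for _ in range(steps):
        img = generator(w)                       # synthesize from the latent
        img_feat = clip_model.encode_image(img)  # assumes suitable resizing
        clip_loss = 1 - F.cosine_similarity(img_feat, text_feat).mean()
        l2_loss = lambda_l2 * (w - w0).pow(2).sum()
        loss = clip_loss + l2_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()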
arXiv Detail & Related papers (2021-03-31T17:51:25Z)
- Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embeddings with multi-aspect senses derived from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)
- Image-to-Image Translation with Text Guidance [139.41321867508722]
The goal of this paper is to embed controllable factors, i.e., natural language descriptions, into image-to-image translation with generative adversarial networks.
We propose key components including: (1) part-of-speech tagging to filter out non-semantic words in the given description, (2) an affine combination module (sketched below) to effectively fuse text and image features from different modalities, and (3) a refined multi-stage architecture to strengthen the discriminative ability of discriminators and the rectification ability of generators.
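
As a rough illustration, the affine combination module mentioned in (2) could be sketched as text features predicting a per-location scale and shift that modulate the image feature map; the channel sizes and kernel choices here are assumptions, not the paper's configuration.

```python
import torch.nn as nn

class AffineCombinationModule(nn.Module):
    """Sketch: text features predict a spatial scale and shift that
    modulate the image feature map, fusing the two modalities both
    multiplicatively and additively."""

    def __init__(self, img_channels=256, text_channels=256):
        super().__init__()
        self.scale = nn.Conv2d(text_channels, img_channels, 3, padding=1)
        self.shift = nn.Conv2d(text_channels, img_channels, 3, padding=1)

    def forward(self, img_feats, text_feats):
        # img_feats:  (B, C, H, W) image features
        # text_feats: (B, C_t, H, W) text features broadcast over the grid
        return img_feats * self.scale(text_feats) + self.shift(text_feats)
```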
arXiv Detail & Related papers (2020-02-12T21:09:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.