Text as Neural Operator: Image Manipulation by Text Instruction
- URL: http://arxiv.org/abs/2008.04556v4
- Date: Mon, 29 Nov 2021 16:48:56 GMT
- Title: Text as Neural Operator: Image Manipulation by Text Instruction
- Authors: Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee,
Irfan Essa
- Abstract summary: In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal, including (1) a reference image and (2) an instruction in natural language that describes the desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
- Score: 68.53181621741632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, text-guided image manipulation has gained increasing
attention in the multimedia and computer vision community. The input to
conditional image generation has evolved from image-only to multimodal. In
this paper, we study a setting that allows users to edit an image with multiple
objects using complex text instructions to add, remove, or change the objects.
image. The inputs of the task are multimodal, including (1) a reference image and (2)
an instruction in natural language that describes the desired modifications to the
image. We propose a GAN-based method to tackle this problem. The key idea is to
treat text as neural operators that locally modify the image features. We show
that the proposed model performs favorably against recent strong baselines on
three public datasets. Specifically, it generates images of greater fidelity
and semantic relevance, and when used as an image query, leads to better
retrieval performance.
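As a concrete illustration of the key idea, the sketch below shows how an instruction embedding could act as an operator that locally modulates spatial image features: a text-conditioned gate predicts where to edit, and a text-conditioned affine transform predicts how. This is a minimal sketch under assumed shapes and module names, not the paper's actual architecture.

```python
# Minimal sketch of "text as neural operator"; all module names, shapes,
# and the gating mechanism are illustrative assumptions.
import torch
import torch.nn as nn

class TextOperator(nn.Module):
    """Turns an instruction embedding into a local edit of image features."""
    def __init__(self, text_dim=256, feat_dim=128):
        super().__init__()
        # Predict where to edit (spatial gate) and how to edit (scale/shift).
        self.where = nn.Conv2d(feat_dim + text_dim, 1, kernel_size=1)
        self.scale = nn.Linear(text_dim, feat_dim)
        self.shift = nn.Linear(text_dim, feat_dim)

    def forward(self, feat, text):
        # feat: (B, C, H, W) image features; text: (B, T) instruction embedding.
        b, c, h, w = feat.shape
        text_map = text[:, :, None, None].expand(-1, -1, h, w)
        gate = torch.sigmoid(self.where(torch.cat([feat, text_map], dim=1)))
        edited = (feat * self.scale(text)[:, :, None, None]
                  + self.shift(text)[:, :, None, None])
        # Blend: apply the edit only where the gate is active.
        return gate * edited + (1 - gate) * feat

feat = torch.randn(2, 128, 16, 16)    # encoder features of the input image
text = torch.randn(2, 256)            # embedding of e.g. "remove the red cube"
out = TextOperator()(feat, text)      # locally modified features -> decoder
```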
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
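A minimal sketch of what such a token-budget allocation could look like; the proportional-to-area rule and all names below are assumptions, not Leopard's actual scheme.

```python
# Hedged sketch: split a fixed visual-token budget across multiple images.
def allocate_visual_tokens(image_sizes, budget=2048, min_tokens=64):
    """Allocate tokens per image proportional to pixel area, with a floor."""
    areas = [w * h for (w, h) in image_sizes]
    total = sum(areas)
    alloc = [max(min_tokens, int(budget * a / total)) for a in areas]
    # Trim any overshoot caused by the per-image floor, largest share first.
    while sum(alloc) > budget:
        alloc[alloc.index(max(alloc))] -= 1
    return alloc

print(allocate_visual_tokens([(1024, 768), (640, 480), (320, 240)]))
```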
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training [5.239585892767183]
We propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training.
Our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks.
arXiv Detail & Related papers (2023-09-21T09:34:20Z)
- What You See is What You Read? Improving Text-Image Alignment Evaluation [28.722369586165108]
We study methods for automatic text-image alignment evaluation.
We first introduce SeeTRUE, spanning multiple datasets from both text-to-image and image-to-text generation tasks.
We describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models.
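The first method can be sketched as below; `question_gen` and `vqa` are hypothetical stand-ins for real models, and the fraction-of-agreements scoring rule is an assumption rather than SeeTRUE's exact metric.

```python
# Sketch of a question-generation + VQA alignment pipeline.
def alignment_score(image, caption, question_gen, vqa):
    """Score = fraction of caption-derived QA pairs the image agrees with."""
    qa_pairs = question_gen(caption)       # [(question, expected_answer), ...]
    hits = sum(vqa(image, q) == a for q, a in qa_pairs)
    return hits / max(len(qa_pairs), 1)

# Toy stand-ins so the sketch runs end to end.
question_gen = lambda c: [("Is there a dog?", "yes"), ("Is it outdoors?", "yes")]
vqa = lambda img, q: "yes"
print(alignment_score(object(), "a dog playing outdoors", question_gen, vqa))  # 1.0
```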
arXiv Detail & Related papers (2023-05-17T17:43:38Z)
- Bi-directional Training for Composed Image Retrieval via Text Prompt Learning [46.60334745348141]
Composed image retrieval searches for a target image based on a multi-modal user query comprised of a reference image and modification text.
We propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures.
Experiments on two standard datasets show that our novel approach achieves improved performance over a baseline BLIP-based model.
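One way a reversed query might be wired in, sketched under the assumption of a CLIP-like additive composer with a learnable prompt vector marking direction; this is illustrative, not the paper's exact design.

```python
# Sketch of bi-directional composed image retrieval with reversed queries.
import torch
import torch.nn as nn

class BiDirectionalCIR(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Learnable prompt that marks a query as reversed (assumed mechanism).
        self.backward_prompt = nn.Parameter(torch.randn(dim))

    def compose(self, img_emb, txt_emb, reverse=False):
        if reverse:  # target image + reversed text should retrieve the reference
            txt_emb = txt_emb + self.backward_prompt
        return nn.functional.normalize(img_emb + txt_emb, dim=-1)

model = BiDirectionalCIR()
ref, tgt, txt = (torch.randn(4, 512) for _ in range(3))
fwd = model.compose(ref, txt)                # should match the target image
bwd = model.compose(tgt, txt, reverse=True)  # should match the reference image
```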
arXiv Detail & Related papers (2023-03-29T11:37:41Z)
- Interactive Image Manipulation with Complex Text Instructions [14.329411711887115]
We propose a novel image manipulation method that interactively edits an image using complex text instructions.
It allows users not only to improve the accuracy of image manipulation but also to perform complex edits such as enlarging, shrinking, or removing objects.
Extensive experiments on the Caltech-UCSD Birds-200-2011 (CUB) and Microsoft Common Objects in Context (MS COCO) datasets demonstrate that the proposed method enables interactive, flexible, and accurate image manipulation in real time.
arXiv Detail & Related papers (2022-11-25T08:05:52Z)
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
arXiv Detail & Related papers (2022-09-29T00:57:28Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
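A hedged sketch of what a simple image-set retrieval baseline could look like: pool per-image similarities against the article embedding. The mean-pooling choice is an assumption; the paper's baseline may differ.

```python
# Sketch: score an article against a candidate image set by mean-pooling
# per-image similarities (embeddings assumed L2-normalized).
import torch

def set_score(article_emb, image_embs):
    """article_emb: (D,), image_embs: (N, D) -> scalar set-level similarity."""
    return (image_embs @ article_emb).mean()

article = torch.nn.functional.normalize(torch.randn(512), dim=0)
image_set = torch.nn.functional.normalize(torch.randn(5, 512), dim=1)
print(set_score(article, image_set))
```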
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- CIGLI: Conditional Image Generation from Language & Image [5.159265382427163]
We propose a new task called CIGLI: Conditional Image Generation from Language and Image.
Instead of generating an image based on text as in text-image generation, this task requires the generation of an image from a textual description and an image prompt.
arXiv Detail & Related papers (2021-08-20T00:58:42Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
A visual-linguistic similarity module learns text-image matching by mapping images and text into a common embedding space.
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
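The visual-linguistic similarity component described above can be sketched as a standard common-embedding contrastive setup; the projection dimensions, temperature, and loss below are assumptions for illustration, not TediGAN's exact training recipe.

```python
# Sketch: map image and text features into a shared space and train with a
# symmetric contrastive loss (matched pairs on the diagonal).
import torch
import torch.nn as nn
import torch.nn.functional as F

img_proj = nn.Linear(2048, 512)   # image features -> common embedding space
txt_proj = nn.Linear(768, 512)    # text features  -> common embedding space

img = F.normalize(img_proj(torch.randn(8, 2048)), dim=1)
txt = F.normalize(txt_proj(torch.randn(8, 768)), dim=1)

logits = img @ txt.t() / 0.07     # temperature is an assumed hyperparameter
labels = torch.arange(8)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```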