SSCR: Iterative Language-Based Image Editing via Self-Supervised
Counterfactual Reasoning
- URL: http://arxiv.org/abs/2009.09566v2
- Date: Tue, 29 Sep 2020 00:24:25 GMT
- Title: SSCR: Iterative Language-Based Image Editing via Self-Supervised
Counterfactual Reasoning
- Authors: Tsu-Jui Fu, Xin Eric Wang, Scott Grafton, Miguel Eckstein, William
Yang Wang
- Abstract summary: Iterative Language-Based Image Editing (ILBIE) tasks follow iterative instructions to edit images step by step.
Data scarcity is a significant issue for ILBIE as it is challenging to collect large-scale examples of images before and after instruction-based changes.
We introduce a Self-Supervised Counterfactual Reasoning framework that incorporates counterfactual thinking to overcome data scarcity.
- Score: 79.30956389694184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Iterative Language-Based Image Editing (ILBIE) tasks follow iterative
instructions to edit images step by step. Data scarcity is a significant issue
for ILBIE as it is challenging to collect large-scale examples of images before
and after instruction-based changes. However, humans still accomplish these
editing tasks even when presented with an unfamiliar image-instruction pair.
Such ability results from counterfactual thinking and the ability to think
about alternatives to events that have happened already. In this paper, we
introduce a Self-Supervised Counterfactual Reasoning (SSCR) framework that
incorporates counterfactual thinking to overcome data scarcity. SSCR allows the
model to consider out-of-distribution instructions paired with previous images.
With the help of cross-task consistency (CTC), we train these counterfactual
instructions in a self-supervised scenario. Extensive results show that SSCR
improves the correctness of ILBIE in terms of both object identity and
position, establishing a new state of the art (SOTA) on two ILBIE datasets
(i-CLEVR and CoDraw). Even with only 50% of the training data, SSCR achieves a
comparable result to using complete data.
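The self-supervised idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the symbolic canvas, the `perfect_editor`, `sloppy_editor`, `explainer`, `counterfactual_instruction`, and `consistency_loss` names, and the mismatch-count loss are all hypothetical stand-ins. It also assumes that cross-task consistency means an edit should be recoverable by a second task that describes the change between two images, which is one reading of the abstract rather than something it states.

```python
import random

COLORS = ["red", "green", "blue"]
SHAPES = ["cube", "sphere", "cylinder"]

# A "canvas" is a frozenset of (color, shape, (x, y)) objects.
# An "instruction" is one such tuple, meaning "add this object".

def perfect_editor(canvas, instruction):
    """Hypothetical stand-in for the image editor: applies the edit exactly."""
    return canvas | {instruction}

def sloppy_editor(canvas, instruction):
    """Hypothetical editor that misplaces the object by one cell in x."""
    color, shape, (x, y) = instruction
    return canvas | {(color, shape, (x + 1, y))}

def explainer(prev_canvas, new_canvas):
    """Hypothetical second task: recover the instruction from the canvas diff."""
    added = new_canvas - prev_canvas
    return next(iter(added)) if added else None

def counterfactual_instruction(rng, canvas, grid=3):
    """Sample an instruction that has no ground-truth result image."""
    while True:
        candidate = (rng.choice(COLORS), rng.choice(SHAPES),
                     (rng.randrange(grid), rng.randrange(grid)))
        if candidate not in canvas:
            return candidate

def consistency_loss(editor, prev_canvas, instruction):
    """Count instruction fields that the explainer fails to recover."""
    recovered = explainer(prev_canvas, editor(prev_canvas, instruction))
    if recovered is None:
        return len(instruction)
    return sum(a != b for a, b in zip(instruction, recovered))

if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed ground-truth instruction sequence for one editing session.
    session = [("red", "cube", (0, 0)), ("blue", "sphere", (1, 1))]
    canvas = frozenset()
    for step, real in enumerate(session):
        # Pair a counterfactual instruction with the previous canvas.
        cf = counterfactual_instruction(rng, canvas)
        print(step, cf,
              consistency_loss(perfect_editor, canvas, cf),
              consistency_loss(sloppy_editor, canvas, cf))
        # Advance the session with the real instruction.
        canvas = perfect_editor(canvas, real)
```

The point of the sketch is that the loss needs no target image for the counterfactual instruction: the instruction itself serves as the supervision signal, so an editor that misplaces objects is penalized even on instruction-image pairs that never appear in the training data.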
Related papers
- MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval [20.612534837883892]
Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images.
In this paper, we propose a two-stage framework to tackle both the modality and task discrepancies.
MoTaDual achieves the state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost.
arXiv Detail & Related papers (2024-10-31T08:49:05Z)
- Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity [2.724141845301679]
Composed image retrieval (CIR) formulates the query as a combination of a reference image and modified text.
We introduce a training-free approach for ZS-CIR.
Our approach is simple, easy to implement, and its effectiveness is validated through experiments on the FashionIQ and CIRR datasets.
arXiv Detail & Related papers (2024-09-07T21:52:58Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Exploring Text-Guided Single Image Editing for Remote Sensing Images [30.23541304590692]
This paper proposes a text-guided RSI editing method that is controllable but stable, and can be trained using only a single image.
It adopts a multi-scale training approach to preserve consistency without the need for training on extensive benchmark datasets.
arXiv Detail & Related papers (2024-05-09T13:45:04Z)
- Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval [43.47770490199544]
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query composed of an image and a caption.
We introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations.
We also introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed.
arXiv Detail & Related papers (2024-05-01T15:19:54Z)
- Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress on the CIR task with advanced large vision-language (VL) models. However, they generally suffer from two main issues: a lack of labeled triplets for model training and difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z)
- Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database.
Recent research sidesteps this need by using large-scale vision-language models (VLMs).
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL).
arXiv Detail & Related papers (2023-10-13T17:59:38Z)
- Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR).
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability.
arXiv Detail & Related papers (2023-06-12T17:56:01Z)
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It compares favourably with its counterparts on image fidelity and CLIP alignment score, as well as in qualitative comparisons, when editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.