SSCR: Iterative Language-Based Image Editing via Self-Supervised
Counterfactual Reasoning
- URL: http://arxiv.org/abs/2009.09566v2
- Date: Tue, 29 Sep 2020 00:24:25 GMT
- Title: SSCR: Iterative Language-Based Image Editing via Self-Supervised
Counterfactual Reasoning
- Authors: Tsu-Jui Fu, Xin Eric Wang, Scott Grafton, Miguel Eckstein, William
Yang Wang
- Abstract summary: Iterative Language-Based Image Editing (ILBIE) tasks follow iterative instructions to edit images step by step.
Data scarcity is a significant issue for ILBIE as it is challenging to collect large-scale examples of images before and after instruction-based changes.
We introduce a Self-Supervised Counterfactual Reasoning framework that incorporates counterfactual thinking to overcome data scarcity.
- Score: 79.30956389694184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Iterative Language-Based Image Editing (ILBIE) tasks follow iterative
instructions to edit images step by step. Data scarcity is a significant issue
for ILBIE as it is challenging to collect large-scale examples of images before
and after instruction-based changes. However, humans still accomplish these
editing tasks even when presented with an unfamiliar image-instruction pair.
Such ability results from counterfactual thinking: the capacity to consider
alternatives to events that have already happened. In this paper, we
introduce a Self-Supervised Counterfactual Reasoning (SSCR) framework that
incorporates counterfactual thinking to overcome data scarcity. SSCR allows the
model to consider out-of-distribution instructions paired with previous images.
With the help of cross-task consistency (CTC), we train on these counterfactual
instructions in a self-supervised scenario. Extensive results show that SSCR
improves the correctness of ILBIE in terms of both object identity and
position, establishing a new state of the art (SOTA) on two ILBIE datasets
(i-CLEVR and CoDraw). Even with only 50% of the training data, SSCR achieves a
comparable result to using complete data.
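To make the framework concrete, the abstract's two ingredients (counterfactual instructions plus CTC) can be written as one training step. The following is a minimal PyTorch-style sketch under stated assumptions; `editor`, `reconstructor`, and `counterfactual_pool` are hypothetical names, not the authors' released code:

```python
# Hedged sketch of one SSCR training step.
# Assumed interfaces: editor(image, instruction_tokens) -> edited image;
# reconstructor(before_image, after_image) -> per-token instruction logits.
import random
import torch.nn.functional as F

def sscr_step(editor, reconstructor, prev_img, instr_tokens, target_img,
              counterfactual_pool):
    # 1) Supervised editing: follow the real instruction toward the ground truth.
    pred_img = editor(prev_img, instr_tokens)
    edit_loss = F.l1_loss(pred_img, target_img)

    # 2) Counterfactual reasoning: pair the same previous image with an
    #    out-of-distribution instruction that has no ground-truth target image.
    cf_tokens = random.choice(counterfactual_pool)  # token ids, shape (seq_len,)
    cf_img = editor(prev_img, cf_tokens)

    # 3) Cross-task consistency (CTC): recovering the counterfactual instruction
    #    from the (before, after) pair gives a self-supervised training signal.
    logits = reconstructor(prev_img, cf_img)        # shape (seq_len, vocab_size)
    ctc_loss = F.cross_entropy(logits, cf_tokens)

    return edit_loss + ctc_loss
```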
Related papers
- UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency [69.33072075580483]
We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training.
Our method addresses these challenges by introducing a novel editing mechanism called Cycle Edit Consistency (CEC).
CEC applies forward and backward edits in one training step and enforces consistency in image and attention spaces.
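For intuition, the CEC objective can be sketched as a single training step. The code below is a hedged illustration, assuming an editing model that returns both the edited image and its attention map; the function names are hypothetical, not the paper's code:

```python
# Illustrative sketch of a Cycle Edit Consistency (CEC) training step.
# `edit_model` is assumed to return (edited_image, attention_map).
import torch.nn.functional as F

def cec_step(edit_model, image, forward_instr, backward_instr):
    # Forward edit, e.g. "add a hat", then the inverse edit, e.g. "remove the hat".
    edited, attn_fwd = edit_model(image, forward_instr)
    recovered, attn_bwd = edit_model(edited, backward_instr)

    # Image-space consistency: the cycle should return the original image,
    # which removes the need for ground-truth edited images.
    img_loss = F.l1_loss(recovered, image)
    # Attention-space consistency: both edits should focus on the same region.
    attn_loss = F.mse_loss(attn_fwd, attn_bwd)
    return img_loss + attn_loss
```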
arXiv Detail & Related papers (2024-12-19T18:59:58Z)
- Compositional Image Retrieval via Instruction-Aware Contrastive Learning [40.54022628032561]
Composed Image Retrieval (CIR) involves retrieving a target image based on a composed query of an image paired with text that specifies modifications or changes to the visual reference.
In practice, due to the scarcity of annotated data in downstream tasks, Zero-Shot CIR (ZS-CIR) is desirable.
We propose a novel embedding method utilizing an instruction-tuned Multimodal LLM (MLLM) to generate composed representations.
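As a rough illustration of such an embedding method, the sketch below prompts an instruction-tuned MLLM with the reference image and modification text and pools a single vector as the composed query; the prompt template and last-token pooling are assumptions, not the paper's exact recipe:

```python
# Hedged sketch: composing a ZS-CIR query with an instruction-tuned MLLM.
import torch
import torch.nn.functional as F

@torch.no_grad()
def composed_query(mllm, reference_image, modification_text):
    # Hypothetical prompt template; the MLLM reads image and text jointly.
    prompt = f"Describe the reference image modified as follows: {modification_text}"
    hidden = mllm(images=reference_image, text=prompt).last_hidden_state
    # Pool the final token's hidden state as the composed representation.
    return F.normalize(hidden[:, -1], dim=-1)

@torch.no_grad()
def retrieve(query_emb, gallery_embs, k=5):
    # Cosine similarity against pre-normalized gallery image embeddings.
    scores = query_emb @ gallery_embs.T
    return scores.topk(k, dim=-1).indices
```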
arXiv Detail & Related papers (2024-12-07T22:46:52Z)
- MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval [20.612534837883892]
Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images.
In this paper, we propose a two-stage framework to tackle both the modality discrepancy and the task discrepancy.
MoTaDual achieves state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost.
arXiv Detail & Related papers (2024-10-31T08:49:05Z)
- Exploring Text-Guided Single Image Editing for Remote Sensing Images [30.23541304590692]
This paper proposes a text-guided RSI editing method that is controllable yet stable, and can be trained on only a single image.
It adopts a multi-scale training approach to preserve consistency without the need for training on extensive benchmark datasets.
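A hedged sketch of what single-image multi-scale training can look like is given below; the scale set and the reconstruction loss are assumptions for illustration, not the paper's exact objective:

```python
# Illustrative multi-scale training loss for a single training image.
import torch.nn.functional as F

def multiscale_loss(model, image, text, scales=(0.25, 0.5, 1.0)):
    # `image` is a (B, C, H, W) tensor; `model` edits it under the text prompt.
    total = 0.0
    for s in scales:
        size = [max(1, int(d * s)) for d in image.shape[-2:]]
        img_s = F.interpolate(image, size=size, mode="bilinear",
                              align_corners=False)
        out = model(img_s, text)
        # Reconstruction at every scale anchors the edit to the source image,
        # standing in for training on an extensive benchmark dataset.
        total = total + F.l1_loss(out, img_s)
    return total
```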
arXiv Detail & Related papers (2024-05-09T13:45:04Z)
- Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval [43.47770490199544]
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query composed of an image and a caption.
We introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations.
We also introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed.
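The Slerp merge itself is the standard spherical interpolation formula; a minimal NumPy sketch follows, where the interpolation weight t=0.5 is illustrative rather than the paper's tuned value:

```python
# Spherical linear interpolation between unit-normalized image and text embeddings.
import numpy as np

def slerp(u, v, t=0.5):
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    omega = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))  # angle between embeddings
    if np.isclose(omega, 0.0):
        return u  # nearly parallel vectors: interpolation is a no-op
    # Unlike a linear mix, the result stays on the unit hypersphere.
    return (np.sin((1 - t) * omega) * u + np.sin(t * omega) * v) / np.sin(omega)

# Usage: merge a CLIP image embedding with its caption embedding as the query.
# query = slerp(image_emb, text_emb, t=0.5)
```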
arXiv Detail & Related papers (2024-05-01T15:19:54Z)
- Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress with advanced large vision-language (VL) models on the CIR task; however, they generally suffer from two main issues: a lack of labeled triplets for model training and the difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
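A rough sketch of the Image2Sentence idea follows, under the assumption that a lightweight adapter maps image features to pseudo word tokens that slot into a frozen text encoder; module names are illustrative:

```python
# Hypothetical adapter: image features -> pseudo word tokens for the text encoder.
import torch.nn as nn

class ImageToTokens(nn.Module):
    def __init__(self, img_dim, word_dim, n_tokens=4):
        super().__init__()
        # A lightweight projection, trainable on unlabeled images only.
        self.proj = nn.Linear(img_dim, n_tokens * word_dim)
        self.n_tokens, self.word_dim = n_tokens, word_dim

    def forward(self, img_feat):              # img_feat: (B, img_dim)
        toks = self.proj(img_feat)            # (B, n_tokens * word_dim)
        return toks.view(-1, self.n_tokens, self.word_dim)

# At query time the pseudo tokens can replace an "<img>" slot in a template such
# as "a photo of <img> that {modification_text}", which the frozen text encoder
# embeds as the retrieval query.
```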
arXiv Detail & Related papers (2024-03-03T07:58:03Z)
- Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database, given a reference image and a textual modification.
Recent research sidesteps the need for expensive annotated training data by using large-scale vision-language models (VLMs).
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL).
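In the spirit of that training-free pipeline: caption the reference image, let an LLM recompose the caption according to the modification, then retrieve by text. The callables below (`captioner`, `llm`, `clip`) are placeholders, and the prompt is an assumption:

```python
# Hedged sketch of training-free compositional retrieval via language.
def training_free_cir(captioner, llm, clip, reference_image,
                      modification_text, gallery_embs):
    caption = captioner(reference_image)      # e.g. "a brown dog on a sofa"
    prompt = ("Edit this image description according to the instruction.\n"
              f"Description: {caption}\n"
              f"Instruction: {modification_text}\n"
              "Edited description:")
    target_caption = llm(prompt)              # reasoning happens purely in language
    query = clip.encode_text(target_caption)  # normalized text embedding (assumed)
    scores = query @ gallery_embs.T           # cosine similarity to gallery images
    return scores.argsort(descending=True)    # ranked candidate indices
```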
arXiv Detail & Related papers (2023-10-13T17:59:38Z)
- Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR).
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending users' ability to express their search intent.
arXiv Detail & Related papers (2023-06-12T17:56:01Z)
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, and qualitatively when editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)