LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data
- URL: http://arxiv.org/abs/2208.14889v4
- Date: Mon, 24 Apr 2023 08:14:41 GMT
- Title: LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data
- Authors: Jihye Park, Sunwoo Kim, Soohyun Kim, Seokju Cho, Jaejun Yoo, Youngjung
Uh, Seungryong Kim
- Abstract summary: We present a LANguage-driven Image-to-image Translation model, dubbed LANIT.
We leverage easy-to-obtain candidate attributes, given as text for a dataset: the similarity between images and attributes indicates per-sample domain labels.
Experiments on several standard benchmarks demonstrate that LANIT achieves comparable or superior performance to existing models.
- Score: 39.421312439022316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing techniques for image-to-image translation commonly suffer
from two critical problems: heavy reliance on per-sample domain annotation
and/or an inability to handle multiple attributes per image. Recent
truly-unsupervised methods adopt clustering approaches to easily provide
per-sample one-hot domain labels. However, they cannot account for the
real-world setting in which one sample may have multiple attributes. In
addition, the semantics of the clusters are not easily coupled to human
understanding. To overcome these issues, we present a LANguage-driven
Image-to-image Translation model, dubbed LANIT. We leverage easy-to-obtain
candidate attributes given as text for a dataset: the similarity between images
and attributes indicates per-sample domain labels. This formulation naturally
enables multi-hot labels, so that users can specify the target domain with a
set of attributes in language. To account for the case where the initial
prompts are inaccurate, we also present prompt learning. We further present a
domain regularization loss that enforces that translated images are mapped to
the corresponding domain. Experiments on several standard benchmarks
demonstrate that LANIT achieves comparable or superior performance to existing
models.
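To make the label-assignment idea concrete, below is a minimal sketch (not the authors' code) of how image-attribute similarity from a pretrained vision-language model such as CLIP could be turned into a per-sample multi-hot domain label. The Hugging Face CLIP checkpoint, the prompt template, the example attribute list, and the uniform threshold are illustrative assumptions; LANIT's actual prompt learning and thresholding rule differ.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumed off-the-shelf CLIP backbone (not the checkpoint used in the paper).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical candidate attributes and prompt template for a face dataset.
candidate_attributes = ["smiling", "blond hair", "wearing glasses"]
prompts = [f"a photo of a person with {a}" for a in candidate_attributes]

image = Image.open("face.jpg")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    # Similarity of the image to each attribute prompt, normalized over prompts.
    sims = outputs.logits_per_image.softmax(dim=-1).squeeze(0)

# Hypothetical threshold: keep attributes more likely than uniform chance.
threshold = 1.0 / len(prompts)
multi_hot = (sims > threshold).long()
print(dict(zip(candidate_attributes, multi_hot.tolist())))
```

Attributes whose similarity clears the threshold form the multi-hot domain specification that, per the abstract, lets users describe the target domain as a set of attributes in language.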
Related papers
- WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization [63.98650220772378]
We present WIDIn, Wording Images for Domain-Invariant representation, to disentangle discriminative visual representation.
We first estimate the language embedding with fine-grained alignment, which can be used to adaptively identify and then remove domain-specific counterpart.
We show that WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal models like MoCo and BERT.
arXiv Detail & Related papers (2024-05-28T17:46:27Z)
- Improving fine-grained understanding in image-text pre-training [37.163228122323865]
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs.
We show improved performance over competing approaches on image-level tasks that rely on coarse-grained information.
arXiv Detail & Related papers (2024-01-18T10:28:45Z)
- TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification [59.779532652634295]
We propose an embarrassingly simple approach to better align image and text features, with no need for data formats other than image-text pairs.
We parse objects and attributes from the description, which are highly likely to exist in the image.
Experiments substantiate an average 5.2% improvement of our framework over existing alternatives.
arXiv Detail & Related papers (2023-12-21T18:59:06Z)
- Disentangled Unsupervised Image Translation via Restricted Information Flow [61.44666983942965]
Many state-of-the-art methods hard-code the desired shared-vs-specific split into their architecture.
We propose a new method that does not rely on inductive architectural biases.
We show that the proposed method achieves consistently high manipulation accuracy across two synthetic and one natural dataset.
arXiv Detail & Related papers (2021-11-26T00:27:54Z)
- Rethinking the Truly Unsupervised Image-to-Image Translation [29.98784909971291]
The truly unsupervised image-to-image translation model (TUNIT) learns to separate image domains and translate input images into the estimated domains.
Experimental results show that TUNIT achieves comparable or even better performance than the set-level supervised model trained with full labels.
TUNIT can be easily extended to semi-supervised learning with a small amount of labeled data.
arXiv Detail & Related papers (2020-06-11T15:15:12Z)
- Semi-supervised Learning for Few-shot Image-to-Image Translation [89.48165936436183]
We propose a semi-supervised method for few-shot image translation, called SEMIT.
Our method achieves excellent results on four different datasets using as little as 10% of the source labels.
arXiv Detail & Related papers (2020-03-30T22:46:49Z)
- GMM-UNIT: Unsupervised Multi-Domain and Multi-Modal Image-to-Image Translation via Attribute Gaussian Mixture Modeling [66.50914391679375]
Unsupervised image-to-image translation (UNIT) aims at learning a mapping between several visual domains by using unpaired training images.
Recent studies have shown remarkable success with multiple domains, but they suffer from two main limitations.
We propose a method named GMM-UNIT, which is based on a content-attribute disentangled representation in which the attribute space is fitted with a Gaussian mixture model (GMM).
arXiv Detail & Related papers (2020-03-15T10:18:56Z)
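As a rough illustration of the attribute-space modeling described in the GMM-UNIT summary above, the sketch below fits a GMM over placeholder attribute codes and samples new codes from it. The code dimensionality, component count, and random stand-in data are assumptions for illustration, not GMM-UNIT's actual configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical attribute codes, standing in for vectors produced by an attribute encoder.
rng = np.random.default_rng(0)
attribute_codes = rng.normal(size=(1000, 8))

# Fit a GMM over the attribute space; each component loosely corresponds to one domain.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(attribute_codes)

# Sample new attribute codes; in GMM-UNIT such samples would condition the decoder
# to render a content code in the sampled domain/style.
sampled_codes, component_ids = gmm.sample(n_samples=5)
print(sampled_codes.shape, component_ids)
```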
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.