Meta-Learning and Self-Supervised Pretraining for Real World Image
Translation
- URL: http://arxiv.org/abs/2112.11929v1
- Date: Wed, 22 Dec 2021 14:48:22 GMT
- Title: Meta-Learning and Self-Supervised Pretraining for Real World Image
Translation
- Authors: Ileana Rugina, Rumen Dangovski, Mark Veillette, Pooya Khorrami, Brian
Cheung, Olga Simek, Marin Soljačić
- Abstract summary: We explore a recently introduced image-to-image translation problem in order to formulate a novel multi-task few-shot image generation benchmark.
We present several baselines for the few-shot problem and discuss trade-offs between different approaches.
- Score: 5.469808405577674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in deep learning, in particular enabled by hardware advances
and big data, have provided impressive results across a wide range of
computational problems such as computer vision, natural language processing, and
reinforcement learning. However, many of these improvements are constrained to
problems with large-scale curated datasets that require substantial human labor
to gather. Additionally, these models tend to generalize poorly under both
slight distributional shifts and low-data regimes. In recent years, emerging
fields such as meta-learning or self-supervised learning have been closing the
gap between proof-of-concept results and real-life applications of machine
learning by extending deep learning to the semi-supervised and few-shot
domains. We follow this line of work and explore spatio-temporal structure in a
recently introduced image-to-image translation problem in order to: i)
formulate a novel multi-task few-shot image generation benchmark and ii)
explore data augmentations in contrastive pre-training for image translation
downstream tasks. We present several baselines for the few-shot problem and
discuss trade-offs between different approaches. Our code is available at
https://github.com/irugina/meta-image-translation.
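To make the two ingredients above concrete, here is a minimal, hypothetical sketch of a Reptile-style meta-learning baseline for few-shot image-to-image translation. It is a sketch under assumptions, not the paper's implementation: the toy model, the task sampler, and all hyperparameters are placeholders, and the actual baselines live in the linked repository.

```python
# Minimal sketch of episodic few-shot image-to-image translation training
# with a Reptile-style inner/outer loop. All names and numbers here are
# hypothetical placeholders, not the interface of the paper's released code.
import copy
import torch
import torch.nn as nn

class TinyTranslator(nn.Module):
    """Toy encoder-decoder standing in for an image-to-image model."""
    def __init__(self, channels=3, width=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, channels, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def sample_task(k_shot=4, size=32):
    """Placeholder task sampler returning (input, target) image pairs for the
    support and query sets; a real benchmark would draw them from a multi-task
    dataset with spatio-temporal structure."""
    support = (torch.randn(k_shot, 3, size, size), torch.randn(k_shot, 3, size, size))
    query = (torch.randn(k_shot, 3, size, size), torch.randn(k_shot, 3, size, size))
    return support, query

model = TinyTranslator()
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()
inner_lr, inner_steps = 1e-3, 3

for meta_step in range(100):
    support, query = sample_task()
    # Inner loop: adapt a copy of the model on the task's support set.
    fast = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        loss_fn(fast(support[0]), support[1]).backward()
        inner_opt.step()
    # Held-out query loss, used here only to monitor adaptation quality.
    with torch.no_grad():
        query_loss = loss_fn(fast(query[0]), query[1])
    # Outer loop (Reptile-style): move the meta-parameters toward the adapted copy.
    meta_opt.zero_grad()
    for p, fp in zip(model.parameters(), fast.parameters()):
        p.grad = p.data - fp.data  # gradient of 0.5 * ||p - fast_p||^2
    meta_opt.step()
```

A MAML-style baseline would instead backpropagate the query-set loss of the adapted copy into the meta-parameters, and a contrastive pre-training stage would warm-start the encoder on augmented image views before this episodic phase.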
Related papers
- A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation [13.426403221815063]
This paper surveys the landscape of language-and-vision pre-training from the lens of multimodal machine translation.
We summarize the common architectures, pre-training objectives, and datasets from literature and conjecture what further is needed to make progress on multimodal machine translation.
arXiv Detail & Related papers (2023-06-12T15:56:10Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
- Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment [40.677139679304936]
We propose a new framework, dubbed ViCHA, that efficiently exploits the input data to boost learning by: (a) a new hierarchical cross-modal alignment loss, (b) a new self-supervised scheme based on masked image modeling, and (c) leveraging image-level annotations.
Although pretrained on four times less data, our ViCHA strategy outperforms other approaches on several downstream tasks such as Image-Text Retrieval, VQA, Visual Reasoning, Visual Entailment and Visual Grounding.
arXiv Detail & Related papers (2022-08-29T14:24:08Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Synthetic-to-Real Domain Adaptation using Contrastive Unpaired Translation [28.19031441659854]
We propose a multi-step method to obtain training data without manual annotation effort.
From 3D object meshes, we generate images using a modern synthesis pipeline.
We utilize a state-of-the-art image-to-image translation method to adapt the synthetic images to the real domain.
arXiv Detail & Related papers (2022-03-17T17:13:23Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign image and text representations BEfore Fusing (ALBEF) them through cross-modal attention; a minimal sketch of this kind of image-text contrastive objective appears after this list.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis [68.3787368024951]
We propose a novel approach for multi-modal Image-to-image (I2I) translation.
We learn a latent embedding, jointly with the generator, that models the variability of the output domain.
Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.
arXiv Detail & Related papers (2021-04-14T19:58:24Z)
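As referenced in the ALBEF entry above, the following is a minimal, hypothetical sketch of a symmetric image-text contrastive (InfoNCE-style) alignment loss, the kind of objective used to align unimodal representations before cross-modal fusion. It is a simplified illustration only: the encoder outputs are random stand-ins, and ALBEF-specific components such as the momentum encoder, queues, and momentum distillation are omitted.

```python
# Minimal sketch of a symmetric image-text contrastive (InfoNCE) loss used to
# align unimodal embeddings before fusion. Simplified illustration only:
# momentum encoders, queues, and distillation are deliberately left out.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) embeddings of paired images/captions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> matching caption
    loss_t2i = F.cross_entropy(logits.t(), targets)   # caption -> matching image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random tensors standing in for encoder outputs.
image_emb = torch.randn(8, 256)
text_emb = torch.randn(8, 256)
print(contrastive_alignment_loss(image_emb, text_emb).item())
```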