Weakly supervised cross-domain alignment with optimal transport
- URL: http://arxiv.org/abs/2008.06597v1
- Date: Fri, 14 Aug 2020 22:48:36 GMT
- Title: Weakly supervised cross-domain alignment with optimal transport
- Authors: Siyang Yuan, Ke Bai, Liqun Chen, Yizhe Zhang, Chenyang Tao, Chunyuan Li, Guoyin Wang, Ricardo Henao, Lawrence Carin
- Abstract summary: Cross-domain alignment between image objects and text sequences is key to many visual-language tasks.
This paper investigates a novel approach for the identification and optimization of fine-grained semantic similarities between image and text entities.
- Score: 102.8572398001639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-domain alignment between image objects and text sequences is key to
many visual-language tasks, and it poses a fundamental challenge to both
computer vision and natural language processing. This paper investigates a
novel approach for the identification and optimization of fine-grained semantic
similarities between image and text entities, under a weakly-supervised setup,
improving performance over state-of-the-art solutions. Our method builds upon
recent advances in optimal transport (OT) to resolve the cross-domain matching
problem in a principled manner. Formulated as a drop-in regularizer, the
proposed OT solution can be efficiently computed and used in combination with
other existing approaches. We present empirical evidence to demonstrate the
effectiveness of our approach, showing how it enables simpler model
architectures to outperform or be comparable with more sophisticated designs on
a range of vision-language tasks.
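As a rough illustration of how an OT term could serve as a drop-in regularizer for matching image objects to text tokens, here is a minimal NumPy sketch. It is not the paper's implementation: the abstract names neither the OT solver, the cost function, nor the marginals, so the entropic Sinkhorn iteration, the cosine cost, the uniform marginals, and all function names below (`cosine_cost`, `sinkhorn_plan`, `ot_alignment_loss`) are assumptions made for illustration.

```python
import numpy as np


def cosine_cost(x, y, eps=1e-8):
    """Pairwise cost 1 - cosine similarity between rows of x (n, d) and y (m, d).

    Assumption: the abstract does not specify a cost; cosine is a common choice.
    """
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    y = y / (np.linalg.norm(y, axis=1, keepdims=True) + eps)
    return 1.0 - x @ y.T


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT plan between uniform marginals (Sinkhorn iterations).

    Assumption: uniform weights over regions and words; the solver choice is ours.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # mass on each image region
    b = np.full(m, 1.0 / m)   # mass on each word
    K = np.exp(-cost / reg)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)       # rescale rows toward marginal a
        v = b / (K.T @ u)     # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


def ot_alignment_loss(regions, words, reg=0.1, n_iters=200):
    """Transport cost <plan, cost>, usable as an extra term added to a task loss."""
    cost = cosine_cost(regions, words)
    plan = sinkhorn_plan(cost, reg, n_iters)
    return float(np.sum(plan * cost)), plan


# Hypothetical features: 5 image regions and 7 words in a shared 16-d space.
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 16))
words = rng.normal(size=(7, 16))
loss, plan = ot_alignment_loss(regions, words)
```

In a training setup one would typically add `loss`, scaled by a weight, to the main objective and compute it in an autodiff framework; the entries of `plan` can be read as soft region-to-word matches.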
Related papers
- Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment and Multi-Scale Token Recycling [11.129453244307369]
FG-SBIR aims to minimize the distance between sketches and corresponding images in the embedding space.
We propose an effective approach to narrow the gap between the two domains.
It mainly facilitates unified mutual information sharing both intra- and inter-samples.
arXiv Detail & Related papers (2024-06-17T13:49:12Z)
- A Multimodal Approach for Cross-Domain Image Retrieval [5.5547914920738]
Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision.
This paper introduces a novel unsupervised approach to CDIR that incorporates textual context by leveraging pre-trained vision-language models.
Our method, dubbed Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation.
arXiv Detail & Related papers (2024-03-22T12:08:16Z)
- OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization [65.57380193070574]
Vision-language pre-training models are vulnerable to multi-modal adversarial examples.
Recent works have indicated that leveraging data augmentation and image-text modal interactions can enhance the transferability of adversarial examples.
We propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack.
arXiv Detail & Related papers (2023-12-07T16:16:50Z)
- Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis [60.260724486834164]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries.
We present two key innovations: Vision Guidance and the Layered Rendering Diffusion framework.
We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image and image editing.
arXiv Detail & Related papers (2023-11-30T10:36:19Z)
- Marginal Contrastive Correspondence for Guided Image Generation [58.0605433671196]
Exemplar-based image translation establishes dense correspondences between a conditional input and an exemplar from two different domains.
Existing work builds the cross-domain correspondences implicitly by minimizing feature-wise distances across the two domains.
We design a Marginal Contrastive Learning Network (MCL-Net) that explores contrastive learning to learn domain-invariant features for realistic exemplar-based image translation.
arXiv Detail & Related papers (2022-04-01T13:55:44Z)
- Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization [21.904563910555368]
We propose a novel learning framework to construct a set of discriminative data domains within each image-text pair.
Our approach can generally improve the learning efficiency and the performance of existing metric learning frameworks.
arXiv Detail & Related papers (2020-10-23T01:48:37Z)
- DASGIL: Domain Adaptation for Semantic and Geometric-aware Image-based Localization [27.294822556484345]
Long-term visual localization under changing environments is a challenging problem in autonomous driving and mobile robotics.
We propose a novel multi-task architecture to fuse the geometric and semantic information into the multi-scale latent embedding representation for visual place recognition.
arXiv Detail & Related papers (2020-10-01T17:44:25Z)
- TSIT: A Simple and Versatile Framework for Image-to-Image Translation [103.92203013154403]
We introduce a simple and versatile framework for image-to-image translation.
We provide a carefully designed two-stream generative model with newly proposed feature transformations.
This allows multi-scale semantic structure information and style representation to be effectively captured and fused by the network.
A systematic study compares the proposed method with several state-of-the-art task-specific baselines, verifying its effectiveness in both perceptual quality and quantitative evaluations.
arXiv Detail & Related papers (2020-07-23T15:34:06Z)
- A Flexible Framework for Designing Trainable Priors with Adaptive Smoothing and Game Encoding [57.1077544780653]
We introduce a general framework for designing and training neural network layers whose forward passes can be interpreted as solving non-smooth convex optimization problems.
We focus on convex games, solved by local agents represented by the nodes of a graph and interacting through regularization functions.
This approach is appealing for solving imaging problems, as it allows the use of classical image priors within deep models that are trainable end to end.
arXiv Detail & Related papers (2020-06-26T08:34:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.