Weakly supervised cross-domain alignment with optimal transport
- URL: http://arxiv.org/abs/2008.06597v1
- Date: Fri, 14 Aug 2020 22:48:36 GMT
- Title: Weakly supervised cross-domain alignment with optimal transport
- Authors: Siyang Yuan, Ke Bai, Liqun Chen, Yizhe Zhang, Chenyang Tao, Chunyuan Li, Guoyin Wang, Ricardo Henao, Lawrence Carin
- Abstract summary: Cross-domain alignment between image objects and text sequences is key to many visual-language tasks.
This paper investigates a novel approach for the identification and optimization of fine-grained semantic similarities between image and text entities.
- Score: 102.8572398001639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-domain alignment between image objects and text sequences is key to
many visual-language tasks, and it poses a fundamental challenge to both
computer vision and natural language processing. This paper investigates a
novel approach for the identification and optimization of fine-grained semantic
similarities between image and text entities, under a weakly-supervised setup,
improving performance over state-of-the-art solutions. Our method builds upon
recent advances in optimal transport (OT) to resolve the cross-domain matching
problem in a principled manner. Formulated as a drop-in regularizer, the
proposed OT solution can be efficiently computed and used in combination with
other existing approaches. We present empirical evidence to demonstrate the
effectiveness of our approach, showing how it enables simpler model
architectures to outperform or be comparable with more sophisticated designs on
a range of vision-language tasks.
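To make the "drop-in regularizer" idea concrete, here is a minimal, dependency-free sketch of an OT alignment cost between image-region features and word features. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not name an OT solver, so entropic Sinkhorn iterations are used as a stand-in, and the cosine cost, the uniform marginals, and the names `cosine_cost`, `sinkhorn_plan`, `ot_alignment_cost`, `eps`, and `n_iters` are all hypothetical choices made for this example.

```python
import math


def cosine_cost(X, Y):
    """Cost matrix C[i][j] = 1 - cos(X[i], Y[j]) (assumed cost, not from the paper)."""
    def unit(v):
        norm = math.sqrt(sum(a * a for a in v)) or 1.0  # guard against zero vectors
        return [a / norm for a in v]

    Xn = [unit(x) for x in X]
    Yn = [unit(y) for y in Y]
    return [[1.0 - sum(a * b for a, b in zip(x, y)) for y in Yn] for x in Xn]


def sinkhorn_plan(C, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn scaling.

    Sinkhorn is a swapped-in solver chosen for brevity; the paper may use
    a different OT algorithm.
    """
    n, m = len(C), len(C[0])
    a = [1.0 / n] * n  # uniform mass over image regions (assumption)
    b = [1.0 / m] * m  # uniform mass over words (assumption)
    K = [[math.exp(-c / eps) for c in row] for row in C]  # Gibbs kernel
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_alignment_cost(regions, words, eps=0.1, n_iters=200):
    """Return (transport cost <T, C>, transport plan T) for one image-text pair."""
    C = cosine_cost(regions, words)
    T = sinkhorn_plan(C, eps, n_iters)
    cost = sum(T[i][j] * C[i][j] for i in range(len(C)) for j in range(len(C[0])))
    return cost, T


# Toy features: two "regions" and two "words" in a shared 2-D embedding space.
regions = [[1.0, 0.0], [0.0, 1.0]]
aligned_words = [[1.0, 0.0], [0.0, 1.0]]
opposed_words = [[-1.0, 0.0], [0.0, -1.0]]

cost_aligned, plan = ot_alignment_cost(regions, aligned_words)
cost_opposed, _ = ot_alignment_cost(regions, opposed_words)
print(cost_aligned < cost_opposed)  # aligned pairs are cheaper to transport
```

In a training loop, such a cost would typically be added to an existing objective as `task_loss + lam * ot_cost`, where `lam` is a hypothetical weighting hyperparameter; the transport plan itself can be read as a soft region-to-word matching.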
Related papers
- Toward Real-Time Edge AI: Model-Agnostic Task-Oriented Communication with Visual Feature Alignment [23.796344455232227]
Task-oriented communication presents a promising approach to improve the communication efficiency of edge inference systems.
Real-time applications face practical challenges, such as incomplete coverage and potential malfunctions of edge servers.
This study introduces a novel framework that utilizes shared anchor data across diverse systems.
arXiv Detail & Related papers (2024-12-01T15:52:05Z)
- A Multimodal Approach for Cross-Domain Image Retrieval [5.5547914920738]
Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision.
This paper introduces a novel unsupervised approach to CDIR that incorporates textual context by leveraging pre-trained vision-language models.
Our method, dubbed as Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation.
arXiv Detail & Related papers (2024-03-22T12:08:16Z)
- OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization [65.57380193070574]
Vision-language pre-training models are vulnerable to multi-modal adversarial examples.
Recent works have indicated that leveraging data augmentation and image-text modal interactions can enhance the transferability of adversarial examples.
We propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack.
arXiv Detail & Related papers (2023-12-07T16:16:50Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Marginal Contrastive Correspondence for Guided Image Generation [58.0605433671196]
Exemplar-based image translation establishes dense correspondences between a conditional input and an exemplar from two different domains.
Existing work builds the cross-domain correspondences implicitly by minimizing feature-wise distances across the two domains.
We design a Marginal Contrastive Learning Network (MCL-Net) that explores contrastive learning to learn domain-invariant features for realistic exemplar-based image translation.
arXiv Detail & Related papers (2022-04-01T13:55:44Z)
- Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization [21.904563910555368]
We propose a novel learning framework to construct a set of discriminative data domains within each image-text pair.
Our approach can generally improve the learning efficiency and the performance of existing metrics learning frameworks.
arXiv Detail & Related papers (2020-10-23T01:48:37Z)
- DASGIL: Domain Adaptation for Semantic and Geometric-aware Image-based Localization [27.294822556484345]
Long-term visual localization under changing environments is a challenging problem in autonomous driving and mobile robotics.
We propose a novel multi-task architecture to fuse the geometric and semantic information into the multi-scale latent embedding representation for visual place recognition.
arXiv Detail & Related papers (2020-10-01T17:44:25Z)
- TSIT: A Simple and Versatile Framework for Image-to-Image Translation [103.92203013154403]
We introduce a simple and versatile framework for image-to-image translation.
We provide a carefully designed two-stream generative model with newly proposed feature transformations.
This allows multi-scale semantic structure information and style representation to be effectively captured and fused by the network.
A systematic study compares the proposed method with several state-of-the-art task-specific baselines, verifying its effectiveness in both perceptual quality and quantitative evaluations.
arXiv Detail & Related papers (2020-07-23T15:34:06Z)
- A Flexible Framework for Designing Trainable Priors with Adaptive Smoothing and Game Encoding [57.1077544780653]
We introduce a general framework for designing and training neural network layers whose forward passes can be interpreted as solving non-smooth convex optimization problems.
We focus on convex games, solved by local agents represented by the nodes of a graph and interacting through regularization functions.
This approach is appealing for solving imaging problems, as it allows the use of classical image priors within deep models that are trainable end to end.
arXiv Detail & Related papers (2020-06-26T08:34:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.