COBRA: Contrastive Bi-Modal Representation Algorithm
- URL: http://arxiv.org/abs/2005.03687v2
- Date: Sun, 24 May 2020 20:07:52 GMT
- Title: COBRA: Contrastive Bi-Modal Representation Algorithm
- Authors: Vishaal Udandarao, Abhishek Maiti, Deepak Srivatsav, Suryatej Reddy Vyalla, Yifang Yin, Rajiv Ratn Shah
- Abstract summary: We present a novel framework that aims to train two modalities in a joint fashion inspired by Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE) paradigms.
We empirically show that this framework reduces the modality gap significantly and generates a robust and task-agnostic joint-embedding space.
We outperform existing work on four diverse downstream tasks spanning seven benchmark cross-modal datasets.
- Score: 43.33840912256077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is a wide range of applications that involve multi-modal data, such as
cross-modal retrieval, visual question-answering, and image captioning. Such
applications are primarily dependent on aligned distributions of the different
constituent modalities. Existing approaches generate latent embeddings for each
modality in a joint fashion by representing them in a common manifold. However,
these joint embedding spaces fail to sufficiently reduce the modality gap,
which affects the performance in downstream tasks. We hypothesize that these
embeddings retain the intra-class relationships but are unable to preserve the
inter-class dynamics. In this paper, we present a novel framework COBRA that
aims to train two modalities (image and text) in a joint fashion inspired by
the Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE)
paradigms which preserve both inter- and intra-class relationships. We
empirically show that this framework reduces the modality gap significantly and
generates a robust and task-agnostic joint-embedding space. We outperform
existing work on four diverse downstream tasks spanning seven benchmark
cross-modal datasets.
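As a rough illustration of the NCE-style contrastive objectives the abstract refers to, the sketch below computes a generic InfoNCE loss over a batch of paired image and text embeddings. It is not COBRA's actual loss, which the abstract does not specify; the function name `info_nce`, the `temperature` value, and the toy vectors are all assumptions made for illustration.

```python
import math


def info_nce(image_vecs, text_vecs, temperature=0.1):
    """Mean InfoNCE loss; text_vecs[i] is treated as the positive for image_vecs[i].

    Hypothetical helper for illustration only, not taken from the COBRA paper.
    """
    def normalize(v):
        # Scale to unit length so dot products become cosine similarities.
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    images = [normalize(v) for v in image_vecs]
    texts = [normalize(v) for v in text_vecs]

    total = 0.0
    for i, img in enumerate(images):
        # Similarity of this image to every text in the batch (the other
        # texts act as the "noise" samples).
        logits = [sum(a * b for a, b in zip(img, txt)) / temperature
                  for txt in texts]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the matching text.
        total += log_denom - logits[i]
    return total / len(images)


# Toy batch: two image embeddings and their text counterparts.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, `aligned` should come out much smaller than `shuffled`, since the loss drops when each image embedding sits closest to its own paired text embedding.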
Related papers
- Bridging the Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching [10.709744162565274]
We propose a novel method called DIAS to bridge the modality gap from two aspects.
The method achieves 4.3%-10.2% rSum improvements on Flickr30k and MSCOCO benchmarks.
arXiv Detail & Related papers (2024-10-22T09:37:29Z)
- Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching [53.05954114863596]
We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching.
An anchor branch is first trained to provide insights into the data properties.
A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
arXiv Detail & Related papers (2024-04-28T08:44:28Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Preserving Modality Structure Improves Multi-Modal Learning [64.10085674834252]
Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings without relying on human annotations.
These methods often struggle to generalize well on out-of-domain data as they ignore the semantic structure present in modality-specific embeddings.
We propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space.
arXiv Detail & Related papers (2023-08-24T20:46:48Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Key to effective video representations is cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z)
- Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization [7.4262579052708535]
We argue that this effect is a consequence of conflicting gradients during multimodal VAE training.
We show how to detect the sub-graphs in the computational graphs where gradients conflict.
We empirically show that our framework significantly improves the reconstruction performance, conditional generation, and coherence of the latent space across modalities.
arXiv Detail & Related papers (2022-06-09T13:29:25Z)
- VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification [3.7798600249187295]
Multimodal learning from document data has achieved great success lately, as it allows pre-training semantically meaningful features as a prior for a learnable downstream task.
In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues.
The proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities.
arXiv Detail & Related papers (2022-05-24T12:28:12Z)
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.