How do Cross-View and Cross-Modal Alignment Affect Representations in
Contrastive Learning?
- URL: http://arxiv.org/abs/2211.13309v1
- Date: Wed, 23 Nov 2022 21:26:25 GMT
- Title: How do Cross-View and Cross-Modal Alignment Affect Representations in
Contrastive Learning?
- Authors: Thomas M. Hehn, Julian F.P. Kooij, Dariu M. Gavrila
- Abstract summary: Cross-modal representation alignment discards complementary visual information, such as color and texture, and instead emphasizes redundant depth cues.
Overall, cross-modal alignment leads to more robust encoders than pre-training by cross-view alignment.
- Score: 8.594140167290098
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Various state-of-the-art self-supervised visual representation learning
approaches take advantage of data from multiple sensors by aligning the feature
representations across views and/or modalities. In this work, we investigate
how aligning representations affects the visual features obtained from
cross-view and cross-modal contrastive learning on images and point clouds. On
five real-world datasets and on five tasks, we train and evaluate 108 models
based on four pretraining variations. We find that cross-modal representation
alignment discards complementary visual information, such as color and texture,
and instead emphasizes redundant depth cues. The depth cues obtained from
pretraining improve downstream depth prediction performance. Overall,
cross-modal alignment also leads to more robust encoders than pretraining by
cross-view alignment, especially on depth prediction, instance segmentation,
and object detection.
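Below is a minimal sketch, not the authors' released code, of the kind of cross-modal alignment objective the abstract refers to: a symmetric InfoNCE loss that pulls together image and point-cloud embeddings of the same scene. The function name, tensor shapes, and temperature value are illustrative assumptions.

```python
# Minimal sketch of cross-modal contrastive alignment (InfoNCE), assuming
# paired (B, D) embeddings from an image encoder and a point-cloud encoder.
# Names and hyperparameters are illustrative, not the paper's exact setup.
import torch
import torch.nn.functional as F

def cross_modal_infonce(img_emb: torch.Tensor,
                        pcl_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so dot products become cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    pcl_emb = F.normalize(pcl_emb, dim=-1)
    # (B, B) similarity matrix; the diagonal holds the positive pairs.
    logits = img_emb @ pcl_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric loss: image -> point cloud and point cloud -> image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    loss = cross_modal_infonce(torch.randn(8, 128), torch.randn(8, 128))
    print(f"demo loss: {loss.item():.3f}")
```

Cross-view alignment follows the same recipe, except that the two embeddings come from two augmented views of the same image rather than from two different sensors.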
Related papers
- Learning and Transferring Better with Depth Information in Visual Reinforcement Learning [2.1944315483245465]
A visual backbone based on the vision transformer is proposed to fuse RGB and depth modalities for enhancing generalization.
Different modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to the scalable vision transformer.
For sim2real transfer, a flexible curriculum learning schedule is developed to deploy domain randomization over training processes.
arXiv Detail & Related papers (2025-07-12T07:58:02Z) - Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning [41.81009725976217]
We provide semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework.
We demonstrate notable improvements over ViTs in learned representation quality across text-to-image and image-to-text retrieval tasks.
arXiv Detail & Related papers (2024-05-26T01:46:22Z) - Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation
Learning of Vision-based Autonomous Driving [73.3702076688159]
We propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence.
We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks.
arXiv Detail & Related papers (2024-02-23T19:43:01Z) - What Makes Pre-Trained Visual Representations Successful for Robust
Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z) - HighlightMe: Detecting Highlights from Human-Centric Videos [52.84233165201391]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement over state-of-the-art methods in the mean average precision of matching human-annotated highlights.
arXiv Detail & Related papers (2021-10-05T01:18:15Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Object-aware Contrastive Learning for Debiased Scene Representation [74.30741492814327]
We develop a novel object-aware contrastive learning framework that localizes objects in a self-supervised manner.
We also introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases during contrastive self-supervised learning (a minimal background-mixup sketch follows after this list).
arXiv Detail & Related papers (2021-07-30T19:24:07Z) - Variational Structured Attention Networks for Deep Visual Representation
Learning [49.80498066480928]
We propose a unified deep framework to jointly learn both spatial attention maps and channel attention in a principled manner.
Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework.
We implement the inference rules within the neural network, thus allowing for end-to-end learning of the probabilistic and the CNN front-end parameters.
arXiv Detail & Related papers (2021-03-05T07:37:24Z) - Learning Visual Representations for Transfer Learning by Suppressing
Texture [38.901410057407766]
In self-supervised learning, texture as a low-level cue may provide shortcuts that prevent the network from learning higher-level representations.
We propose to use classic methods based on anisotropic diffusion to augment training using images with suppressed texture (a minimal diffusion sketch follows after this list).
We empirically show that our method achieves state-of-the-art results on object detection and image classification.
arXiv Detail & Related papers (2020-11-03T18:27:03Z)
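The Object-aware Contrastive Learning entry above mentions a background-mixup augmentation. The snippet below is a minimal sketch of that idea under the assumption that a binary object mask is already available, e.g. from ContraCAM; the function name and blending scheme are illustrative, not the paper's exact implementation.

```python
# Minimal sketch of background mixup: keep the masked object region intact
# and blend only the background with another image from the batch.
# Assumes float images in [0, 1] of shape (H, W, C) and a binary mask (H, W).
import numpy as np

def background_mixup(img: np.ndarray, other: np.ndarray,
                     obj_mask: np.ndarray, lam: float = 0.5) -> np.ndarray:
    mask = obj_mask.astype(np.float32)[..., None]   # (H, W, 1), 1 = object
    mixed_bg = lam * img + (1.0 - lam) * other      # blend the backgrounds
    return mask * img + (1.0 - mask) * mixed_bg
```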
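The texture-suppression entry above builds on anisotropic diffusion. The sketch below is a plain Perona-Malik filter in NumPy that smooths fine texture while preserving strong edges; the iteration count, kappa (tuned here for a 0-255 intensity range), and step size are illustrative defaults rather than the paper's settings, and the wrap-around boundary handling via np.roll is a simplification.

```python
# Minimal sketch of Perona-Malik anisotropic diffusion for texture suppression.
# Works on a single-channel float image; apply per channel for RGB inputs.
import numpy as np

def perona_malik(img: np.ndarray, n_iter: int = 20, kappa: float = 25.0,
                 gamma: float = 0.2) -> np.ndarray:
    out = img.astype(np.float64)
    for _ in range(n_iter):
        # Differences toward the four neighbours (periodic boundary via roll).
        d_n = np.roll(out, -1, axis=0) - out
        d_s = np.roll(out, 1, axis=0) - out
        d_e = np.roll(out, -1, axis=1) - out
        d_w = np.roll(out, 1, axis=1) - out
        # Edge-stopping conduction coefficients (exponential variant).
        c_n = np.exp(-(d_n / kappa) ** 2)
        c_s = np.exp(-(d_s / kappa) ** 2)
        c_e = np.exp(-(d_e / kappa) ** 2)
        c_w = np.exp(-(d_w / kappa) ** 2)
        out = out + gamma * (c_n * d_n + c_s * d_s + c_e * d_e + c_w * d_w)
    return out
```

Texture-suppressed copies produced this way would be used alongside the original images during pretraining, in line with the augmentation strategy described in that entry.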
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.