How do Cross-View and Cross-Modal Alignment Affect Representations in
Contrastive Learning?
- URL: http://arxiv.org/abs/2211.13309v1
- Date: Wed, 23 Nov 2022 21:26:25 GMT
- Title: How do Cross-View and Cross-Modal Alignment Affect Representations in
Contrastive Learning?
- Authors: Thomas M. Hehn, Julian F.P. Kooij, Dariu M. Gavrila
- Abstract summary: Cross-modal representation alignment discards complementary visual information, such as color and texture, and instead emphasizes redundant depth cues.
Overall, cross-modal alignment leads to more robust encoders than pre-training by cross-view alignment.
- Score: 8.594140167290098
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Various state-of-the-art self-supervised visual representation learning
approaches take advantage of data from multiple sensors by aligning the feature
representations across views and/or modalities. In this work, we investigate
how aligning representations affects the visual features obtained from
cross-view and cross-modal contrastive learning on images and point clouds. On
five real-world datasets and on five tasks, we train and evaluate 108 models
based on four pretraining variations. We find that cross-modal representation
alignment discards complementary visual information, such as color and texture,
and instead emphasizes redundant depth cues. The depth cues obtained from
pretraining improve downstream depth prediction performance. Overall,
cross-modal alignment also leads to more robust encoders than pretraining by
cross-view alignment, especially on depth prediction, instance segmentation,
and object detection.
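Below is a minimal sketch, not the authors' released code, of the kind of cross-modal alignment objective the abstract refers to: a symmetric InfoNCE loss that pulls together image and point-cloud embeddings of the same scene. The function name, tensor shapes, and temperature value are illustrative assumptions.

```python
# Minimal sketch of cross-modal contrastive alignment (InfoNCE), assuming
# paired (B, D) embeddings from an image encoder and a point-cloud encoder.
# Names and hyperparameters are illustrative, not the paper's exact setup.
import torch
import torch.nn.functional as F

def cross_modal_infonce(img_emb: torch.Tensor,
                        pcl_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so dot products become cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    pcl_emb = F.normalize(pcl_emb, dim=-1)
    # (B, B) similarity matrix; the diagonal holds the positive pairs.
    logits = img_emb @ pcl_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric loss: image -> point cloud and point cloud -> image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    loss = cross_modal_infonce(torch.randn(8, 128), torch.randn(8, 128))
    print(f"demo loss: {loss.item():.3f}")
```

Cross-view alignment follows the same recipe, except that the two embeddings come from two augmented views of the same image rather than from two different sensors.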
Related papers
- Learning and Transferring Better with Depth Information in Visual Reinforcement Learning [2.1944315483245465]
A visual backbone based on the vision transformer is proposed to fuse RGB and depth modalities for enhancing generalization.
Different modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to the scalable vision transformer.
For sim2real transfer, a flexible curriculum learning schedule is developed to deploy domain randomization over training processes.
arXiv Detail & Related papers (2025-07-12T07:58:02Z) - Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning [41.81009725976217]
We provide semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework.
We demonstrate notable improvements over ViTs in learned representation quality across text-to-image and image-to-text retrieval tasks.
arXiv Detail & Related papers (2024-05-26T01:46:22Z) - Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation
Learning of Vision-based Autonomous Driving [73.3702076688159]
We propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence.
We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks.
arXiv Detail & Related papers (2024-02-23T19:43:01Z) - What Makes Pre-Trained Visual Representations Successful for Robust
Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z) - HighlightMe: Detecting Highlights from Human-Centric Videos [52.84233165201391]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement over state-of-the-art methods in the mean average precision of matching human-annotated highlights.
arXiv Detail & Related papers (2021-10-05T01:18:15Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Object-aware Contrastive Learning for Debiased Scene Representation [74.30741492814327]
We develop a novel object-aware contrastive learning framework that localizes objects in a self-supervised manner.
We also introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases during contrastive self-supervised learning (a minimal background-mixup sketch follows after this list).
arXiv Detail & Related papers (2021-07-30T19:24:07Z) - Variational Structured Attention Networks for Deep Visual Representation
Learning [49.80498066480928]
We propose a unified deep framework to jointly learn both spatial attention maps and channel attention in a principled manner.
Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework.
We implement the inference rules within the neural network, thus allowing for end-to-end learning of the probabilistic and the CNN front-end parameters.
arXiv Detail & Related papers (2021-03-05T07:37:24Z) - Learning Visual Representations for Transfer Learning by Suppressing
Texture [38.901410057407766]
In self-supervised learning, texture as a low-level cue may provide shortcuts that prevent the network from learning higher-level representations.
We propose to use classic methods based on anisotropic diffusion to augment training using images with suppressed texture (a minimal diffusion sketch follows after this list).
We empirically show that our method achieves state-of-the-art results on object detection and image classification.
arXiv Detail & Related papers (2020-11-03T18:27:03Z)
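The Object-aware Contrastive Learning entry above mentions a background-mixup augmentation. The snippet below is a minimal sketch of that idea under the assumption that a binary object mask is already available, e.g. from ContraCAM; the function name and blending scheme are illustrative, not the paper's exact implementation.

```python
# Minimal sketch of background mixup: keep the masked object region intact
# and blend only the background with another image from the batch.
# Assumes float images in [0, 1] of shape (H, W, C) and a binary mask (H, W).
import numpy as np

def background_mixup(img: np.ndarray, other: np.ndarray,
                     obj_mask: np.ndarray, lam: float = 0.5) -> np.ndarray:
    mask = obj_mask.astype(np.float32)[..., None]   # (H, W, 1), 1 = object
    mixed_bg = lam * img + (1.0 - lam) * other      # blend the backgrounds
    return mask * img + (1.0 - mask) * mixed_bg
```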
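The texture-suppression entry above builds on anisotropic diffusion. The sketch below is a plain Perona-Malik filter in NumPy that smooths fine texture while preserving strong edges; the iteration count, kappa (tuned here for a 0-255 intensity range), and step size are illustrative defaults rather than the paper's settings, and the wrap-around boundary handling via np.roll is a simplification.

```python
# Minimal sketch of Perona-Malik anisotropic diffusion for texture suppression.
# Works on a single-channel float image; apply per channel for RGB inputs.
import numpy as np

def perona_malik(img: np.ndarray, n_iter: int = 20, kappa: float = 25.0,
                 gamma: float = 0.2) -> np.ndarray:
    out = img.astype(np.float64)
    for _ in range(n_iter):
        # Differences toward the four neighbours (periodic boundary via roll).
        d_n = np.roll(out, -1, axis=0) - out
        d_s = np.roll(out, 1, axis=0) - out
        d_e = np.roll(out, -1, axis=1) - out
        d_w = np.roll(out, 1, axis=1) - out
        # Edge-stopping conduction coefficients (exponential variant).
        c_n = np.exp(-(d_n / kappa) ** 2)
        c_s = np.exp(-(d_s / kappa) ** 2)
        c_e = np.exp(-(d_e / kappa) ** 2)
        c_w = np.exp(-(d_w / kappa) ** 2)
        out = out + gamma * (c_n * d_n + c_s * d_s + c_e * d_e + c_w * d_w)
    return out
```

Texture-suppressed copies produced this way would be used alongside the original images during pretraining, in line with the augmentation strategy described in that entry.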
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.