Using Navigational Information to Learn Visual Representations
- URL: http://arxiv.org/abs/2202.08114v1
- Date: Thu, 10 Feb 2022 20:17:55 GMT
- Title: Using Navigational Information to Learn Visual Representations
- Authors: Lizhen Zhu, Brad Wyble, James Z. Wang
- Abstract summary: We show that using spatial and temporal information in the pretraining stage of contrastive learning can improve the performance of downstream classification.
This work reveals the effectiveness and efficiency of contextual information for improving representation learning.
- Score: 7.747924294389427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Children learn to build a visual representation of the world from
unsupervised exploration and we hypothesize that a key part of this learning
ability is the use of self-generated navigational information as a similarity
label to drive a learning objective for self-supervised learning. The goal of
this work is to exploit navigational information in a visual environment to
achieve training performance that exceeds that of state-of-the-art
self-supervised methods. Here, we show that using spatial and temporal
information in the pretraining stage of contrastive learning can improve the
performance of downstream classification relative to conventional contrastive
learning approaches that rely on instance discrimination, i.e., distinguishing
two alterations of the same image from two different images. We designed a
pipeline to generate egocentric-vision images from a photorealistic ray-tracing
environment (ThreeDWorld) and record relevant navigational information for each
image. Modifying the Momentum Contrast (MoCo) model, we introduced spatial and
temporal information to evaluate the similarity of two views in the pretraining
stage instead of instance discrimination. This work reveals the effectiveness
and efficiency of contextual information for improving representation learning.
The work informs our understanding of the means by which children might learn
to see the world without external supervision.
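
The approach can be sketched concretely: rather than defining positives purely by instance discrimination (two augmentations of the same image), frames recorded close together in the agent's trajectory, in both space and time, are also treated as positives. The following is a minimal illustrative sketch in PyTorch, not the authors' implementation; the function names, tensor shapes, and distance/time thresholds are assumptions for demonstration only.

```python
# Minimal sketch (assumptions, not the paper's code): navigational metadata
# (agent position and timestamp per frame) defines which pairs count as
# "similar" in a MoCo-style contrastive objective, replacing pure instance
# discrimination. Thresholds and names are hypothetical.
import torch
import torch.nn.functional as F


def navigational_positive_mask(positions, timestamps, max_dist=1.0, max_dt=2.0):
    """Build an (N, N) boolean mask marking frame pairs that are close in
    both space and time. positions: (N, 2) or (N, 3), timestamps: (N,)."""
    dists = torch.cdist(positions, positions)                 # pairwise distances
    dts = (timestamps[:, None] - timestamps[None, :]).abs()   # pairwise time gaps
    return (dists < max_dist) & (dts < max_dt)


def spatiotemporal_info_nce(q, k, positive_mask, temperature=0.07):
    """InfoNCE-style loss where the positives for each query are all keys
    flagged by the navigational mask; the diagonal (a frame's own
    momentum-encoded key) is always kept as a positive.

    q: (N, D) query embeddings, k: (N, D) key embeddings.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature          # (N, N) similarity matrix
    log_prob = F.log_softmax(logits, dim=1)
    pos = positive_mask | torch.eye(len(q), dtype=torch.bool, device=q.device)
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1)
    return loss.mean()


# Usage sketch: q and k would come from a query encoder and a momentum
# encoder, as in MoCo; only the definition of "positive" changes.
if __name__ == "__main__":
    N, D = 8, 128
    q = torch.randn(N, D)
    k = torch.randn(N, D)
    positions = torch.rand(N, 2) * 5.0        # agent (x, y) in the environment
    timestamps = torch.arange(N, dtype=torch.float32)
    mask = navigational_positive_mask(positions, timestamps)
    print(spatiotemporal_info_nce(q, k, mask).item())
```

In a full MoCo setup the keys would also pass through a momentum encoder and a queue; the sketch keeps only the part where navigational metadata replaces instance identity as the similarity label.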
Related papers
- Exploiting Contextual Uncertainty of Visual Data for Efficient Training of Deep Models [0.65268245109828]
We introduce the notion of contextual diversity for active learning (CDAL).
We propose a data repair algorithm to curate contextually fair data to reduce model bias.
We are developing an image retrieval system for wildlife camera-trap images and a reliable warning system for poor-quality rural roads.
arXiv Detail & Related papers (2024-11-04T09:43:33Z) - Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps.
Ego²-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z) - A Computational Account Of Self-Supervised Visual Learning From Egocentric Object Play [3.486683381782259]
We study how learning signals that equate different viewpoints can support robust visual learning.
We find that representations learned by equating different physical viewpoints of an object benefit downstream image classification accuracy.
arXiv Detail & Related papers (2023-05-30T22:42:03Z) - Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z) - Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
We study whether integrating visual knowledge into a language model can fill the gap.
Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
arXiv Detail & Related papers (2022-03-14T22:02:40Z) - Focus on the Positives: Self-Supervised Learning for Biodiversity Monitoring [9.086207853136054]
We address the problem of learning self-supervised representations from unlabeled image collections.
We exploit readily available context data that encodes information such as the spatial and temporal relationships between the input images.
For the critical task of global biodiversity monitoring, this results in image features that can be adapted to challenging visual species classification tasks with limited human supervision.
arXiv Detail & Related papers (2021-08-14T01:12:41Z) - Curious Representation Learning for Embodied Intelligence [81.21764276106924]
Self-supervised representation learning has achieved remarkable success in recent years.
Yet to build truly intelligent agents, we must construct representation learning algorithms that can learn from environments.
We propose a framework, curious representation learning, which jointly learns a reinforcement learning policy and a visual representation model.
arXiv Detail & Related papers (2021-05-03T17:59:20Z) - Factors of Influence for Transfer Learning across Diverse Appearance Domains and Task Types [50.1843146606122]
A simple form of transfer learning is common in current state-of-the-art computer vision models.
Previous systematic studies of transfer learning have been limited and the circumstances in which it is expected to work are not fully understood.
In this paper we carry out an extensive experimental exploration of transfer learning across vastly different image domains.
arXiv Detail & Related papers (2021-03-24T16:24:20Z) - Geography-Aware Self-Supervised Learning [79.4009241781968]
We show that due to their different characteristics, a non-trivial gap persists between contrastive and supervised learning on standard benchmarks.
We propose novel training methods that exploit the spatially aligned structure of remote sensing data.
Our experiments show that our proposed method closes the gap between contrastive and supervised learning on image classification, object detection and semantic segmentation for remote sensing.
arXiv Detail & Related papers (2020-11-19T17:29:13Z) - Learning to Visually Navigate in Photorealistic Environments Without any Supervision [37.22924101745505]
We introduce a novel approach for learning to navigate from image inputs without external supervision or reward.
Our approach consists of three stages: learning a good representation of first-person views, then learning to explore using memory, and finally learning to navigate by setting its own goals.
We show the benefits of our approach by training an agent to navigate challenging photo-realistic environments from the Gibson dataset with RGB inputs only.
arXiv Detail & Related papers (2020-04-10T08:59:32Z)