What Can You Learn from Your Muscles? Learning Visual Representation
from Human Interactions
- URL: http://arxiv.org/abs/2010.08539v2
- Date: Sat, 6 Mar 2021 19:28:58 GMT
- Title: What Can You Learn from Your Muscles? Learning Visual Representation
from Human Interactions
- Authors: Kiana Ehsani, Daniel Gordon, Thomas Nguyen, Roozbeh Mottaghi, Ali
Farhadi
- Abstract summary: We use human interaction and attention cues to investigate whether we can learn better representations than visual-only ones.
Our experiments show that our "muscly-supervised" representation outperforms MoCo, a visual-only state-of-the-art method.
- Score: 50.435861435121915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning effective representations of visual data that generalize to a
variety of downstream tasks has been a long quest for computer vision. Most
representation learning approaches rely solely on visual data such as images or
videos. In this paper, we explore a novel approach, where we use human
interaction and attention cues to investigate whether we can learn better
representations compared to visual-only representations. For this study, we
collect a dataset of human interactions capturing body part movements and gaze
in their daily lives. Our experiments show that our "muscly-supervised"
representation that encodes interaction and attention cues outperforms a
visual-only state-of-the-art method MoCo (He et al., 2020), on a variety of
target tasks: scene classification (semantic), action recognition (temporal),
depth estimation (geometric), dynamics prediction (physics) and walkable
surface estimation (affordance). Our code and dataset are available at:
https://github.com/ehsanik/muscleTorch.
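To make the setup concrete, the following is a minimal sketch of such a multi-task, "muscly-supervised" objective: a shared visual encoder trained to predict which body parts moved and where the person looked. Module names, label formats, and the architecture are illustrative assumptions, not the authors' muscleTorch implementation.

```python
import torch
import torch.nn as nn

class MusclySupervisedNet(nn.Module):
    """Sketch: shared visual encoder with heads for human-interaction cues."""
    def __init__(self, num_body_parts=8, feat_dim=128):
        super().__init__()
        # Small convolutional backbone standing in for a full ResNet encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Head 1: which body parts moved between frames (multi-label).
        self.movement_head = nn.Linear(feat_dim, num_body_parts)
        # Head 2: 2D gaze location in normalized image coordinates.
        self.gaze_head = nn.Linear(feat_dim, 2)

    def forward(self, frames):
        z = self.encoder(frames)
        return self.movement_head(z), self.gaze_head(z)

model = MusclySupervisedNet()
frames = torch.randn(4, 3, 224, 224)         # batch of video frames
moved = torch.randint(0, 2, (4, 8)).float()  # hypothetical body-part movement labels
gaze = torch.rand(4, 2)                      # hypothetical gaze targets

movement_logits, gaze_pred = model(frames)
loss = (nn.functional.binary_cross_entropy_with_logits(movement_logits, moved)
        + nn.functional.mse_loss(gaze_pred, gaze))
loss.backward()
```

After pre-training on such interaction signals, only the encoder would be kept and transferred to the downstream tasks listed above, mirroring how MoCo features are evaluated.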
Related papers
- Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans.
This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations.
In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z)
- Neural feels with neural fields: Visuo-tactile perception for in-hand manipulation [57.60490773016364]
We combine vision and touch sensing on a multi-fingered hand to estimate an object's pose and shape during in-hand manipulation.
Our method, NeuralFeels, encodes object geometry by learning a neural field online and jointly tracks it by optimizing a pose graph problem.
Our results demonstrate that touch, at the very least, refines and, at the very best, disambiguates visual estimates during in-hand manipulation.
arXiv Detail & Related papers (2023-12-20T22:36:37Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework that adopts both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
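As an illustration of combining supervised and self-supervised pretext objectives in a single multi-task loss, here is a minimal sketch under assumed tasks (image classification plus rotation prediction); these are not necessarily the pretext tasks used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared backbone with a supervised head (class labels) and a
# self-supervised head (predict which of 4 rotations was applied).
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64), nn.ReLU(),
)
cls_head = nn.Linear(64, 10)   # supervised pretext: 10 hypothetical classes
rot_head = nn.Linear(64, 4)    # self-supervised pretext: 0/90/180/270 degrees

images = torch.randn(8, 3, 64, 64)
labels = torch.randint(0, 10, (8,))
rot_ids = torch.randint(0, 4, (8,))
rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                       for img, k in zip(images, rot_ids)])

sup_loss = F.cross_entropy(cls_head(backbone(images)), labels)
ssl_loss = F.cross_entropy(rot_head(backbone(rotated)), rot_ids)
(sup_loss + ssl_loss).backward()
```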
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- Brief Introduction to Contrastive Learning Pretext Tasks for Visual Representation [0.0]
We introduce contrastive learning, a subset of unsupervised learning methods.
The purpose of contrastive learning is to embed augmented views of the same sample close to each other while pushing apart embeddings of different samples.
We offer some strategies from contrastive learning that have recently been published and are focused on pretext tasks for visual representation.
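A typical instance of this objective is the InfoNCE loss used by contrastive methods such as SimCLR and MoCo; the sketch below is a generic formulation, not code from this survey.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Contrastive (InfoNCE) loss: two augmented views of the same image
    are pulled together; all other images in the batch act as negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(z1.size(0))   # positive pair sits on the diagonal
    return F.cross_entropy(logits, targets)

# Embeddings of two augmentations of the same 16 images (e.g., from an encoder).
view1, view2 = torch.randn(16, 128), torch.randn(16, 128)
loss = info_nce(view1, view2)
```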
arXiv Detail & Related papers (2022-10-06T18:54:10Z)
- Playful Interactions for Representation Learning [82.59215739257104]
We propose to use playful interactions in a self-supervised manner to learn visual representations for downstream tasks.
We collect 2 hours of playful data in 19 diverse environments and use self-predictive learning to extract visual representations.
Our representations generalize better than standard behavior cloning and can achieve similar performance with only half the number of required demonstrations.
arXiv Detail & Related papers (2021-07-19T17:54:48Z)
- Physion: Evaluating Physical Prediction from Vision in Humans and Machines [46.19008633309041]
We present a visual and physical prediction benchmark that precisely measures this capability.
We compare an array of algorithms on their ability to make diverse physical predictions.
We find that graph neural networks with access to the physical state best capture human behavior.
arXiv Detail & Related papers (2021-06-15T16:13:39Z)
- Imitation Learning with Human Eye Gaze via Multi-Objective Prediction [3.5779268406205618]
We propose Gaze Regularized Imitation Learning (GRIL), a novel context-aware imitation learning architecture.
GRIL learns concurrently from both human demonstrations and eye gaze to solve tasks where visual attention provides important context.
We show that GRIL outperforms several state-of-the-art gaze-based imitation learning algorithms, simultaneously learns to predict human visual attention, and generalizes to scenarios not present in the training data.
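A minimal sketch of this kind of multi-objective imitation loss is shown below; the observation features, layer sizes, and loss weighting are assumptions for illustration, not the GRIL implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeRegularizedPolicy(nn.Module):
    """Sketch: one encoder, two heads -- an action head for behavior cloning
    and a gaze head that predicts where the human demonstrator looked."""
    def __init__(self, obs_dim=64, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.action_head = nn.Linear(128, action_dim)
        self.gaze_head = nn.Linear(128, 2)   # normalized (x, y) gaze point

    def forward(self, obs):
        h = self.encoder(obs)
        return self.action_head(h), self.gaze_head(h)

policy = GazeRegularizedPolicy()
obs = torch.randn(32, 64)          # hypothetical observation features
demo_actions = torch.randn(32, 4)  # demonstrated actions
demo_gaze = torch.rand(32, 2)      # recorded eye-gaze targets

pred_actions, pred_gaze = policy(obs)
loss = F.mse_loss(pred_actions, demo_actions) + 0.5 * F.mse_loss(pred_gaze, demo_gaze)
loss.backward()
```

Predicting gaze alongside actions acts as a regularizer: the shared encoder is pushed to attend to the regions the demonstrator found task-relevant.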
arXiv Detail & Related papers (2021-02-25T17:13:13Z)
- VisualEchoes: Spatial Image Representation Learning through Echolocation [97.23789910400387]
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation.
We propose a novel interaction-based representation learning framework that learns useful visual features via echolocation.
Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world.
arXiv Detail & Related papers (2020-05-04T16:16:58Z)
- Active Perception and Representation for Robotic Manipulation [0.8315801422499861]
We present a framework that leverages the benefits of active perception to accomplish manipulation tasks.
Our agent uses viewpoint changes to localize objects, to learn state representations in a self-supervised manner, and to perform goal-directed actions.
Compared to vanilla deep Q-learning algorithms, our model is at least four times more sample-efficient.
arXiv Detail & Related papers (2020-03-15T01:43:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.