A Framework for Multisensory Foresight for Embodied Agents
- URL: http://arxiv.org/abs/2109.07561v1
- Date: Wed, 15 Sep 2021 20:20:04 GMT
- Title: A Framework for Multisensory Foresight for Embodied Agents
- Authors: Xiaohui Chen, Ramtin Hosseini, Karen Panetta, Jivko Sinapov
- Abstract summary: Predicting future sensory states is crucial for learning agents such as robots, drones, and autonomous vehicles.
In this paper, we couple multiple sensory modalities with exploratory actions and propose a predictive neural network architecture to address this problem.
The framework was tested and validated with a dataset containing 4 sensory modalities (vision, haptic, audio, and tactile) on a humanoid robot performing 9 behaviors multiple times on a large set of objects.
- Score: 11.351546861334292
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting future sensory states is crucial for learning agents such as
robots, drones, and autonomous vehicles. In this paper, we couple multiple
sensory modalities with exploratory actions and propose a predictive neural
network architecture to address this problem. Most existing approaches rely on
large, manually annotated datasets, or only use visual data as a single
modality. In contrast, the unsupervised method presented here uses multi-modal
perceptions for predicting future visual frames. As a result, the proposed
model is more comprehensive and can better capture the spatio-temporal dynamics
of the environment, leading to more accurate visual frame prediction. The other
novelty of our framework is the use of sub-networks dedicated to anticipating
future haptic, audio, and tactile signals. The framework was tested and
validated with a dataset containing 4 sensory modalities (vision, haptic,
audio, and tactile) on a humanoid robot performing 9 behaviors multiple times
on a large set of objects. While visual information is the dominant modality,
utilizing the additional non-visual modalities improves the accuracy of
predictions.
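The abstract only outlines the architecture, so the following is a minimal PyTorch sketch of one plausible realization rather than the authors' implementation: a shared recurrent core fuses encoded vision, haptic, audio, and tactile features with the exploratory action, and dedicated sub-networks anticipate the future non-visual signals alongside the next visual frame. All class names, feature dimensions, and layer sizes below are illustrative assumptions, not details from the paper.

```python
# Minimal sketch (not the authors' code) of an action-conditioned,
# multi-modal predictor in the spirit of the abstract.
import torch
import torch.nn as nn


class MultisensoryForesightSketch(nn.Module):
    def __init__(self, haptic_dim=64, audio_dim=128, tactile_dim=32,
                 action_dim=9, hidden_dim=256):
        super().__init__()
        # Convolutional encoder maps a 64x64 RGB frame to a feature vector.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, hidden_dim),
        )
        # Shared recurrent core over the fused multi-modal + action input.
        fused_dim = hidden_dim + haptic_dim + audio_dim + tactile_dim + action_dim
        self.core = nn.LSTM(fused_dim, hidden_dim, batch_first=True)
        # Decoder that reconstructs the next visual frame from the core state.
        self.frame_decoder = nn.Sequential(
            nn.Linear(hidden_dim, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        # Dedicated sub-networks anticipating the future non-visual signals.
        self.haptic_head = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                         nn.ReLU(), nn.Linear(hidden_dim, haptic_dim))
        self.audio_head = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                        nn.ReLU(), nn.Linear(hidden_dim, audio_dim))
        self.tactile_head = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                          nn.ReLU(), nn.Linear(hidden_dim, tactile_dim))

    def forward(self, frames, haptic, audio, tactile, action):
        # frames: (B, T, 3, 64, 64); haptic/audio/tactile: (B, T, dim)
        # action: (B, action_dim) one-hot exploratory behavior, repeated over time.
        B, T = frames.shape[:2]
        vis = self.frame_encoder(frames.flatten(0, 1)).view(B, T, -1)
        act = action.unsqueeze(1).expand(B, T, -1)
        fused = torch.cat([vis, haptic, audio, tactile, act], dim=-1)
        out, _ = self.core(fused)
        h_last = out[:, -1]  # state after observing the context window
        next_frame = self.frame_decoder(h_last)
        return (next_frame, self.haptic_head(h_last),
                self.audio_head(h_last), self.tactile_head(h_last))


# Usage with random tensors standing in for a short observation window.
model = MultisensoryForesightSketch()
frames = torch.rand(2, 5, 3, 64, 64)
haptic, audio, tactile = torch.rand(2, 5, 64), torch.rand(2, 5, 128), torch.rand(2, 5, 32)
action = torch.eye(9)[torch.tensor([0, 3])]  # one-hot choice among 9 behaviors
frame_pred, haptic_pred, audio_pred, tactile_pred = model(frames, haptic, audio, tactile, action)
print(frame_pred.shape)  # torch.Size([2, 3, 64, 64])
```

The single recurrent core here is just one way to fuse the modalities; the paper's point that the non-visual heads improve frame prediction carries over regardless of the specific fusion mechanism chosen.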
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior.
Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Pedestrian 3D Bounding Box Prediction [83.7135926821794]
We focus on 3D bounding boxes, which give autonomous vehicles reasonable estimates of humans without modeling complex motion details.
We suggest this new problem and present a simple yet effective model for pedestrians' 3D bounding box prediction.
This method follows an encoder-decoder architecture based on recurrent neural networks.
arXiv Detail & Related papers (2022-06-28T17:59:45Z)
- A Variational Graph Autoencoder for Manipulation Action Recognition and Prediction [1.1816942730023883]
We introduce a deep graph autoencoder to jointly learn recognition and prediction of manipulation tasks from symbolic scene graphs.
Our network has a variational autoencoder structure with two branches: one for identifying the input graph type and one for predicting the future graphs.
We benchmark our new model against different state-of-the-art methods on two different datasets, MANIAC and MSRC-9, and show that our proposed model can achieve better performance.
arXiv Detail & Related papers (2021-10-25T21:40:42Z)
- Dynamic Modeling of Hand-Object Interactions via Tactile Sensing [133.52375730875696]
In this work, we employ a high-resolution tactile glove to perform four different interactive activities on a diversified set of objects.
We build our model on a cross-modal learning framework and generate the labels using a visual processing pipeline to supervise the tactile model.
This work takes a step toward dynamics modeling of hand-object interactions from dense tactile sensing.
arXiv Detail & Related papers (2021-09-09T16:04:14Z)
- Physion: Evaluating Physical Prediction from Vision in Humans and Machines [46.19008633309041]
We present a visual and physical prediction benchmark that precisely measures this capability.
We compare an array of algorithms on their ability to make diverse physical predictions.
We find that graph neural networks with access to the physical state best capture human behavior.
arXiv Detail & Related papers (2021-06-15T16:13:39Z)
- Future Frame Prediction for Robot-assisted Surgery [57.18185972461453]
We propose a ternary prior guided variational autoencoder (TPG-VAE) model for future frame prediction in robotic surgical video sequences.
Besides the content distribution, our model learns the motion distribution, which is novel for handling the small movements of surgical tools.
arXiv Detail & Related papers (2021-03-18T15:12:06Z)
- AC-VRNN: Attentive Conditional-VRNN for Multi-Future Trajectory Prediction [30.61190086847564]
We propose a generative architecture for multi-future trajectory predictions based on Conditional Variational Recurrent Neural Networks (C-VRNNs).
Human interactions are modeled with a graph-based attention mechanism enabling an online attentive hidden state refinement of the recurrent estimation.
arXiv Detail & Related papers (2020-05-17T17:21:23Z)
- Knowledge Distillation for Action Anticipation via Label Smoothing [21.457069042129138]
Human capability to anticipate the near future from visual observations and non-verbal cues is essential for developing intelligent systems.
We implement a multi-modal framework based on long short-term memory (LSTM) networks to summarize past observations and make predictions at different time steps.
Experiments show that label smoothing systematically improves performance of state-of-the-art models for action anticipation.
arXiv Detail & Related papers (2020-04-16T15:38:53Z)
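The label-smoothing idea from the preceding entry is simple enough to show directly. Below is a minimal, self-contained sketch (not the authors' code) of a smoothed cross-entropy loss for action anticipation; the class count and smoothing factor are chosen only for illustration.

```python
# Label smoothing: soften the one-hot target so that epsilon probability mass
# is spread uniformly over the other classes before computing cross-entropy.
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, num_classes, epsilon=0.1):
    # Soft targets: (1 - epsilon) on the true class, epsilon spread uniformly.
    one_hot = F.one_hot(targets, num_classes).float()
    soft = one_hot * (1.0 - epsilon) + epsilon / num_classes
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft * log_probs).sum(dim=-1).mean()

logits = torch.randn(4, 10)            # predictions for 4 samples, 10 action classes (illustrative)
targets = torch.tensor([1, 3, 7, 0])   # ground-truth next actions
print(smoothed_cross_entropy(logits, targets, num_classes=10))
```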
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.