Event-based Vision for Early Prediction of Manipulation Actions
- URL: http://arxiv.org/abs/2307.14332v1
- Date: Wed, 26 Jul 2023 17:50:17 GMT
- Title: Event-based Vision for Early Prediction of Manipulation Actions
- Authors: Daniel Deniz and Cornelia Fermuller and Eduardo Ros and Manuel
Rodriguez-Alvarez and Francisco Barranco
- Abstract summary: Neuromorphic visual sensors are artificial retinas that output sequences of events when brightness changes occur in the scene.
In this study, we introduce an event-based dataset on fine-grained manipulation actions.
We also perform an experimental study on the use of transformers for action prediction with events.
- Score: 0.7699714865575189
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neuromorphic visual sensors are artificial retinas that output sequences of
asynchronous events when brightness changes occur in the scene. These sensors
offer many advantages including very high temporal resolution, no motion blur
and smart data compression ideal for real-time processing. In this study, we
introduce an event-based dataset on fine-grained manipulation actions and
perform an experimental study on the use of transformers for action prediction
with events. There is enormous interest in the fields of cognitive robotics and
human-robot interaction in understanding and predicting human actions as early
as possible. Early prediction allows anticipating complex stages for planning,
enabling effective and real-time interaction. Our Transformer network uses
events to predict manipulation actions as they occur, using online inference.
The model succeeds at predicting actions early on, building up confidence over
time and achieving state-of-the-art classification. Moreover, the
attention-based transformer architecture allows us to study the role of the
spatio-temporal patterns selected by the model. Our experiments show that the
Transformer network captures action dynamic features, outperforming video-based
approaches and succeeding in scenarios where the differences between actions
lie in very subtle cues. Finally, we release the new event dataset, which is
the first in the literature for manipulation action recognition. Code will be
available at https://github.com/DaniDeniz/EventVisionTransformer.
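Below is a minimal, hypothetical sketch of the approach described above (PyTorch; not the authors' released implementation, which will appear in the linked repository). It assumes asynchronous events are binned into per-polarity count frames, the frame sequence is embedded and passed through a causally masked Transformer encoder, and class logits are emitted at every time step so the prediction can be refined online as new events arrive. All names, resolutions, and hyperparameters here are illustrative assumptions.

```python
# Hypothetical sketch of event-frame binning + causal Transformer for online
# action prediction; not the paper's implementation.
import torch
import torch.nn as nn

class EventActionPredictor(nn.Module):
    def __init__(self, num_classes=8, frame_hw=(128, 128),
                 d_model=256, n_layers=4, n_heads=8, max_len=512):
        super().__init__()
        h, w = frame_hw
        # Each event frame (2 polarity channels) is flattened and projected to one token.
        self.embed = nn.Linear(2 * h * w, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positions (assumed)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, frames):                      # frames: (B, T, 2, H, W)
        t = frames.shape[1]
        tokens = self.embed(frames.flatten(2)) + self.pos[:, :t]
        # Causal mask: the prediction at step t attends only to frames <= t,
        # which is what allows online (anytime) inference.
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(frames.device)
        z = self.encoder(tokens, mask=mask)
        return self.head(z)                         # per-step logits: (B, T, num_classes)

def events_to_frames(events, duration, n_bins, hw=(128, 128)):
    """Bin asynchronous events (t, x, y, polarity in {0, 1}) into count frames."""
    frames = torch.zeros(n_bins, 2, *hw)
    for t, x, y, p in events:
        b = min(int(t / duration * n_bins), n_bins - 1)
        frames[b, int(p), int(y), int(x)] += 1.0
    return frames
```

At inference time, one would re-run the model (or cache past tokens) each time a new frame is accumulated and read only the logits of the latest step; the softmax confidence of the winning class would then be expected to grow as more of the action is observed, mirroring the early-prediction behaviour reported in the abstract.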
Related papers
- E-Motion: Future Motion Simulation via Event Sequence Diffusion [86.80533612211502]
Event-based sensors may offer a unique opportunity to predict future motion with a level of detail and precision previously unachievable.
We propose to integrate the strong learning capacity of the video diffusion model with the rich motion information of an event camera as a motion simulation framework.
Our findings suggest a promising direction for future research in enhancing the interpretative power and predictive accuracy of computer vision systems.
arXiv Detail & Related papers (2024-10-11T09:19:23Z) - Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
The experimental results demonstrate that MPI exhibits remarkable improvements of 10% to 64% compared with the previous state-of-the-art on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z) - Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z) - What Makes Pre-Trained Visual Representations Successful for Robust
Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z) - EventTransAct: A video transformer-based framework for Event-camera
based action recognition [52.537021302246664]
Event cameras offer new opportunities for action recognition compared to standard RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
In order to better adapt the VTN to the sparse and fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
arXiv Detail & Related papers (2023-08-25T23:51:07Z) - A Framework for Multisensory Foresight for Embodied Agents [11.351546861334292]
Predicting future sensory states is crucial for learning agents such as robots, drones, and autonomous vehicles.
In this paper, we couple multiple sensory modalities with exploratory actions and propose a predictive neural network architecture to address this problem.
The framework was tested and validated with a dataset containing 4 sensory modalities (vision, haptic, audio, and tactile) on a humanoid robot performing 9 behaviors multiple times on a large set of objects.
arXiv Detail & Related papers (2021-09-15T20:20:04Z) - Object and Relation Centric Representations for Push Effect Prediction [18.990827725752496]
Pushing is an essential non-prehensile manipulation skill used for tasks ranging from pre-grasp manipulation to scene rearrangement.
We propose a graph neural network based framework for effect prediction and parameter estimation of pushing actions.
Our framework is validated both in real and simulated environments containing different shaped multi-part objects connected via different types of joints and objects with different masses.
arXiv Detail & Related papers (2021-02-03T15:09:12Z) - End-to-end Contextual Perception and Prediction with Interaction
Transformer [79.14001602890417]
We tackle the problem of detecting objects in 3D and forecasting their future motion in the context of self-driving.
To capture their spatial-temporal dependencies, we propose a recurrent neural network with a novel Transformer architecture.
Our model can be trained end-to-end, and runs in real-time.
arXiv Detail & Related papers (2020-08-13T14:30:12Z) - TTPP: Temporal Transformer with Progressive Prediction for Efficient
Action Anticipation [46.28067541184604]
Video action anticipation aims to predict future action categories from observed frames.
Current state-of-the-art approaches mainly resort to recurrent neural networks to encode history information into hidden states.
This paper proposes a simple yet efficient Temporal Transformer with Progressive Prediction framework.
arXiv Detail & Related papers (2020-03-07T07:59:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.