EventTransAct: A video transformer-based framework for Event-camera
based action recognition
- URL: http://arxiv.org/abs/2308.13711v1
- Date: Fri, 25 Aug 2023 23:51:07 GMT
- Title: EventTransAct: A video transformer-based framework for Event-camera
based action recognition
- Authors: Tristan de Blegiers, Ishan Rajendrakumar Dave, Adeel Yousaf, Mubarak
Shah
- Abstract summary: Event cameras offer new opportunities for action recognition compared to standard RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
To better adapt the VTN to the sparse and fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
- Score: 52.537021302246664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing and comprehending human actions and gestures is a crucial
perception requirement for robots to interact with humans and carry out tasks
in diverse domains, including service robotics, healthcare, and manufacturing.
Event cameras, with their ability to capture fast-moving objects at a high
temporal resolution, offer new opportunities for action recognition compared
to standard RGB videos. However, previous research on event-camera action
recognition has primarily focused on sensor-specific network architectures and
image encoding, which may not be suitable for new sensors and limit the use of
recent advancements in transformer-based architectures. In this study, we
employ a computationally efficient model, namely the video transformer network
(VTN), which initially acquires spatial embeddings per event-frame and then
utilizes a temporal self-attention mechanism. To better adapt the VTN to the
sparse and fine-grained nature of event data, we design an
Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
The proposed $\mathcal{L}_{EC}$ promotes learning fine-grained spatial cues in
the spatial backbone of the VTN by contrasting temporally misaligned frames. We
evaluate our method on real-world action recognition using the N-EPIC-Kitchens
dataset and achieve state-of-the-art results on both protocols: testing in the
seen kitchen (\textbf{74.9\%} accuracy) and testing in unseen kitchens
(\textbf{42.43\%} and \textbf{46.66\%} accuracy). Our approach also requires
less computation time than competitive prior approaches, which demonstrates
the potential of our framework \textit{EventTransAct} for real-world
applications of event-camera based action recognition. Project Page:
\url{https://tristandb8.github.io/EventTransAct_webpage/}
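A minimal PyTorch sketch of the two ideas described above: per-event-frame spatial embeddings followed by temporal self-attention, and a frame-level contrastive loss that treats temporally aligned frames of two augmented views as positives and misaligned frames of the same clip as negatives. All module names, layer sizes, and the temperature are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a VTN-style event action recognizer with a
# frame-level contrastive loss; layer choices and hyperparameters are
# assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EventVTNSketch(nn.Module):
    def __init__(self, num_classes, embed_dim=512):
        super().__init__()
        # Spatial backbone: embeds each event-frame independently.
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Temporal module: self-attention over the per-frame embeddings.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, frames):
        # frames: (B, T, C, H, W) stack of event-frames for one clip.
        b, t, c, h, w = frames.shape
        feats = self.spatial(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        ctx = self.temporal(feats)            # (B, T, D) temporally contextualized
        return self.head(ctx.mean(dim=1)), feats


def event_contrastive_loss(feats_a, feats_b, temperature=0.1):
    """InfoNCE-style frame-level loss: each frame embedding of view A is pulled
    toward the temporally aligned frame of view B and pushed away from the
    temporally misaligned frames of the same clip."""
    a = F.normalize(feats_a, dim=-1)          # (T, D)
    b = F.normalize(feats_b, dim=-1)          # (T, D)
    logits = a @ b.t() / temperature          # (T, T) frame-to-frame similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)
```

A training step could then combine the clip-level classification loss with the contrastive term, e.g. `loss = ce_loss + lambda_ec * event_contrastive_loss(feats_a[0], feats_b[0])`, where `lambda_ec` is an assumed weighting factor.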
Related papers
- Spatio-temporal Transformers for Action Unit Classification with Event Cameras [28.98336123799572]
We present FACEMORPHIC, a temporally synchronized multimodal face dataset composed of RGB videos and event streams.
We show how temporal synchronization can allow effective neuromorphic face analysis without the need to manually annotate videos.
arXiv Detail & Related papers (2024-10-29T11:23:09Z)
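The FACEMORPHIC entry above leans on temporal synchronization between the RGB and event streams: labels computed on the RGB frames can be projected onto the event stream purely by matching timestamps. A hedged illustration of that bookkeeping (the field names and the 33 ms window are assumptions, not the dataset's actual schema):

```python
# Illustrative only: transfer per-frame labels from an RGB video to a
# time-synchronized event stream by matching timestamps.
from bisect import bisect_left


def label_event_windows(event_ts, rgb_ts, rgb_labels, window=0.033):
    """event_ts: sorted event timestamps (seconds).
    rgb_ts / rgb_labels: per-frame timestamps and labels from the RGB stream.
    Returns one (label, events in that frame's window) pair per RGB frame."""
    out = []
    for ts, label in zip(rgb_ts, rgb_labels):
        lo = bisect_left(event_ts, ts)
        hi = bisect_left(event_ts, ts + window)
        out.append((label, event_ts[lo:hi]))
    return out
```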
- EV-Catcher: High-Speed Object Catching Using Low-latency Event-based Neural Networks [107.62975594230687]
We demonstrate an application where event cameras excel: accurately estimating the impact location of fast-moving objects.
We introduce a lightweight event representation called Binary Event History Image (BEHI) to encode event data at low latency.
We show that the system can achieve an 81% success rate in catching balls targeted at different locations, with velocities of up to 13 m/s, even on compute-constrained embedded platforms.
arXiv Detail & Related papers (2023-04-14T15:23:28Z)
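A plausible reading of the Binary Event History Image used in EV-Catcher: every pixel that fired at least one event inside the current history window is set to 1, producing a single binary frame that is cheap to build and consume on embedded hardware. The interface and window handling below are assumptions based on the abstract, not the authors' code.

```python
# Sketch of a BEHI-style encoding: mark every pixel that produced an event
# within the recent history window.
import numpy as np


def behi(xs, ys, ts, height, width, t_now, history=0.01):
    """xs, ys, ts: event coordinates and timestamps (seconds), as arrays.
    Returns a (height, width) uint8 image with 1 where recent events occurred."""
    img = np.zeros((height, width), dtype=np.uint8)
    recent = ts >= (t_now - history)
    img[ys[recent], xs[recent]] = 1
    return img
```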
- How Many Events do You Need? Event-based Visual Place Recognition Using Sparse But Varying Pixels [29.6328152991222]
One of the potential applications of event camera research lies in visual place recognition for robot localization.
We show that the absolute difference in the number of events, accumulated into event frames at a sparse set of pixel locations, can be sufficient for the place recognition task.
We evaluate our proposed approach on the Brisbane-Event-VPR dataset in an outdoor driving scenario, as well as the newly contributed indoor QCR-Event-VPR dataset.
arXiv Detail & Related papers (2022-06-28T00:24:12Z)
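The place-recognition cue described above is deliberately lightweight: accumulate per-pixel event counts into a frame, read them at a small set of informative pixels, and match places by the absolute difference of those counts. A hedged sketch, assuming the pixel set is given:

```python
# Sketch of event-count place matching: compare a query against reference
# places by summed absolute differences of event counts at selected pixels.
import numpy as np


def counts_at_pixels(xs, ys, pixels, height, width):
    """Accumulate an event-count frame and read it at (x, y) pixel locations."""
    counts = np.zeros((height, width), dtype=np.int64)
    np.add.at(counts, (ys, xs), 1)                 # per-pixel event counts
    return counts[pixels[:, 1], pixels[:, 0]]


def best_match(query_counts, reference_counts):
    """reference_counts: (num_places, num_pixels). Lower L1 distance = better."""
    dists = np.abs(reference_counts - query_counts).sum(axis=1)
    return int(dists.argmin())
```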
- Event Transformer [43.193463048148374]
The event camera's low power consumption and ability to capture brightness changes at microsecond resolution make it attractive for various computer vision tasks.
Existing event representation methods typically convert events into frames, voxel grids, or spikes for deep neural networks (DNNs).
This work introduces a novel token-based event representation, where each event is considered a fundamental processing unit termed an event-token.
arXiv Detail & Related papers (2022-04-11T15:05:06Z)
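One generic way to realize the event-token idea above: embed each raw (x, y, t, polarity) tuple with a small MLP and feed the resulting sequence to a standard transformer encoder. This is a sketch of the general concept, not the Event Transformer architecture itself.

```python
# Generic sketch of a token-per-event representation: each (x, y, t, p)
# tuple becomes one transformer token.
import torch.nn as nn


class EventTokenEncoder(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(4, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, events):
        # events: (B, N, 4) normalized (x, y, t, polarity) tuples.
        return self.encoder(self.embed(events))   # (B, N, embed_dim) event tokens
```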
- Hybrid SNN-ANN: Energy-Efficient Classification and Object Detection for Event-Based Vision [64.71260357476602]
Event-based vision sensors encode local pixel-wise brightness changes in streams of events rather than image frames.
Recent progress in object recognition from event-based sensors has come from conversions of deep neural networks.
We propose a hybrid architecture for end-to-end training of deep neural networks for event-based pattern recognition and object detection.
arXiv Detail & Related papers (2021-12-06T23:45:58Z)
- Bridging the Gap between Events and Frames through Unsupervised Domain Adaptation [57.22705137545853]
We propose a task transfer method that allows models to be trained directly with labeled images and unlabeled event data.
We leverage the generative event model to split event features into content and motion features.
Our approach unlocks the vast amount of existing image datasets for the training of event-based neural networks.
arXiv Detail & Related papers (2021-09-06T17:31:37Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate the spatial-temporal kernels of dynamic-scale to adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Neuromorphic Eye-in-Hand Visual Servoing [0.9949801888214528]
Event cameras give human-like vision capabilities with low latency and wide dynamic range.
We present a visual servoing method using an event camera and a switching control strategy to explore, reach and grasp.
Experiments prove the effectiveness of the method to track and grasp objects of different shapes without the need for re-tuning.
arXiv Detail & Related papers (2020-04-15T23:57:54Z)
- Event-based Asynchronous Sparse Convolutional Networks [54.094244806123235]
Event cameras are bio-inspired sensors that respond to per-pixel brightness changes in the form of asynchronous and sparse "events".
We present a general framework for converting models trained on synchronous image-like event representations into asynchronous models with identical output.
We show both theoretically and experimentally that this drastically reduces the computational complexity and latency of high-capacity, synchronous neural networks.
arXiv Detail & Related papers (2020-03-20T08:39:49Z)
- A Differentiable Recurrent Surface for Asynchronous Event-Based Data [19.605628378366667]
We propose Matrix-LSTM, a grid of Long Short-Term Memory (LSTM) cells that efficiently process events and learn end-to-end task-dependent event-surfaces.
Compared to existing reconstruction approaches, our learned event-surface shows good flexibility and improved performance on optical flow estimation.
It improves the state-of-the-art of event-based object classification on the N-Cars dataset.
arXiv Detail & Related papers (2020-01-10T14:09:40Z)
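A naive, readable version of the Matrix-LSTM idea above: one shared LSTM cell is stepped, pixel by pixel, as events arrive, and the final hidden states form a learned (H, W, hidden) event surface. The per-event Python loop and the chosen input features (polarity plus a time delta) are simplifying assumptions; the original work uses an efficient batched formulation.

```python
# Naive sketch of a Matrix-LSTM-style learned event surface (not the
# paper's optimized implementation).
import torch
import torch.nn as nn


class MatrixLSTMSurface(nn.Module):
    def __init__(self, hidden=8):
        super().__init__()
        self.hidden = hidden
        self.cell = nn.LSTMCell(input_size=2, hidden_size=hidden)

    def forward(self, events, height, width):
        # events: (N, 4) rows of (x, y, t, polarity), sorted by time.
        state = {}  # pixel index -> (h, c, last event time)
        for x, y, t, p in events.tolist():
            idx = int(y) * width + int(x)
            h, c, last_t = state.get(
                idx, (torch.zeros(1, self.hidden), torch.zeros(1, self.hidden), t)
            )
            feat = torch.tensor([[p, t - last_t]])        # per-event input features
            h, c = self.cell(feat, (h, c))
            state[idx] = (h, c, t)
        surface = torch.zeros(height, width, self.hidden)
        for idx, (h, _, _) in state.items():
            surface[idx // width, idx % width] = h[0]     # final hidden state per pixel
        return surface
```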
This list is automatically generated from the titles and abstracts of the papers on this site.