Combining Events and Frames using Recurrent Asynchronous Multimodal
Networks for Monocular Depth Prediction
- URL: http://arxiv.org/abs/2102.09320v1
- Date: Thu, 18 Feb 2021 13:24:35 GMT
- Title: Combining Events and Frames using Recurrent Asynchronous Multimodal
Networks for Monocular Depth Prediction
- Authors: Daniel Gehrig, Michelle Rüegg, Mathias Gehrig, Javier Hidalgo-Carrió, Davide Scaramuzza
- Abstract summary: We introduce Recurrent Asynchronous Multimodal (RAM) networks to handle asynchronous and irregular data from multiple sensors.
Inspired by traditional RNNs, RAM networks maintain a hidden state that is updated asynchronously and can be queried at any time to generate a prediction.
We show an improvement over state-of-the-art methods by up to 30% in terms of mean absolute depth error.
- Score: 51.072733683919246
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Event cameras are novel vision sensors that report per-pixel brightness
changes as a stream of asynchronous "events". They offer significant advantages
compared to standard cameras due to their high temporal resolution, high
dynamic range and lack of motion blur. However, events only measure the varying
component of the visual signal, which limits their ability to encode scene
context. By contrast, standard cameras measure absolute intensity frames, which
capture a much richer representation of the scene. Both sensors are thus
complementary. However, due to the asynchronous nature of events, combining
them with synchronous images remains challenging, especially for learning-based
methods. This is because traditional recurrent neural networks (RNNs) are not
designed for asynchronous and irregular data from additional sensors. To
address this challenge, we introduce Recurrent Asynchronous Multimodal (RAM)
networks, which generalize traditional RNNs to handle asynchronous and
irregular data from multiple sensors. Inspired by traditional RNNs, RAM
networks maintain a hidden state that is updated asynchronously and can be
queried at any time to generate a prediction. We apply this novel architecture
to monocular depth estimation with events and frames where we show an
improvement over state-of-the-art methods by up to 30% in terms of mean
absolute depth error. To enable further research on multimodal learning with
events, we release EventScape, a new dataset with events, intensity frames,
semantic labels, and depth maps recorded in the CARLA simulator.
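
The mechanism described in the abstract, a per-sensor hidden state that is updated whenever new data arrives and can be decoded into a prediction at an arbitrary query time, can be illustrated with a short sketch. This is not the authors' RAM architecture; the class name `SimpleRAM`, the use of GRU cells, the feature dimensions, and the linear decoder are assumptions made purely for illustration.

```python
# Minimal sketch of an asynchronously updated recurrent state that can be
# queried at any time. Illustrative only; NOT the authors' RAM architecture.
import torch
import torch.nn as nn


class SimpleRAM(nn.Module):
    def __init__(self, event_dim=64, frame_dim=64, hidden_dim=128, out_dim=1):
        super().__init__()
        # One recurrent cell per sensor; each maintains its own hidden state.
        self.event_cell = nn.GRUCell(event_dim, hidden_dim)
        self.frame_cell = nn.GRUCell(frame_dim, hidden_dim)
        # The decoder reads the fused states whenever a prediction is requested.
        self.decoder = nn.Linear(2 * hidden_dim, out_dim)
        self.hidden_dim = hidden_dim
        self.h_event = None
        self.h_frame = None

    def _zeros(self, batch, device):
        return torch.zeros(batch, self.hidden_dim, device=device)

    def update_events(self, event_feat):
        # Called whenever new event features arrive (high, irregular rate).
        if self.h_event is None:
            self.h_event = self._zeros(event_feat.size(0), event_feat.device)
        self.h_event = self.event_cell(event_feat, self.h_event)

    def update_frame(self, frame_feat):
        # Called whenever a new image frame arrives (lower, fixed rate).
        if self.h_frame is None:
            self.h_frame = self._zeros(frame_feat.size(0), frame_feat.device)
        self.h_frame = self.frame_cell(frame_feat, self.h_frame)

    def query(self):
        # A prediction can be generated at any time from the latest states,
        # regardless of which sensor was updated most recently.
        ref = self.h_event if self.h_event is not None else self.h_frame
        h_e = self.h_event if self.h_event is not None else self._zeros(ref.size(0), ref.device)
        h_f = self.h_frame if self.h_frame is not None else self._zeros(ref.size(0), ref.device)
        return self.decoder(torch.cat([h_e, h_f], dim=-1))


if __name__ == "__main__":
    ram = SimpleRAM()
    ram.update_events(torch.randn(1, 64))   # events arrive asynchronously
    ram.update_frame(torch.randn(1, 64))    # frames arrive at their own rate
    prediction = ram.query()                # query at an arbitrary time
    print(prediction.shape)
```

The key design point this sketch mirrors is that each modality writes into a persistent state at its own rate, so the decoder never has to wait for synchronized inputs.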
Related papers
- BlinkTrack: Feature Tracking over 100 FPS via Events and Images [50.98675227695814]
We propose a novel framework, BlinkTrack, which integrates event data with RGB images for high-frequency feature tracking.
Our method extends the traditional Kalman filter into a learning-based framework, utilizing differentiable Kalman filters in both event and image branches.
Experimental results indicate that BlinkTrack significantly outperforms existing event-based methods.
arXiv Detail & Related papers (2024-09-26T15:54:18Z) - Revisiting Event-based Video Frame Interpolation [49.27404719898305]
Dynamic vision sensors, or event cameras, provide rich complementary information for video frame interpolation.
However, estimating optical flow from events is arguably more difficult than from RGB information.
We propose a divide-and-conquer strategy in which event-based intermediate frame synthesis happens incrementally in multiple simplified stages.
arXiv Detail & Related papers (2023-07-24T06:51:07Z) - Deformable Convolutions and LSTM-based Flexible Event Frame Fusion
Network for Motion Deblurring [7.187030024676791]
Event cameras differ from conventional RGB cameras in that they produce asynchronous data sequences.
While RGB cameras capture every frame at a fixed rate, event cameras only capture changes in the scene, resulting in sparse and asynchronous data output.
Recent state-of-the-art CNN-based deblurring solutions produce multiple 2-D event frames based on the accumulation of event data over a time period.
Such a flexible fusion approach is particularly useful for scenarios in which exposure times vary depending on factors such as lighting conditions or the presence of fast-moving objects in the scene.
arXiv Detail & Related papers (2023-06-01T15:57:12Z) - Optical flow estimation from event-based cameras and spiking neural
networks [0.4899818550820575]
Event-based sensors are an excellent fit for Spiking Neural Networks (SNNs).
We propose a U-Net-like SNN which, after supervised training, is able to make dense optical flow estimations.
Thanks to separable convolutions, we have been able to develop a light model that can nonetheless yield reasonably accurate optical flow estimates.
arXiv Detail & Related papers (2023-02-13T16:17:54Z) - Asynchronous Optimisation for Event-based Visual Odometry [53.59879499700895]
Event cameras open up new possibilities for robotic perception due to their low latency and high dynamic range.
We focus on event-based visual odometry (VO)
We propose an asynchronous structure-from-motion optimisation back-end.
arXiv Detail & Related papers (2022-03-02T11:28:47Z) - Fusion-FlowNet: Energy-Efficient Optical Flow Estimation using Sensor
Fusion and Deep Fused Spiking-Analog Network Architectures [7.565038387344594]
We present a sensor fusion framework for energy-efficient optical flow estimation using both frame- and event-based sensors.
Our network is end-to-end trained using unsupervised learning to avoid expensive video annotations.
arXiv Detail & Related papers (2021-03-19T02:03:33Z) - Learning Monocular Dense Depth from Events [53.078665310545745]
Event cameras output brightness changes in the form of a stream of asynchronous events instead of intensity frames.
Recent learning-based approaches have been applied to event-based data, such as monocular depth prediction.
We propose a recurrent architecture to solve this task and show significant improvement over standard feed-forward methods.
arXiv Detail & Related papers (2020-10-16T12:36:23Z) - Event-based Asynchronous Sparse Convolutional Networks [54.094244806123235]
Event cameras are bio-inspired sensors that respond to per-pixel brightness changes in the form of asynchronous and sparse "events".
We present a general framework for converting models trained on synchronous image-like event representations into asynchronous models with identical output.
We show both theoretically and experimentally that this drastically reduces the computational complexity and latency of high-capacity, synchronous neural networks.
arXiv Detail & Related papers (2020-03-20T08:39:49Z)