Related papers: A Novel Spike Transformer Network for Depth Estimation from Event Cameras via Cross-modality Knowledge Distillation

A Novel Spike Transformer Network for Depth Estimation from Event Cameras via Cross-modality Knowledge Distillation

URL: http://arxiv.org/abs/2404.17335v2
Date: Wed, 1 May 2024 08:54:54 GMT
Title: A Novel Spike Transformer Network for Depth Estimation from Event Cameras via Cross-modality Knowledge Distillation
Authors: Xin Zhang, Liangxiu Han, Tam Sobeih, Lianghao Han, Darren Dancey,
Abstract summary: Event cameras operate differently from traditional digital cameras, continuously capturing data and generating binary spikes that encode time, location, and light intensity. This necessitates the development of innovative, spike-aware algorithms tailored for event cameras. We propose a purely spike-driven spike transformer network for depth estimation from spiking camera data.
Score: 3.355813093377501
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Depth estimation is crucial for interpreting complex environments, especially in areas such as autonomous vehicle navigation and robotics. Nonetheless, obtaining accurate depth readings from event camera data remains a formidable challenge. Event cameras operate differently from traditional digital cameras, continuously capturing data and generating asynchronous binary spikes that encode time, location, and light intensity. Yet, the unique sampling mechanisms of event cameras render standard image based algorithms inadequate for processing spike data. This necessitates the development of innovative, spike-aware algorithms tailored for event cameras, a task compounded by the irregularity, continuity, noise, and spatial and temporal characteristics inherent in spiking data.Harnessing the strong generalization capabilities of transformer neural networks for spatiotemporal data, we propose a purely spike-driven spike transformer network for depth estimation from spiking camera data. To address performance limitations with Spiking Neural Networks (SNN), we introduce a novel single-stage cross-modality knowledge transfer framework leveraging knowledge from a large vision foundational model of artificial neural networks (ANN) (DINOv2) to enhance the performance of SNNs with limited data. Our experimental results on both synthetic and real datasets show substantial improvements over existing models, with notable gains in Absolute Relative and Square Relative errors (49% and 39.77% improvements over the benchmark model Spike-T, respectively). Besides accuracy, the proposed model also demonstrates reduced power consumptions, a critical factor for practical applications.

Related papers

BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNN) We propose BHViT, a binarization-friendly hybrid ViT architecture and its full binarization model with the guidance of three important observations. Our proposed algorithm achieves SOTA performance among binary ViT methods.
arXiv Detail & Related papers (2025-03-04T08:35:01Z)
ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks. We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation. Our purelytemporalal architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z)
Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge. Existing methods struggle to balance high model performance with low resource consumption. We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
SDformerFlow: Spatiotemporal swin spikeformer for event-based optical flow estimation [10.696635172502141]
Event cameras generate asynchronous and sparse event streams capturing changes in light intensity. Spiking neural networks (SNNs) share similar asynchronous and sparse characteristics and are well-suited for event cameras. We propose two solutions for fast and robust optical flow estimation for event cameras: STTFlowNet and SDFlowformer.
arXiv Detail & Related papers (2024-09-06T07:48:18Z)
Event-Stream Super Resolution using Sigma-Delta Neural Network [0.10923877073891444]
Event cameras present unique challenges due to their low resolution and sparse, asynchronous nature of the data they collect. Current event super-resolution algorithms are not fully optimized for the distinct data structure produced by event cameras. Research proposes a method that integrates binary spikes with Sigma Delta Neural Networks (SDNNs)
arXiv Detail & Related papers (2024-08-13T15:25:18Z)
ELA: Efficient Local Attention for Deep Convolutional Neural Networks [15.976475674061287]
This paper introduces an Efficient Local Attention (ELA) method that achieves substantial performance improvements with a simple structure. To overcome these challenges, we propose the incorporation of 1D convolution and Group Normalization feature enhancement techniques. ELA can be seamlessly integrated into deep CNN networks such as ResNet, MobileNet, and DeepLab.
arXiv Detail & Related papers (2024-03-02T08:06:18Z)
Efficient Multi-domain Text Recognition Deep Neural Network Parameterization with Residual Adapters [4.454976752204893]
This study presents a novel neural network model adept at optical character recognition (OCR) across diverse domains. The model is designed to achieve rapid adaptation to new domains, maintain a compact size conducive to reduced computational resource demand, ensure high accuracy, retain knowledge from previous learning experiences, and allow for domain-specific performance improvements without the need to retrain entirely.
arXiv Detail & Related papers (2024-01-01T23:01:40Z)
ADU-Depth: Attention-based Distillation with Uncertainty Modeling for Depth Estimation [11.92011909884167]
We introduce spatial cues by training a teacher network that leverages left-right image pairs as inputs. We apply both attention-adapted feature distillation and focal-depth-adapted response distillation in the training stage. Our experiments on the real depth estimation datasets KITTI and DrivingStereo demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-09-26T08:12:37Z)
Computation-efficient Deep Learning for Computer Vision: A Survey [121.84121397440337]
Deep learning models have reached or even exceeded human-level performance in a range of visual perception tasks. Deep learning models usually demand significant computational resources, leading to impractical power consumption, latency, or carbon emissions in real-world scenarios. New research focus is computationally efficient deep learning, which strives to achieve satisfactory performance while minimizing the computational cost during inference.
arXiv Detail & Related papers (2023-08-27T03:55:28Z)
Training Robust Spiking Neural Networks with ViewPoint Transform and SpatioTemporal Stretching [4.736525128377909]
We propose a novel data augmentation method, ViewPoint Transform and Spatio Stretching (VPT-STS) It improves the robustness of spiking neural networks by transforming the rotation centers and angles in thetemporal domain to generate samples from different viewpoints. Experiments on prevailing neuromorphic datasets demonstrate that VPT-STS is broadly effective on multi-event representations and significantly outperforms pure spatial geometric transformations.
arXiv Detail & Related papers (2023-03-14T03:09:56Z)
Optical flow estimation from event-based cameras and spiking neural networks [0.4899818550820575]
Event-based sensors are an excellent fit for Spiking Neural Networks (SNNs) We propose a U-Net-like SNN which, after supervised training, is able to make dense optical flow estimations. Thanks to separable convolutions, we have been able to develop a light model that can nonetheless yield reasonably accurate optical flow estimates.
arXiv Detail & Related papers (2023-02-13T16:17:54Z)
Depth Monocular Estimation with Attention-based Encoder-Decoder Network from Single Image [7.753378095194288]
Vision-based approaches have recently received much attention and can overcome these drawbacks. In this work, we explore an extreme scenario in vision-based settings: estimate a depth map from one monocular image severely plagued by grid artifacts and blurry edges. Our novel approach can find the focus of current image with minimal overhead and avoid losses of depth features.
arXiv Detail & Related papers (2022-10-24T23:01:25Z)
Unsupervised Spike Depth Estimation via Cross-modality Cross-domain Knowledge Transfer [53.413305467674434]
We introduce open-source RGB data to support spike depth estimation, leveraging its annotations and spatial information. We propose a cross-modality cross-domain (BiCross) framework to realize unsupervised spike depth estimation. Our method achieves state-of-the-art (SOTA) performances, compared with RGB-oriented unsupervised depth estimation methods.
arXiv Detail & Related papers (2022-08-26T09:35:20Z)
Hybrid SNN-ANN: Energy-Efficient Classification and Object Detection for Event-Based Vision [64.71260357476602]
Event-based vision sensors encode local pixel-wise brightness changes in streams of events rather than image frames. Recent progress in object recognition from event-based sensors has come from conversions of deep neural networks. We propose a hybrid architecture for end-to-end training of deep neural networks for event-based pattern recognition and object detection.
arXiv Detail & Related papers (2021-12-06T23:45:58Z)
Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers. This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
Learning Frequency-aware Dynamic Network for Efficient Super-Resolution [56.98668484450857]
This paper explores a novel frequency-aware dynamic network for dividing the input into multiple parts according to its coefficients in the discrete cosine transform (DCT) domain. In practice, the high-frequency part will be processed using expensive operations and the lower-frequency part is assigned with cheap operations to relieve the computation burden. Experiments conducted on benchmark SISR models and datasets show that the frequency-aware dynamic network can be employed for various SISR neural architectures.
arXiv Detail & Related papers (2021-03-15T12:54:26Z)
Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction [51.072733683919246]
We introduce Recurrent Asynchronous Multimodal (RAM) networks to handle asynchronous and irregular data from multiple sensors. Inspired by traditional RNNs, RAM networks maintain a hidden state that is updated asynchronously and can be queried at any time to generate a prediction. We show an improvement over state-of-the-art methods by up to 30% in terms of mean depth absolute error.
arXiv Detail & Related papers (2021-02-18T13:24:35Z)
Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video. Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer. To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
Learning Monocular Dense Depth from Events [53.078665310545745]
Event cameras produce brightness changes in the form of a stream of asynchronous events instead of intensity frames. Recent learning-based approaches have been applied to event-based data, such as monocular depth prediction. We propose a recurrent architecture to solve this task and show significant improvement over standard feed-forward methods.
arXiv Detail & Related papers (2020-10-16T12:36:23Z)
On Deep Learning Techniques to Boost Monocular Depth Estimation for Autonomous Navigation [1.9007546108571112]
Inferring the depth of images is a fundamental inverse problem within the field of Computer Vision. We propose a new lightweight and fast supervised CNN architecture combined with novel feature extraction models. We also introduce an efficient surface normals module, jointly with a simple geometric 2.5D loss function, to solve SIDE problems.
arXiv Detail & Related papers (2020-10-13T18:37:38Z)
Event-based Asynchronous Sparse Convolutional Networks [54.094244806123235]
Event cameras are bio-inspired sensors that respond to per-pixel brightness changes in the form of asynchronous and sparse "events" We present a general framework for converting models trained on synchronous image-like event representations into asynchronous models with identical output. We show both theoretically and experimentally that this drastically reduces the computational complexity and latency of high-capacity, synchronous neural networks.
arXiv Detail & Related papers (2020-03-20T08:39:49Z)
Spike-FlowNet: Event-based Optical Flow Estimation with Energy-Efficient Hybrid Neural Networks [40.44712305614071]
We present Spike-FlowNet, a deep hybrid neural network architecture integrating SNNs and ANNs for efficiently estimating optical flow from sparse event camera outputs. The network is end-to-end trained with self-supervised learning on Multi-Vehicle Stereo Event Camera (MVSEC) dataset.
arXiv Detail & Related papers (2020-03-14T20:37:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.