Event-VPR: End-to-End Weakly Supervised Network Architecture for
Event-based Visual Place Recognition
- URL: http://arxiv.org/abs/2011.03290v1
- Date: Fri, 6 Nov 2020 11:32:04 GMT
- Title: Event-VPR: End-to-End Weakly Supervised Network Architecture for
Event-based Visual Place Recognition
- Authors: Delei Kong, Zheng Fang, Haojia Li, Kuanxu Hou, Sonya Coleman and
Dermot Kerr
- Abstract summary: We propose an end-to-end visual place recognition network for event cameras.
The proposed algorithm first characterizes the event streams with an EST voxel grid, then extracts features using a convolutional network, and finally aggregates them using an improved VLAD network.
Experimental results show that the proposed method outperforms classical VPR methods in challenging scenarios.
- Score: 9.371066729205268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional visual place recognition (VPR) methods generally use frame-based
cameras, which are prone to failure under dramatic illumination changes or fast
motion. In this paper, we propose an end-to-end visual place recognition
network for event cameras, which can achieve good place recognition performance
in challenging environments. The key idea is to first characterize the event
streams with an EST voxel grid, then extract features with a convolutional
network, and finally aggregate those features with an improved VLAD network,
realizing end-to-end visual place recognition directly from event streams. To
verify the effectiveness of the proposed algorithm, we
compare the proposed method with classical VPR methods on the event-based
driving datasets (MVSEC, DDD17) and a synthetic dataset (Oxford RobotCar).
Experimental results show that the proposed method can achieve much better
performance in challenging scenarios. To our knowledge, this is the first
end-to-end event-based VPR method. The accompanying source code is available at
https://github.com/kongdelei/Event-VPR.
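The three-stage pipeline lends itself to a compact sketch. Below is a minimal PyTorch rendering of the stages named in the abstract: events are rasterized into an EST-style voxel grid, a CNN backbone (omitted here; any standard feature extractor) produces a dense feature map, and a VLAD layer aggregates it into a global descriptor. The bilinear temporal kernel, layer sizes, and cluster count are illustrative assumptions, not the authors' exact settings (EST in the original formulation learns its kernel with a small MLP).

```python
# Minimal sketch of the EST-voxel-grid -> CNN -> VLAD pipeline from the
# abstract. Sizes, the bilinear time kernel, and cluster count are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def events_to_voxel_grid(xs, ys, ts, ps, H, W, bins=9):
    """Accumulate events (x, y, t, polarity in {-1, +1}) into a (bins, H, W)
    grid. Timestamps are scaled to [0, bins - 1] and each event's polarity is
    split bilinearly between the two nearest temporal bins (EST proper learns
    this kernel instead of fixing it)."""
    grid = torch.zeros(bins, H, W)
    t = (ts - ts.min()) / (ts.max() - ts.min() + 1e-9) * (bins - 1)
    t0 = t.floor()
    for b, w in ((t0, 1.0 - (t - t0)),
                 (torch.clamp(t0 + 1, max=bins - 1), t - t0)):
        grid.index_put_((b.long(), ys.long(), xs.long()), ps * w,
                        accumulate=True)
    return grid

class NetVLAD(nn.Module):
    """Simplified VLAD aggregation over a dense (B, D, H, W) feature map."""
    def __init__(self, dim=512, clusters=64):
        super().__init__()
        self.assign = nn.Conv2d(dim, clusters, 1)          # soft assignment
        self.centroids = nn.Parameter(torch.randn(clusters, dim))

    def forward(self, x):
        a = F.softmax(self.assign(x), dim=1).flatten(2)    # (B, K, N)
        f = x.flatten(2)                                   # (B, D, N)
        # sum_n a_kn * (f_n - c_k), written as two terms
        vlad = torch.einsum('bkn,bdn->bkd', a, f) \
             - a.sum(-1).unsqueeze(-1) * self.centroids
        vlad = F.normalize(vlad, dim=2).flatten(1)         # intra-normalization
        return F.normalize(vlad, dim=1)                    # global descriptor
```

A ResNet-style backbone would map the voxel grid to the (B, 512, h, w) input of the VLAD layer; given the paper's title, training presumably follows NetVLAD's weakly supervised triplet ranking scheme, mining positives and negatives from GPS-tagged places.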
Related papers
- EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition [54.55914886780534]
Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion.
We introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR.
EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle-mounted setups to comprehensively capture real-world challenges across diverse viewpoints, weather conditions, and lighting scenarios.
arXiv Detail & Related papers (2026-02-13T13:25:05Z)
- SuperEIO: Self-Supervised Event Feature Learning for Event Inertial Odometry [6.552812892993662]
Event cameras asynchronously output low-latency event streams, making them promising for state estimation under high-speed motion and challenging lighting conditions.
We propose SuperEIO, a novel framework that leverages learning-based event-only detection and IMU measurements to achieve event-inertial odometry.
We evaluate our method extensively on multiple public datasets, demonstrating its superior accuracy and robustness compared to other state-of-the-art event-based methods.
arXiv Detail & Related papers (2025-03-29T03:58:15Z)
- Event Stream-based Visual Object Tracking: HDETrack V2 and A High-Definition Benchmark [36.9654606035663]
We introduce a novel hierarchical knowledge distillation strategy to guide the learning of the student Transformer network.
We adapt the network model to specific target objects during testing via a newly proposed test-time tuning strategy.
We propose EventVOT, the first large-scale high-resolution event-based tracking dataset.
arXiv Detail & Related papers (2025-02-08T13:59:52Z)
- Deep Homography Estimation for Visual Place Recognition [49.235432979736395]
We propose a transformer-based deep homography estimation (DHE) network.
It takes the dense feature map extracted by a backbone network as input and fits a homography for fast and learnable geometric verification.
Experiments on benchmark datasets show that our method can outperform several state-of-the-art methods.
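As a rough illustration of the idea (not the authors' exact architecture), the sketch below regresses a homography between two images from their dense feature maps with a small transformer encoder, using the common 4-point corner-offset parameterization; every dimension and layer count here is an assumption.

```python
# Hypothetical transformer head that regresses 4-point homography offsets
# from a pair of dense feature maps. All sizes are illustrative guesses.
import torch
import torch.nn as nn

class HomographyHead(nn.Module):
    def __init__(self, dim=256, grid=16):
        super().__init__()
        n = grid * grid
        self.pos = nn.Parameter(torch.zeros(1, 2 * n, dim))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, 8)  # (dx, dy) offsets of the 4 image corners

    def forward(self, feat_q, feat_r):            # each: (B, dim, grid, grid)
        tok_q = feat_q.flatten(2).transpose(1, 2)
        tok_r = feat_r.flatten(2).transpose(1, 2)
        x = torch.cat([tok_q, tok_r], dim=1) + self.pos
        x = self.encoder(x).mean(dim=1)           # pool the joint encoding
        return self.head(x)
```

The four predicted corner correspondences determine a 3x3 homography via the standard DLT solve; a retrieval candidate can then be re-scored by how well that homography aligns the two images (one plausible way to use the output for geometric verification).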
arXiv Detail & Related papers (2024-02-25T13:22:17Z)
- Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline [38.42400442371156]
Existing works either utilize aligned RGB and event data for accurate tracking or directly learn an event-based tracker.
We propose a novel hierarchical knowledge distillation framework that can fully utilize multi-modal / multi-view information during training to facilitate knowledge transfer.
We propose the first large-scale high-resolution ($1280 \times 720$) dataset named EventVOT. It contains 1141 videos and covers a wide range of categories such as pedestrians, vehicles, UAVs, ping pongs, etc.
arXiv Detail & Related papers (2023-09-26T01:42:26Z)
- EventTransAct: A video transformer-based framework for Event-camera based action recognition [52.537021302246664]
Event cameras offer new opportunities for action recognition compared to standard RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
In order to better adapt the VTN to the sparse and fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
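The paper defines its own loss; as a generic stand-in for what an event-contrastive objective typically looks like, the sketch below computes a symmetric InfoNCE loss between embeddings of two event-specific augmentations of the same clips (the temperature and batch layout are assumptions, not the paper's $\mathcal{L}_{EC}$).

```python
# Generic InfoNCE-style contrastive loss between two augmented views of each
# event clip; a stand-in, not the paper's exact L_EC formulation.
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.07):
    """z1, z2: (B, D) embeddings of two augmentations of the same B clips."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # matching views sit on the diagonal; all other pairs act as negatives
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```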
arXiv Detail & Related papers (2023-08-25T23:51:07Z)
- Cross-modal Place Recognition in Image Databases using Event-based Sensors [28.124708490967713]
We present the first cross-modal visual place recognition framework that is capable of retrieving regular images from a database given an event query.
Our method demonstrates promising results with respect to the state-of-the-art frame-based and event-based methods on the Brisbane-Event-VPR dataset.
arXiv Detail & Related papers (2023-07-03T14:24:04Z)
- Spiking-Fer: Spiking Neural Network for Facial Expression Recognition With Event Cameras [2.9398911304923447]
"Spiking-FER" is a deep convolutional SNN model, and compare it against a similar Artificial Neural Network (ANN)
Experiments show that the proposed approach achieves comparable performance to the ANN architecture, while consuming less energy by orders of magnitude (up to 65.39x)
arXiv Detail & Related papers (2023-04-20T10:59:56Z)
- Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture brightness change of every pixel in an asynchronous manner.
Event streams are divided into grids in the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as a 3D tensor representation.
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatial-temporal correlation between event pillars.
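The polarity-separated x-y-t grid can be sketched as a hard-binned event histogram; the bin count and count-based values below are illustrative assumptions rather than the paper's exact pillar construction.

```python
# Hypothetical (2, T, H, W) pillar tensor: events hard-binned over x-y-t,
# one channel per polarity. The number of temporal bins is an assumption.
import torch

def events_to_pillars(xs, ys, ts, ps, H, W, t_bins=5):
    """xs, ys: pixel coordinates; ts: timestamps; ps: polarity in {0, 1}."""
    grid = torch.zeros(2, t_bins, H, W)
    t_norm = (ts - ts.min()) / (ts.max() - ts.min() + 1e-9)
    tb = torch.clamp((t_norm * t_bins).long(), max=t_bins - 1)
    grid.index_put_((ps.long(), tb, ys.long(), xs.long()),
                    torch.ones_like(ts), accumulate=True)  # event counts
    return grid  # each (y, x) column across t is one "pillar"
```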
arXiv Detail & Related papers (2023-03-17T12:12:41Z)
- Adaptive Focus for Efficient Video Recognition [29.615394426035074]
We propose a reinforcement learning based approach for efficient spatially adaptive video recognition (AdaFocus)
A lightweight ConvNet is first adopted to quickly process the full video sequence, whose features are used by a recurrent policy network to localize the most task-relevant regions.
During offline inference, once the informative patch sequence has been generated, the bulk of computation can be done in parallel, and is efficient on modern GPU devices.
arXiv Detail & Related papers (2021-05-07T13:24:47Z)
- AR-Net: Adaptive Frame Resolution for Efficient Action Recognition [70.62587948892633]
Action recognition is an open and challenging problem in computer vision.
We propose a novel approach, called AR-Net, that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition.
arXiv Detail & Related papers (2020-07-31T01:36:04Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
- Dynamic Inference: A New Approach Toward Efficient Video Action Recognition [69.9658249941149]
Action recognition in videos has achieved great success recently, but it remains a challenging task due to the massive computational cost.
We propose a general dynamic inference idea to improve inference efficiency by leveraging the variation in the distinguishability of different videos.
arXiv Detail & Related papers (2020-02-09T11:09:56Z)