EMatch: A Unified Framework for Event-based Optical Flow and Stereo Matching
- URL: http://arxiv.org/abs/2407.21735v2
- Date: Fri, 22 Nov 2024 03:58:23 GMT
- Title: EMatch: A Unified Framework for Event-based Optical Flow and Stereo Matching
- Authors: Pengjie Zhang, Lin Zhu, Xiao Wang, Lizhi Wang, Wanxuan Lu, Hua Huang
- Abstract summary: Event cameras have shown promise in vision applications like optical flow estimation and stereo matching.
We reformulate event-based flow estimation and stereo matching as a unified dense correspondence matching problem.
Our model can effectively handle both optical flow and stereo estimation, achieving state-of-the-art performance on both tasks.
- Score: 23.54420911697424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event cameras have shown promise in vision applications like optical flow estimation and stereo matching, with many specialized architectures leveraging the asynchronous and sparse nature of event data. However, existing works focus only on event data within the confines of task-specific domains, overlooking how tasks across the temporal and spatial domains can reinforce each other. In this paper, we reformulate event-based flow estimation and stereo matching as a unified dense correspondence matching problem, enabling us to solve both tasks within a single model by directly matching features in a shared representation space. Specifically, our method utilizes a Temporal Recurrent Network to aggregate event features across temporal or spatial domains, and a Spatial Contextual Attention to enhance knowledge transfer across event flows via temporal or spatial interactions. By utilizing a shared feature-similarity module that integrates knowledge from event streams via temporal or spatial interactions, our network performs optical flow estimation from temporal event segment inputs and stereo matching from spatial event segment inputs simultaneously. We demonstrate that our unified model inherently supports multi-task fusion and cross-task transfer. Without retraining for a specific task, our model can effectively handle both optical flow and stereo estimation, achieving state-of-the-art performance on both tasks.
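The reformulation can be made concrete with a small sketch: both tasks reduce to scoring feature similarities in a shared representation space, with stereo searching along the epipolar line and flow searching a local 2D window. The NumPy code below is a minimal illustration under that assumption; the function names, shapes, and winner-take-all readout are ours, not the paper's, and the actual model replaces the raw argmax with recurrent aggregation and contextual attention.

```python
# Minimal sketch: optical flow and stereo as one dense matching problem.
# Assumes features are (C, H, W) arrays already projected into a shared
# representation space; borders wrap via np.roll, which a real model would mask.
import numpy as np

def match_stereo(f_left, f_right, max_disp):
    """Winner-take-all disparity from feature similarities along the x-axis."""
    C, H, W = f_left.shape
    cost = np.full((max_disp, H, W), -np.inf)
    for d in range(max_disp):
        # Similarity between left pixel x and right pixel x - d.
        cost[d, :, d:] = (f_left[:, :, d:] * f_right[:, :, :W - d]).sum(axis=0)
    return cost.argmax(axis=0)                      # (H, W) disparity map

def match_flow(f_t0, f_t1, radius):
    """Winner-take-all 2D flow from feature similarities in a search window."""
    C, H, W = f_t0.shape
    best = np.full((H, W), -np.inf)
    flow = np.zeros((2, H, W))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Align f_t1[y + dy, x + dx] with f_t0[y, x] and score the match.
            sim = (f_t0 * np.roll(f_t1, (-dy, -dx), axis=(1, 2))).sum(axis=0)
            better = sim > best
            best = np.where(better, sim, best)
            flow[0] = np.where(better, dx, flow[0])
            flow[1] = np.where(better, dy, flow[1])
    return flow                                     # (2, H, W): (u, v) per pixel
```

Posed this way, the only difference between the two tasks is the search space, which is what allows a single model to serve both.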
Related papers
- EvRepSL: Event-Stream Representation via Self-Supervised Learning for Event-Based Vision [12.542303392870329]
Event-stream representation is the first step for many computer vision tasks using event cameras.
We introduce a data-driven approach aiming at enhancing the quality of event-stream representations.
arXiv Detail & Related papers (2024-12-10T00:42:54Z)
- Dynamic Subframe Splitting and Spatio-Temporal Motion Entangled Sparse Attention for RGB-E Tracking [32.86991031493605]
Event-based bionic cameras capture dynamic scenes with high temporal resolution and high dynamic range.
We propose a dynamic event subframe splitting strategy to split the event stream into more fine-grained event clusters.
Based on this, we design an event-based sparse attention mechanism to enhance the interaction of event features in temporal and spatial dimensions.
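A hedged sketch of the splitting idea this entry describes: cut the stream into finer clusters wherever activity is dense or the stream goes quiet. The cutting rule and names below are our assumptions, not the authors' algorithm.

```python
def split_into_subframes(timestamps, max_count, max_gap_us):
    """Split a sorted event-timestamp list into subframe index ranges.

    Busy motion fills max_count quickly and yields many short subframes;
    a long quiet gap also closes the current cluster.
    """
    bounds, start = [], 0
    for i in range(1, len(timestamps)):
        if i - start >= max_count or timestamps[i] - timestamps[i - 1] > max_gap_us:
            bounds.append((start, i))
            start = i
    bounds.append((start, len(timestamps)))
    return bounds
```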
arXiv Detail & Related papers (2024-09-26T06:12:08Z)
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video.
Existing methods typically encode audio and visual representation separately without any explicit cross-modal alignment constraint.
We present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE.
arXiv Detail & Related papers (2024-09-12T11:54:25Z)
- RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception [64.80760846124858]
This paper proposes a novel unified representation, RepVF, which harmonizes the representation of various perception tasks.
RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model.
Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks.
arXiv Detail & Related papers (2024-07-15T16:25:07Z)
- Scalable Event-by-event Processing of Neuromorphic Sensory Signals With Deep State-Space Models [2.551844666707809]
Event-based sensors are well suited for real-time processing.
Current methods either collapse events into frames or cannot scale up when processing the event data directly event-by-event.
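As a rough illustration of event-by-event processing with a state-space model, the sketch below applies a linear recurrence per event; the matrices and names are hypothetical, and deep SSMs would additionally discretize the dynamics by each inter-event interval and stack learned layers.

```python
import numpy as np

def ssm_scan(events, A, B, h0):
    """Fold a stream of event feature vectors x_k into a state, one event
    at a time: h_k = A @ h_{k-1} + B @ x_k. No frames are ever formed."""
    h = h0
    for x in events:
        h = A @ h + B @ x
    return h
```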
arXiv Detail & Related papers (2024-04-29T08:50:27Z)
- Fast Window-Based Event Denoising with Spatiotemporal Correlation Enhancement [85.66867277156089]
We propose window-based event denoising, which simultaneously deals with a stack of events.
In the spatial domain, we use maximum a posteriori (MAP) estimation to discriminate real-world events from noise.
Our algorithm can remove event noise effectively and efficiently and improve the performance of downstream tasks.
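The spatiotemporal-correlation intuition behind this entry can be sketched with a simple support-counting filter over one window of events; the paper's actual discriminator is a MAP formulation, so the thresholds and names here are illustrative only.

```python
import numpy as np

def denoise_window(events, r_xy=1, r_t=2000, min_support=2):
    """events: (N, 3) array of (x, y, t) in one window, t in microseconds.

    Keep an event only if enough other events fall inside its space-time
    neighborhood; isolated events are treated as noise. O(N^2) for clarity.
    """
    xs, ys, ts = events[:, 0], events[:, 1], events[:, 2]
    keep = np.zeros(len(events), dtype=bool)
    for i in range(len(events)):
        near = ((np.abs(xs - xs[i]) <= r_xy)
                & (np.abs(ys - ys[i]) <= r_xy)
                & (np.abs(ts - ts[i]) <= r_t))
        keep[i] = near.sum() - 1 >= min_support   # minus the event itself
    return events[keep]
```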
arXiv Detail & Related papers (2024-02-14T15:56:42Z)
- Representation Learning on Event Stream via an Elastic Net-incorporated Tensor Network [1.9515859963221267]
We present a novel representation method which can capture global correlations of all events in the event stream simultaneously.
Our method achieves effective results against state-of-the-art methods in applications such as noise filtering.
arXiv Detail & Related papers (2024-01-16T02:51:47Z)
- Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture the brightness change of every pixel in an asynchronous manner.
Event streams are divided into grids in x-y-t coordinates for both positive and negative polarities, producing a set of pillars as a 3D tensor representation.
Long memory is encoded in the hidden state of adaptive convLSTMs, while short memory is modeled by computing the spatio-temporal correlation between event pillars.
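The pillar construction this entry describes, binning events on an x-y-t grid per polarity, can be sketched directly; the grid resolution and names below are assumptions.

```python
import numpy as np

def events_to_pillars(events, H, W, T, t0, t1):
    """events: (N, 4) rows of (x, y, t, p) with p in {0, 1} and x, y in bounds.

    Returns a (2, T, H, W) count tensor: one x-y-t grid per polarity, i.e. a
    set of pillars usable as the 3D tensor representation described above.
    """
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    p = events[:, 3].astype(int)
    tb = ((events[:, 2] - t0) / (t1 - t0) * T).astype(int).clip(0, T - 1)
    grid = np.zeros((2, T, H, W))
    np.add.at(grid, (p, tb, y, x), 1.0)   # accumulate event counts per cell
    return grid
```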
arXiv Detail & Related papers (2023-03-17T12:12:41Z)
- Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams [19.957857885844838]
Event cameras are neuromorphic vision sensors that record a scene as sparse and asynchronous event streams.
We propose an attention-aware model named Event Voxel Set Transformer (EVSTr) for efficient representation learning on event streams.
Experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity.
arXiv Detail & Related papers (2023-03-07T12:48:02Z)
- Event-Based Frame Interpolation with Ad-hoc Deblurring [68.97825675372354]
We propose a general method for event-based frame interpolation that performs ad-hoc deblurring on input videos.
Our network consistently outperforms state-of-the-art methods on frame interpolation, single-image deblurring, and the joint task of interpolation and deblurring.
Our code and dataset will be made publicly available.
arXiv Detail & Related papers (2023-01-12T18:19:00Z)
- Unifying Event Detection and Captioning as Sequence Generation via Pre-Training [53.613265415703815]
We propose a unified pre-training and fine-tuning framework to enhance the inter-task association between event detection and captioning.
Our model outperforms the state-of-the-art methods, and can be further boosted when pre-trained on extra large-scale video-text data.
arXiv Detail & Related papers (2022-07-18T14:18:13Z)
- Event Transformer [43.193463048148374]
The event camera's low power consumption and ability to capture brightness changes at microsecond resolution make it attractive for various computer vision tasks.
Existing event representation methods typically convert events into frames, voxel grids, or spikes for deep neural networks (DNNs).
This work introduces a novel token-based event representation, where each event is considered a fundamental processing unit termed an event-token.
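A minimal sketch of the event-token idea: each raw event becomes one token embedding that a transformer can consume, with no intermediate frame or grid. The projection and scaling below are our assumptions, not the paper's tokenizer.

```python
import numpy as np

def events_to_tokens(events, W_embed, t_scale):
    """events: (N, 4) rows of (x, y, t, p); W_embed: (4, D) projection.

    Returns (N, D) embeddings, one token per event.
    """
    feats = events.astype(float)          # copy, so the input is untouched
    feats[:, 2] = feats[:, 2] / t_scale   # bring timestamps to a unit-ish range
    return feats @ W_embed
```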
arXiv Detail & Related papers (2022-04-11T15:05:06Z)
- Unsupervised Feature Learning for Event Data: Direct vs Inverse Problem Formulation [53.850686395708905]
Event-based cameras record an asynchronous stream of per-pixel brightness changes.
In this paper, we focus on single-layer architectures for representation learning from event data.
We show improvements of up to 9% in recognition accuracy compared to the state-of-the-art methods.
arXiv Detail & Related papers (2020-09-23T10:40:03Z)