Learning Flow-Guided Registration for RGB-Event Semantic Segmentation
- URL: http://arxiv.org/abs/2505.01548v2
- Date: Thu, 25 Sep 2025 07:06:47 GMT
- Title: Learning Flow-Guided Registration for RGB-Event Semantic Segmentation
- Authors: Zhen Yao, Xiaowen Ying, Zhiyu Zhu, Mooi Choo Chuah
- Abstract summary: Event cameras capture microsecond-level motion cues that complement RGB sensors.
We recast RGB-Event segmentation from fusion to registration.
We propose BRENet, a novel flow-guided bidirectional framework that adaptively matches correspondence between the asymmetric modalities.
- Score: 22.996619370156584
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Event cameras capture microsecond-level motion cues that complement RGB sensors. However, the prevailing paradigm of treating RGB-Event perception as a fusion problem is ill-posed, as it ignores the intrinsic (i) Spatiotemporal and (ii) Modal Misalignment, unlike other RGB-X sensing domains. To tackle these limitations, we recast RGB-Event segmentation from fusion to registration. We propose BRENet, a novel flow-guided bidirectional framework that adaptively matches correspondence between the asymmetric modalities. Specifically, it leverages temporally aligned optical flows as a coarse-grained guide, along with fine-grained event temporal features, to generate precise forward and backward pixel pairings for registration. This pairing mechanism converts the inherent motion lag into terms governed by flow estimation error, bridging modality gaps. Moreover, we introduce Motion-Enhanced Event Tensor (MET), a new representation that transforms sparse event streams into a dense, temporally coherent form. Extensive experiments on four large-scale datasets validate our approach, establishing flow-guided registration as a promising direction for RGB-Event segmentation. Our code is available at: https://github.com/zyaocoder/BRENet.
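As a rough illustration of what "registration" means operationally, the sketch below shows the basic flow-guided warping primitive that such a pairing mechanism builds on: event features are resampled toward the RGB frame along an estimated optical flow field. This is a generic PyTorch sketch, not BRENet's code; the name `flow_warp` and the bilinear-sampling details are assumptions.

```python
# Minimal sketch of flow-guided registration (illustrative, not BRENet's code).
# Event features are warped toward the RGB frame along an estimated flow
# field; a second pass with the reversed flow would give the backward half
# of a bidirectional pairing.
import torch
import torch.nn.functional as F

def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a feature map (B, C, H, W) by a flow field (B, 2, H, W) in pixels."""
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                             # (B, 2, H, W)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                          # (B, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

event_feat = torch.randn(1, 64, 32, 32)    # dummy event features
flow_e2rgb = torch.randn(1, 2, 32, 32)     # dummy event-to-RGB flow
registered = flow_warp(event_feat, flow_e2rgb)
```

With this primitive, the residual misalignment is governed by the flow-estimation error rather than by the raw motion lag between the modalities, which is the trade the abstract describes.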
Related papers
- PEPR: Privileged Event-based Predictive Regularization for Domain Generalization [19.185122873391517]
We propose a cross-modal framework under the learning using privileged information (LUPI) paradigm for training a robust, single-modality RGB model.
We leverage event cameras as a source of privileged information, available only during training.
We train the RGB encoder with PEPR to predict event-based latent features, distilling robustness without sacrificing semantic richness.
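A minimal sketch of this LUPI-style training, assuming a plain MSE distillation target; the encoders, predictor head, and loss weight below are illustrative placeholders, not PEPR's actual architecture.

```python
# Hedged sketch of privileged-information distillation: the event branch
# exists only at training time, and the RGB encoder learns to predict its
# latent features. All shapes and the loss weight are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

rgb_encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
event_encoder = nn.Sequential(nn.Conv2d(5, 64, 3, padding=1), nn.ReLU())  # privileged
predictor = nn.Conv2d(64, 64, 1)  # maps RGB latents to predicted event latents

def pepr_style_loss(rgb, events, task_loss, lam=0.5):
    rgb_feat = rgb_encoder(rgb)
    with torch.no_grad():                  # privileged target, no gradient needed
        event_feat = event_encoder(events)
    distill = F.mse_loss(predictor(rgb_feat), event_feat)
    return task_loss + lam * distill       # at inference only rgb_encoder runs
```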
arXiv Detail & Related papers (2026-02-04T14:10:36Z) - Decoupling Amplitude and Phase Attention in Frequency Domain for RGB-Event based Visual Object Tracking [51.31378940976401]
Existing RGB-Event tracking approaches fail to fully exploit the unique advantages of event cameras.
We propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high-frequency information from the event modality.
Experiments on three widely used RGB-Event tracking benchmark datasets, including FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method.
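A hedged sketch of what frequency-domain early fusion can look like with amplitude and phase treated separately; the attention heads and the linear mixing below are placeholder assumptions, not the paper's exact design.

```python
# Illustrative amplitude/phase-decoupled fusion in the frequency domain.
import torch
import torch.nn as nn

amp_attn = nn.Conv2d(2, 1, 1)   # assumed: weighs the two modalities' amplitudes
pha_attn = nn.Conv2d(2, 1, 1)   # assumed: weighs the two modalities' phases

def freq_fuse(rgb: torch.Tensor, evt: torch.Tensor) -> torch.Tensor:
    """Fuse two (B, 1, H, W) maps by attending over amplitude and phase."""
    Fr, Fe = torch.fft.fft2(rgb), torch.fft.fft2(evt)
    a = torch.sigmoid(amp_attn(torch.cat([Fr.abs(), Fe.abs()], dim=1)))
    p = torch.sigmoid(pha_attn(torch.cat([Fr.angle(), Fe.angle()], dim=1)))
    fused_amp = a * Fr.abs() + (1 - a) * Fe.abs()
    fused_pha = p * Fr.angle() + (1 - p) * Fe.angle()  # naive mix; ignores phase wrap
    return torch.fft.ifft2(torch.polar(fused_amp, fused_pha)).real
```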
arXiv Detail & Related papers (2026-01-03T01:10:17Z) - SwiTrack: Tri-State Switch for Cross-Modal Object Tracking [74.15663758681849]
Cross-modal object tracking (CMOT) is an emerging task that maintains target consistency while the video stream switches between different modalities.
We propose SwiTrack, a novel state-switching framework that redefines CMOT through the deployment of three specialized streams.
arXiv Detail & Related papers (2025-11-20T10:52:54Z) - Leveraging RGB Images for Pre-Training of Event-Based Hand Pose Estimation [64.8814078041756]
RPEP is the first pre-training method for event-based 3D hand pose estimation using labeled RGB images and unpaired, unlabeled event data.
Our model significantly outperforms state-of-the-art methods on real event data, achieving up to 24% improvement on EvRealHands.
arXiv Detail & Related papers (2025-09-21T07:07:49Z) - Learning Frequency and Memory-Aware Prompts for Multi-Modal Object Tracking [74.15663758681849]
We present Learning Frequency and Memory-Aware Prompts, a dual-adapter framework that injects lightweight prompts into a frozen RGB tracker.
A frequency-guided visual adapter adaptively transfers complementary cues across modalities.
A multilevel memory adapter with short, long, and permanent memory stores, updates, and retrieves reliable temporal context.
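A minimal sketch of the prompt-injection pattern with a frozen backbone; the adapter width and the additive injection point are assumptions for illustration, and the memory adapter is omitted.

```python
# Lightweight prompts injected into a frozen RGB tracker layer (sketch).
import torch
import torch.nn as nn

backbone_layer = nn.Conv2d(64, 64, 3, padding=1)
for p in backbone_layer.parameters():
    p.requires_grad = False                 # the RGB tracker stays frozen

adapter = nn.Sequential(nn.Conv2d(64, 8, 1), nn.ReLU(), nn.Conv2d(8, 64, 1))

def prompted_forward(rgb_feat, aux_feat):
    prompt = adapter(aux_feat)                # trainable prompt from the other modality
    return backbone_layer(rgb_feat + prompt)  # only the adapter receives gradients
```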
arXiv Detail & Related papers (2025-06-30T15:38:26Z) - Spatially-guided Temporal Aggregation for Robust Event-RGB Optical Flow Estimation [47.75348821902489]
Current optical flow methods exploit the stable appearance of frame (or RGB) data to establish robust correspondences across time.
Event cameras, on the other hand, provide high-temporal-resolution motion cues and excel in challenging scenarios.
This study introduces a novel approach that uses a spatially dense modality to guide the aggregation of the temporally dense event modality.
arXiv Detail & Related papers (2025-01-01T13:40:09Z) - Frequency-Adaptive Low-Latency Object Detection Using Events and Frames [23.786369609995013]
Fusing Events and RGB images for object detection leverages the robustness of Event cameras in adverse environments.
Two critical mismatches arise: low-latency Events vs. high-latency RGB frames, and temporally sparse labels in training vs. continuous flow in inference.
We propose the Frequency-Adaptive Low-Latency Object Detector (FAOD).
arXiv Detail & Related papers (2024-12-05T13:23:06Z) - Dynamic Subframe Splitting and Spatio-Temporal Motion Entangled Sparse Attention for RGB-E Tracking [32.86991031493605]
Event-based bionic cameras capture dynamic scenes with high temporal resolution and high dynamic range.
We propose a dynamic event subframe splitting strategy to split the event stream into more fine-grained event clusters.
Based on this, we design an event-based sparse attention mechanism to enhance the interaction of event features in temporal and spatial dimensions.
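A toy sketch of count-based subframe splitting; the paper's strategy is dynamic, so the fixed event-count threshold here is an assumed simplification.

```python
# Split a time-sorted event stream into fine-grained subframes (sketch).
import numpy as np

def split_subframes(events: np.ndarray, events_per_subframe: int = 5000):
    """events: (N, 4) array of (x, y, t, polarity), sorted by t."""
    return [events[i:i + events_per_subframe]
            for i in range(0, len(events), events_per_subframe)]
```

The sparse attention would then operate within and across these clusters rather than over one coarse accumulation frame.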
arXiv Detail & Related papers (2024-09-26T06:12:08Z) - MambaPupil: Bidirectional Selective Recurrent model for Event-based Eye tracking [50.26836546224782]
Event-based eye tracking has shown great promise thanks to its high temporal resolution and low redundancy.
The diversity and abruptness of eye movement patterns, including blinking, fixating, saccades, and smooth pursuit, pose significant challenges for eye localization.
This paper proposes a bidirectional long-term sequence modeling and time-varying state selection mechanism to fully utilize contextual temporal information.
arXiv Detail & Related papers (2024-04-18T11:09:25Z) - Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow [17.23190429955172]
A single RGB camera or LiDAR is the mainstream sensor for the challenging scene flow task.
Existing methods adopt a fusion strategy to directly fuse the cross-modal complementary knowledge in motion space.
We propose a novel hierarchical visual-motion fusion framework for scene flow.
arXiv Detail & Related papers (2024-03-12T09:15:19Z) - Revisiting Event-based Video Frame Interpolation [49.27404719898305]
Dynamic vision sensors or event cameras provide rich complementary information for video frame interpolation.
Estimating optical flow from events is arguably more difficult than from RGB information.
We propose a divide-and-conquer strategy in which event-based intermediate frame synthesis happens incrementally in multiple simplified stages.
arXiv Detail & Related papers (2023-07-24T06:51:07Z) - Learning Spatial-Temporal Implicit Neural Representations for Event-Guided Video Super-Resolution [9.431635577890745]
Event cameras sense the intensity changes asynchronously and produce event streams with high dynamic range and low latency.
This has inspired research endeavors utilizing events to guide the challenging video super-resolution (VSR) task.
We make the first attempt to address the novel problem of achieving VSR at random scales by taking advantage of the high temporal resolution property of events.
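One generic way to realize decoding at arbitrary scales is a coordinate-conditioned implicit decoder, sketched below; dimensions are assumed, and this is the standard implicit-neural-representation pattern rather than the paper's exact network.

```python
# A coordinate MLP maps continuous (x, y, t) queries plus fused RGB-event
# features to colors, so the output can be sampled on any grid (sketch).
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    def __init__(self, feat_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))           # predicted RGB

    def forward(self, feat, coords):
        # feat: (B, N, feat_dim) sampled features; coords: (B, N, 3) in [-1, 1]
        return self.mlp(torch.cat([feat, coords], dim=-1))

decoder = ImplicitDecoder()
rgb = decoder(torch.randn(2, 1024, 64), torch.rand(2, 1024, 3) * 2 - 1)
```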
arXiv Detail & Related papers (2023-03-24T02:42:16Z) - Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture brightness change of every pixel in an asynchronous manner.
Event streams are divided into grids in the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as 3D tensor representation.
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatial-temporal correlation between event pillars.
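A sketch of the pillar-style tensorization described above: events binned into an x-y-t grid separately per polarity. The DAVIS346-like resolution and bin count are assumptions.

```python
# Accumulate events into a dense (polarity, T, H, W) tensor (sketch).
import numpy as np

def events_to_pillars(events: np.ndarray, H: int = 260, W: int = 346, T: int = 10):
    """events: (N, 4) array of (x, y, t, polarity in {0, 1})."""
    vol = np.zeros((2, T, H, W), dtype=np.float32)
    t = events[:, 2]
    t_idx = ((t - t.min()) / (t.ptp() + 1e-9) * (T - 1)).astype(int)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    p = events[:, 3].astype(int)
    np.add.at(vol, (p, t_idx, y, x), 1.0)   # per-bin event counts
    return vol
```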
arXiv Detail & Related papers (2023-03-17T12:12:41Z) - CIR-Net: Cross-modality Interaction and Refinement for RGB-D Salient Object Detection [144.66411561224507]
We present a convolutional neural network (CNN) model, named CIR-Net, based on the novel cross-modality interaction and refinement.
Our network outperforms the state-of-the-art saliency detectors both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-10-06T11:59:19Z) - RGB-Event Fusion for Moving Object Detection in Autonomous Driving [3.5397758597664306]
Moving Object Detection (MOD) is a critical vision task for successfully achieving safe autonomous driving.
Recent advances in sensor technologies, especially the Event camera, can naturally complement the conventional camera approach to better model moving objects.
We propose RENet, a novel RGB-Event fusion Network, that jointly exploits the two complementary modalities to achieve more robust MOD.
arXiv Detail & Related papers (2022-09-17T12:59:08Z) - Event Transformer [43.193463048148374]
Event cameras' low power consumption and ability to capture brightness changes at microsecond resolution make them attractive for various computer vision tasks.
Existing event representation methods typically convert events into frames, voxel grids, or spikes for deep neural networks (DNNs).
This work introduces a novel token-based event representation, where each event is considered a fundamental processing unit termed an event-token.
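The token view can be sketched in a few lines: each raw event is embedded from its (x, y, t, polarity) tuple into one token. The embedding width is an assumption.

```python
# One token per event, ready for a transformer (sketch).
import torch
import torch.nn as nn

embed = nn.Linear(4, 128)                  # assumed token width

events = torch.rand(1, 4096, 4)            # (batch, num_events, [x, y, t, p])
tokens = embed(events)                     # (1, 4096, 128)
```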
arXiv Detail & Related papers (2022-04-11T15:05:06Z) - ProgressiveMotionSeg: Mutually Reinforced Framework for Event-Based Motion Segmentation [101.19290845597918]
This paper presents a Motion Estimation (ME) module and an Event Denoising (ED) module jointly optimized in a mutually reinforced manner.
Taking temporal correlation as guidance, the ED module calculates the confidence that each event belongs to real activity events, and transmits it to the ME module to update the energy function of motion segmentation for noise suppression.
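A simplified sketch of such a temporal-correlation confidence: events supported by space-time neighbors score high, isolated events score low. The radii and the saturation constant are assumptions, and the O(N^2) loop is for clarity only.

```python
# Density-based per-event confidence in the spirit of the ED module (sketch).
import numpy as np

def event_confidence(events: np.ndarray, r_xy: float = 2.0, r_t: float = 2e-3):
    """events: (N, 3) array of (x, y, t); returns confidences in [0, 1]."""
    conf = np.zeros(len(events))
    for i, (x, y, t) in enumerate(events):
        near = (np.abs(events[:, 0] - x) <= r_xy) \
             & (np.abs(events[:, 1] - y) <= r_xy) \
             & (np.abs(events[:, 2] - t) <= r_t)
        conf[i] = min(near.sum() - 1, 8) / 8.0   # saturating neighbor count
    return conf
```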
arXiv Detail & Related papers (2022-03-22T13:40:26Z) - Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based
Motion Recognition [62.46544616232238]
Previous motion recognition methods have achieved promising performance through the tightly coupled multimodal spatiotemporal representation.
We propose to decouple and recouple the spatiotemporal representation for RGB-D-based motion recognition.
arXiv Detail & Related papers (2021-12-16T18:59:47Z) - Bi-directional Cross-Modality Feature Propagation with
Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient Cross-modality Guided Encoder to not only effectively recalibrate RGB feature responses, but also to distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternately.
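A minimal sketch of gated cross-modality recalibration in the spirit of a separation-and-aggregation gate; the gate architecture and the final aggregation step are assumptions, not the paper's exact SA-Gate design.

```python
# Depth features produce a per-pixel gate that reweights RGB responses (sketch).
import torch
import torch.nn as nn

gate = nn.Sequential(nn.Conv2d(128, 64, 1), nn.Sigmoid())

def recalibrate(rgb_feat, depth_feat):
    g = gate(torch.cat([rgb_feat, depth_feat], dim=1))  # (B, 64, H, W)
    return rgb_feat * g + depth_feat                    # recalibrate, then aggregate
```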
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.