Event-based Facial Keypoint Alignment via Cross-Modal Fusion Attention and Self-Supervised Multi-Event Representation Learning
- URL: http://arxiv.org/abs/2509.24968v1
- Date: Mon, 29 Sep 2025 16:00:50 GMT
- Title: Event-based Facial Keypoint Alignment via Cross-Modal Fusion Attention and Self-Supervised Multi-Event Representation Learning
- Authors: Donghwa Kang, Junho Kim, Dongwoo Kang,
- Abstract summary: Event cameras offer unique advantages for facial keypoint alignment under challenging conditions.<n>We propose a novel framework based on cross-modal fusion attention (CMFA) and self-supervised multi-event representation learning (SSMER) for event-based facial keypoint alignment.
- Score: 16.170645576584487
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Event cameras offer unique advantages for facial keypoint alignment under challenging conditions, such as low light and rapid motion, due to their high temporal resolution and robustness to varying illumination. However, existing RGB facial keypoint alignment methods do not perform well on event data, and training solely on event data often leads to suboptimal performance because of its limited spatial information. Moreover, the lack of comprehensive labeled event datasets further hinders progress in this area. To address these issues, we propose a novel framework based on cross-modal fusion attention (CMFA) and self-supervised multi-event representation learning (SSMER) for event-based facial keypoint alignment. Our framework employs CMFA to integrate corresponding RGB data, guiding the model to extract robust facial features from event input images. In parallel, SSMER enables effective feature learning from unlabeled event data, overcoming spatial limitations. Extensive experiments on our real-event E-SIE dataset and a synthetic-event version of the public WFLW-V benchmark show that our approach consistently surpasses state-of-the-art methods across multiple evaluation metrics.
Related papers
- EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition [54.55914886780534]
Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion.<n>We introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR.<n>EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle-mounted setups to comprehensively capture real-world challenges across diverse viewpoints, weather conditions, and lighting scenarios.
arXiv Detail & Related papers (2026-02-13T13:25:05Z) - Decoupling Amplitude and Phase Attention in Frequency Domain for RGB-Event based Visual Object Tracking [51.31378940976401]
Existing RGB-Event tracking approaches fail to fully exploit the unique advantages of event cameras.<n>We propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high-frequency information from the event modality.<n>Experiments on three widely used RGB-Event tracking benchmark datasets, including FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method.
arXiv Detail & Related papers (2026-01-03T01:10:17Z) - Focus Through Motion: RGB-Event Collaborative Token Sparsification for Efficient Object Detection [56.88160531995454]
Existing RGB-Event detection methods process the low-information regions of both modalities uniformly during feature extraction and fusion.<n>We propose FocusMamba, which performs adaptive collaborative sparsification of multimodal features.<n>Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that the proposed method achieves superior performance in both accuracy and efficiency.
arXiv Detail & Related papers (2025-09-04T04:18:46Z) - STaRFormer: Semi-Supervised Task-Informed Representation Learning via Dynamic Attention-Based Regional Masking for Sequential Data [4.351581973358463]
Transformer-based approach, STaRFormer, serves as a universal framework for sequential modeling.<n> STaRFormer employs a novel, dynamic attention-based regional masking scheme combined with semi-supervised contrastive learning to enhance task-specific latent representations.
arXiv Detail & Related papers (2025-04-14T11:03:19Z) - Spatially-guided Temporal Aggregation for Robust Event-RGB Optical Flow Estimation [47.75348821902489]
Current optical flow methods exploit the stable appearance of frame (or RGB) data to establish robust correspondences across time.<n>Event cameras, on the other hand, provide high-temporal-resolution motion cues and excel in challenging scenarios.<n>This study introduces a novel approach that uses a spatially dense modality to guide the aggregation of the temporally dense event modality.
arXiv Detail & Related papers (2025-01-01T13:40:09Z) - Dynamic Subframe Splitting and Spatio-Temporal Motion Entangled Sparse Attention for RGB-E Tracking [32.86991031493605]
Event-based bionic camera captures dynamic scenes with high temporal resolution and high dynamic range.
We propose a dynamic event subframe splitting strategy to split the event stream into more fine-grained event clusters.
Based on this, we design an event-based sparse attention mechanism to enhance the interaction of event features in temporal and spatial dimensions.
arXiv Detail & Related papers (2024-09-26T06:12:08Z) - EventZoom: A Progressive Approach to Event-Based Data Augmentation for Enhanced Neuromorphic Vision [9.447299017563841]
Dynamic Vision Sensors (DVS) capture event data with high temporal resolution and low power consumption.<n>Event data augmentation serve as an essential method for overcoming the limitation of scale and diversity in event datasets.
arXiv Detail & Related papers (2024-05-29T08:39:31Z) - Implicit Event-RGBD Neural SLAM [54.74363487009845]
Implicit neural SLAM has achieved remarkable progress recently.
Existing methods face significant challenges in non-ideal scenarios.
We propose EN-SLAM, the first event-RGBD implicit neural SLAM framework.
arXiv Detail & Related papers (2023-11-18T08:48:58Z) - Generalizing Event-Based Motion Deblurring in Real-World Scenarios [62.995994797897424]
Event-based motion deblurring has shown promising results by exploiting low-latency events.
We propose a scale-aware network that allows flexible input spatial scales and enables learning from different temporal scales of motion blur.
A two-stage self-supervised learning scheme is then developed to fit real-world data distribution.
arXiv Detail & Related papers (2023-08-11T04:27:29Z) - A Differentiable Recurrent Surface for Asynchronous Event-Based Data [19.605628378366667]
We propose Matrix-LSTM, a grid of Long Short-Term Memory (LSTM) cells that efficiently process events and learn end-to-end task-dependent event-surfaces.
Compared to existing reconstruction approaches, our learned event-surface shows good flexibility and on optical flow estimation.
It improves the state-of-the-art of event-based object classification on the N-Cars dataset.
arXiv Detail & Related papers (2020-01-10T14:09:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.