Leveraging RGB Images for Pre-Training of Event-Based Hand Pose Estimation
- URL: http://arxiv.org/abs/2509.16949v1
- Date: Sun, 21 Sep 2025 07:07:49 GMT
- Title: Leveraging RGB Images for Pre-Training of Event-Based Hand Pose Estimation
- Authors: Ruicong Liu, Takehiko Ohkawa, Tze Ho Elden Tse, Mingfang Zhang, Angela Yao, Yoichi Sato
- Abstract summary: RPEP is the first pre-training method for event-based 3D hand pose estimation using labeled RGB images and unpaired, unlabeled event data. Our model significantly outperforms state-of-the-art methods on real event data, achieving up to 24% improvement on EvRealHands.
- Score: 64.8814078041756
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents RPEP, the first pre-training method for event-based 3D hand pose estimation using labeled RGB images and unpaired, unlabeled event data. Event data offer significant benefits such as high temporal resolution and low latency, but their application to hand pose estimation is still limited by the scarcity of labeled training data. To address this, we repurpose real RGB datasets to train event-based estimators. This is done by constructing pseudo-event-RGB pairs, where event data is generated and aligned with the ground-truth poses of RGB images. Unfortunately, existing pseudo-event generation techniques assume stationary objects, thus struggling to handle non-stationary, dynamically moving hands. To overcome this, RPEP introduces a novel generation strategy that decomposes hand movements into smaller, step-by-step motions. This decomposition allows our method to capture temporal changes in articulation, constructing more realistic event data for a moving hand. Additionally, RPEP imposes a motion reversal constraint, regularizing event generation using reversed motion. Extensive experiments show that our pre-trained model significantly outperforms state-of-the-art methods on real event data, achieving up to 24% improvement on EvRealHands. Moreover, it delivers strong performance with minimal labeled samples for fine-tuning, making it well-suited for practical deployment.
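The abstract describes two concrete mechanisms: decomposing a hand motion into small step-by-step sub-motions when generating pseudo-events, and a motion-reversal constraint that regularizes generation using the reversed motion. The sketch below illustrates both ideas; it is not the authors' code, and `render_hand` (a hypothetical pose-to-grayscale renderer), the linear pose interpolation, and the fixed contrast threshold are all simplifying assumptions.

```python
# Minimal sketch of the two ideas in the abstract (not the RPEP release).
# `render_hand` is a hypothetical pose -> grayscale image renderer; linear
# pose interpolation and a fixed contrast threshold are simplifications.
import torch

def pseudo_events(render_hand, pose_a, pose_b, num_steps=8, threshold=0.1):
    """Accumulate per-step log-intensity changes while stepping pose_a -> pose_b."""
    events = []
    prev_log = torch.log(render_hand(pose_a) + 1e-6)
    for k in range(1, num_steps + 1):
        alpha = k / num_steps
        pose_k = (1 - alpha) * pose_a + alpha * pose_b   # one small sub-motion
        cur_log = torch.log(render_hand(pose_k) + 1e-6)
        diff = cur_log - prev_log
        # An event fires where the log-intensity change exceeds the threshold.
        events.append(torch.sign(diff) * (diff.abs() > threshold).float())
        prev_log = cur_log
    return torch.stack(events)                            # (num_steps, H, W)

def reversal_loss(render_hand, pose_a, pose_b):
    """Motion-reversal constraint: reversed motion should yield time-flipped,
    polarity-flipped events, so the sum below should be near zero."""
    fwd = pseudo_events(render_hand, pose_a, pose_b)
    bwd = pseudo_events(render_hand, pose_b, pose_a)
    return (fwd + bwd.flip(0)).abs().mean()

# Toy usage: a dummy "renderer" drawing a bright square whose column follows
# a 1-D pose, just to show the functions execute end to end.
H = W = 64
def toy_render(pose):
    img = torch.zeros(H, W)
    col = int(pose.clamp(0, W - 9).item())
    img[28:36, col:col + 8] = 1.0
    return img

print(reversal_loss(toy_render, torch.tensor(5.0), torch.tensor(40.0)))
```

Stepping through intermediate poses is what lets the generated events reflect changing articulation, rather than treating the hand as a rigid object jumping between two frames.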
Related papers
- PEPR: Privileged Event-based Predictive Regularization for Domain Generalization [19.185122873391517]
We propose a cross-modal framework under the learning using privileged information (LUPI) paradigm for training a robust, single-modality RGB model.
We leverage event cameras as a source of privileged information, available only during training.
We train the RGB encoder with PEPR to predict event-based latent features, distilling robustness without sacrificing semantic richness.
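Read literally, this is feature-level distillation under LUPI: an event branch exists only at training time, and the RGB encoder learns to predict its latent features alongside the main task. The sketch below is one plausible reading of that setup; the placeholder encoders, the cosine objective, and the 0.5 weight are assumptions, not PEPR's actual design.

```python
# Hedged sketch of LUPI-style feature prediction (placeholder architecture,
# not the PEPR implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(in_ch, dim=128):
    return nn.Sequential(nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))

rgb_encoder = make_encoder(3)        # used at both train and test time
event_encoder = make_encoder(5)      # privileged: training only, frozen here
for p in event_encoder.parameters():
    p.requires_grad = False

def pepr_style_loss(rgb, events, labels, head):
    z_rgb = rgb_encoder(rgb)
    with torch.no_grad():
        z_evt = event_encoder(events)                        # privileged latent target
    task = F.cross_entropy(head(z_rgb), labels)              # main RGB task
    predict = 1 - F.cosine_similarity(z_rgb, z_evt).mean()   # predict event features
    return task + 0.5 * predict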
arXiv Detail & Related papers (2026-02-04T14:10:36Z)
- Decoupling Amplitude and Phase Attention in Frequency Domain for RGB-Event based Visual Object Tracking [51.31378940976401]
Existing RGB-Event tracking approaches fail to fully exploit the unique advantages of event cameras.
We propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high-frequency information from the event modality.
Experiments on three widely used RGB-Event tracking benchmark datasets, including FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method.
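The summary says fusion happens early, in the frequency domain, with amplitude and phase decoupled. A toy version of that decoupling is sketched below; the fixed blend gates stand in for the paper's learned attention, and the naive phase averaging (which ignores wrap-around) is only illustrative.

```python
# Toy frequency-domain fusion (illustrative only; fixed gates replace the
# paper's learned amplitude/phase attention).
import torch

def freq_fuse(f_rgb, f_evt, amp_gate=0.5, phase_gate=0.5):
    """f_rgb, f_evt: (B, C, H, W) feature maps from the two modalities."""
    R, E = torch.fft.rfft2(f_rgb), torch.fft.rfft2(f_evt)
    # Decouple each spectrum into amplitude and phase, then blend separately.
    amp = (1 - amp_gate) * R.abs() + amp_gate * E.abs()
    pha = (1 - phase_gate) * R.angle() + phase_gate * E.angle()  # naive: ignores wrap-around
    return torch.fft.irfft2(torch.polar(amp, pha), s=f_rgb.shape[-2:])

fused = freq_fuse(torch.randn(2, 8, 32, 32), torch.randn(2, 8, 32, 32))
```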
arXiv Detail & Related papers (2026-01-03T01:10:17Z)
- Frequency-Adaptive Low-Latency Object Detection Using Events and Frames [23.786369609995013]
Fusing Events and RGB images for object detection leverages the robustness of Event cameras in adverse environments.
Two critical mismatches remain: low-latency Events vs. high-latency RGB frames, and temporally sparse labels in training vs. continuous flow in inference.
We propose the Frequency-Adaptive Low-Latency Object Detector (FAOD).
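For context, the latency mismatch is commonly handled by binning the raw event stream into dense tensors at whatever query rate the detector needs, independent of the RGB frame rate. The sketch below is that generic voxelization step, not the FAOD architecture itself.

```python
# Generic event voxelization (background technique, not FAOD itself):
# bin events into a fixed-size tensor for an arbitrary query window [t0, t1].
import torch

def voxelize(events, height, width, bins=5, t0=0.0, t1=1.0):
    """events: (N, 4) float tensor with rows (x, y, t, polarity in {-1, +1})."""
    x, y, t, p = events.unbind(dim=1)
    tb = ((t - t0) / (t1 - t0) * bins).clamp(0, bins - 1e-6).long()  # time bin
    grid = torch.zeros(bins, height, width)
    grid.index_put_((tb, y.long(), x.long()), p, accumulate=True)    # sum polarities
    return grid  # (bins, H, W), consumable by a frame-based detector
```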
arXiv Detail & Related papers (2024-12-05T13:23:06Z)
- Implicit Event-RGBD Neural SLAM [54.74363487009845]
Implicit neural SLAM has achieved remarkable progress recently.
Existing methods face significant challenges in non-ideal scenarios.
We propose EN-SLAM, the first event-RGBD implicit neural SLAM framework.
arXiv Detail & Related papers (2023-11-18T08:48:58Z)
- EvDNeRF: Reconstructing Event Data with Dynamic Neural Radiance Fields [80.94515892378053]
EvDNeRF is a pipeline for generating event data and training an event-based dynamic NeRF.
NeRFs offer geometric-based learnable rendering, but prior work with events has only considered reconstruction of static scenes.
We show that by training on varied batch sizes of events, we can improve test-time predictions of events at fine time resolutions.
arXiv Detail & Related papers (2023-10-03T21:08:41Z)
- Deformable Neural Radiance Fields using RGB and Event Cameras [65.40527279809474]
We develop a novel method to model the deformable neural radiance fields using RGB and event cameras.
The proposed method uses the asynchronous stream of events and sparse RGB frames.
Experiments conducted on both realistically rendered graphics and real-world datasets demonstrate a significant benefit of the proposed method.
arXiv Detail & Related papers (2023-09-15T14:19:36Z)
- EventTransAct: A video transformer-based framework for Event-camera based action recognition [52.537021302246664]
Event cameras offer new opportunities for action recognition compared to standard RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
To better adapt the VTN to the sparse, fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
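The summary names the loss but not its exact form. Below is a standard NT-Xent contrastive loss over two augmented views of the same event clips, as one plausible reading of $\mathcal{L}_{EC}$ rather than the paper's definition.

```python
# Standard NT-Xent contrastive loss (one plausible form; the summary does not
# specify how L_EC is actually defined).
import torch
import torch.nn.functional as F

def event_contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: (B, D) embeddings of two augmentations of the same event clips."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)            # (2B, D)
    sim = z @ z.t() / temperature                          # cosine similarity logits
    sim.fill_diagonal_(float('-inf'))                      # exclude self-pairs
    b = z1.size(0)
    targets = torch.cat([torch.arange(b) + b, torch.arange(b)])  # index of positive
    return F.cross_entropy(sim, targets)
```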
arXiv Detail & Related papers (2023-08-25T23:51:07Z)
- Event-based Image Deblurring with Dynamic Motion Awareness [10.81953574179206]
We introduce the first dataset containing pairs of real blurred RGB images and the corresponding events recorded during the exposure time.
Our results show better overall robustness when using events, with PSNR improvements of up to 1.57 dB on synthetic data and 1.08 dB on real event data.
arXiv Detail & Related papers (2022-08-24T09:39:55Z)
- Lifting Monocular Events to 3D Human Poses [22.699272716854967]
This paper presents the first learning-based method for 3D human pose estimation from a single stream of asynchronous events.
Experiments demonstrate that our method achieves solid accuracy, narrowing the performance gap between standard RGB and event-based vision.
arXiv Detail & Related papers (2021-04-21T16:07:12Z)