Efficient Egocentric Action Recognition with Multimodal Data
- URL: http://arxiv.org/abs/2506.01757v1
- Date: Mon, 02 Jun 2025 15:04:23 GMT
- Title: Efficient Egocentric Action Recognition with Multimodal Data
- Authors: Marco Calzavara, Ard Kastrati, Matteo Macchini, Dushan Vasilevski, Roger Wattenhofer
- Abstract summary: We analyze the impact of sampling frequency across different input modalities on egocentric action recognition performance and CPU usage. Our findings reveal that reducing the sampling rate of RGB frames, when complemented with higher-frequency 3D hand pose input, can preserve high accuracy while significantly lowering CPU demands. This highlights the potential of multimodal input strategies as a viable approach to achieving efficient, real-time EAR on XR devices.
- Score: 19.70664397400233
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing availability of wearable XR devices opens new perspectives for Egocentric Action Recognition (EAR) systems, which can provide deeper human understanding and situation awareness. However, deploying real-time algorithms on these devices can be challenging due to the inherent trade-offs between portability, battery life, and computational resources. In this work, we systematically analyze the impact of sampling frequency across different input modalities - RGB video and 3D hand pose - on egocentric action recognition performance and CPU usage. By exploring a range of configurations, we provide a comprehensive characterization of the trade-offs between accuracy and computational efficiency. Our findings reveal that reducing the sampling rate of RGB frames, when complemented with higher-frequency 3D hand pose input, can preserve high accuracy while significantly lowering CPU demands. Notably, we observe up to a 3x reduction in CPU usage with minimal to no loss in recognition performance. This highlights the potential of multimodal input strategies as a viable approach to achieving efficient, real-time EAR on XR devices.
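The core trade-off described above can be illustrated in code. The following is a minimal, hypothetical sketch (not the authors' implementation): the sampling rates, encoder stand-ins, and late-fusion step are illustrative assumptions, chosen only to show how a pipeline might consume RGB frames at a low rate while keeping 3D hand pose at a higher rate.

```python
import numpy as np

# Hypothetical sampling configuration: low-rate RGB, high-rate 3D hand pose.
# Rates and feature dimensions are illustrative, not taken from the paper.
RGB_FPS = 5          # reduced RGB sampling rate
POSE_HZ = 30         # higher-frequency 3D hand pose stream
WINDOW_S = 2.0       # length of the recognition window in seconds

def sample_stream(stream, native_hz, target_hz):
    """Keep every k-th sample so the stream is consumed at roughly target_hz."""
    step = max(1, int(round(native_hz / target_hz)))
    return stream[::step]

def encode_rgb(frames):
    # Stand-in for a heavy video encoder: one pooled feature per frame.
    return frames.reshape(len(frames), -1).mean(axis=1, keepdims=True)

def encode_pose(poses):
    # Stand-in for a lightweight hand-pose encoder (2 hands x 21 joints x 3 coords).
    return poses.reshape(len(poses), -1)

# Simulated capture over one window: 30 fps RGB and 30 Hz hand pose.
rgb_native = np.random.rand(int(30 * WINDOW_S), 224, 224, 3)
pose_native = np.random.rand(int(30 * WINDOW_S), 2, 21, 3)

rgb = sample_stream(rgb_native, native_hz=30, target_hz=RGB_FPS)
pose = sample_stream(pose_native, native_hz=30, target_hz=POSE_HZ)

# Late fusion: average each modality over the window, then concatenate.
fused = np.concatenate([encode_rgb(rgb).mean(axis=0),
                        encode_pose(pose).mean(axis=0)])

print(f"RGB frames processed: {len(rgb)} (vs. {len(rgb_native)} at full rate)")
print(f"Pose samples processed: {len(pose)}; fused feature dim: {fused.size}")
```

In a sketch like this, processing only a handful of RGB frames per window is where the CPU savings would come from, while the high-rate pose stream compensates for the lost temporal resolution.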
Related papers
- FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation [52.89847760590189]
3D scene understanding is a critical yet challenging task in autonomous driving. Recent methods leverage the range-view representation to improve processing efficiency. We re-design the workflow for range-view-based LiDAR semantic segmentation.
arXiv Detail & Related papers (2025-02-13T12:39:26Z) - GSPR: Multimodal Place Recognition Using 3D Gaussian Splatting for Autonomous Driving [9.023864430027333]
We propose a 3D Gaussian Splatting-based multimodal place recognition network dubbed GSPR. It explicitly combines multi-view RGB images and LiDAR point clouds into a spatio-temporally unified scene representation with the Multimodal Gaussian Splatting. Our method can effectively leverage complementary strengths of both multi-view cameras and LiDAR, achieving SOTA place recognition performance while maintaining solid generalization ability.
arXiv Detail & Related papers (2024-10-01T00:43:45Z) - Cross-Cluster Shifting for Efficient and Effective 3D Object Detection in Autonomous Driving [69.20604395205248]
We present a new 3D point-based detector model, named Shift-SSD, for precise 3D object detection in autonomous driving.
We introduce an intriguing Cross-Cluster Shifting operation to unleash the representation capacity of the point-based detector.
We conduct extensive experiments on the KITTI, Waymo, and nuScenes datasets, and the results demonstrate the state-of-the-art performance of Shift-SSD in both detection accuracy and runtime.
arXiv Detail & Related papers (2024-03-10T10:36:32Z) - Agile gesture recognition for capacitive sensing devices: adapting on-the-job [55.40855017016652]
We demonstrate a hand gesture recognition system that uses signals from capacitive sensors embedded into the etee hand controller.
The controller generates real-time signals from each of the wearer's five fingers.
We use a machine learning technique to analyse the time series signals and identify three features that can represent the five fingers within 500 ms.
arXiv Detail & Related papers (2023-05-12T17:24:02Z) - Efficient Human Vision Inspired Action Recognition using Adaptive Spatiotemporal Sampling [13.427887784558168]
We introduce a novel adaptive vision system for efficient action recognition processing.
Our system pre-scans the global scene context at low resolution and decides to skip or request high-resolution features at salient regions for further processing.
We validate the system on the EPIC-KITCHENS and UCF-101 datasets for action recognition, and show that our proposed approach can greatly speed up inference with a tolerable loss of accuracy compared with state-of-the-art baselines.
arXiv Detail & Related papers (2022-07-12T01:18:58Z) - 3D Adapted Random Forest Vision (3DARFV) for Untangling Heterogeneous-Fabric Exceeding Deep Learning Semantic Segmentation Efficiency at the Utmost Accuracy [1.6020567943077142]
Analyzing 3D images requires many computations, leading to lengthy processing times and large energy consumption.
This paper demonstrates the semantic segmentation capability of a probabilistic decision tree algorithm, 3D Adapted Random Forest Vision (3DARFV), exceeding deep learning algorithm efficiency at the utmost accuracy.
arXiv Detail & Related papers (2022-03-23T15:05:23Z) - FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using in total around 40% of the available hardware resources.
It reduces the classification time by three orders of magnitude, with a small 4.5% impact on accuracy compared to its software, full-precision counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z) - Neural Disparity Refinement for Arbitrary Resolution Stereo [67.55946402652778]
We introduce a novel architecture for neural disparity refinement aimed at facilitating deployment of 3D computer vision on cheap and widespread consumer devices.
Our approach relies on a continuous formulation that enables estimating a refined disparity map at any arbitrary output resolution.
arXiv Detail & Related papers (2021-10-28T18:00:00Z) - Dynamic Resource-aware Corner Detection for Bio-inspired Vision Sensors [0.9988653233188148]
We present an algorithm to detect asynchronous corners from a stream of events in real-time on embedded systems.
The proposed algorithm is capable of selecting the best corner candidate among neighbors and achieves average execution time savings of 59%.
arXiv Detail & Related papers (2020-10-29T12:01:33Z) - YOLOpeds: Efficient Real-Time Single-Shot Pedestrian Detection for Smart Camera Applications [2.588973722689844]
This work addresses the challenge of achieving a good trade-off between accuracy and speed for efficient deployment of deep-learning-based pedestrian detection in smart camera applications.
A computationally efficient architecture based on separable convolutions is introduced, integrating dense connections across layers and multi-scale feature fusion.
Overall, YOLOpeds provides sustained real-time operation of over 30 frames per second with detection rates around 86%, outperforming existing deep learning models.
arXiv Detail & Related papers (2020-07-27T09:50:11Z) - LE-HGR: A Lightweight and Efficient RGB-based Online Gesture Recognition Network for Embedded AR Devices [8.509059894058947]
We propose a lightweight and computationally efficient HGR framework, namely LE-HGR, to enable real-time gesture recognition on embedded devices with low computing power.
We show that the proposed method achieves high accuracy and robustness, reaching high-end performance in a variety of complicated interaction environments.
arXiv Detail & Related papers (2020-01-16T05:23:24Z) - Spatial-Spectral Residual Network for Hyperspectral Image Super-Resolution [82.1739023587565]
We propose a novel spectral-spatial residual network for hyperspectral image super-resolution (SSRNet).
Our method can effectively explore spatial-spectral information by using 3D convolution instead of 2D convolution, which enables the network to better extract potential information.
In each unit, we employ spatial and spectral separable 3D convolution to extract spatial and spectral information, which not only reduces unaffordable memory usage and high computational cost, but also makes the network easier to train.
arXiv Detail & Related papers (2020-01-14T03:34:55Z)