Related papers: Real-Time Multi-Modal Embedded Vision Framework for Object Detection Facial Emotion Recognition and Biometric Identification on Low-Power Edge Platforms

URL: http://arxiv.org/abs/2601.11970v1
Date: Sat, 17 Jan 2026 09:06:47 GMT
Title: Real-Time Multi-Modal Embedded Vision Framework for Object Detection Facial Emotion Recognition and Biometric Identification on Low-Power Edge Platforms
Authors: S. M. Khalid Bin Zahid, Md. Rakibul Hasan Nishat, Abdul Hasib, Md. Rakibul Hasan, Md. Ashiqussalehin, Md. Sahadat Hossen Sajib, A. S. M. Ahsanul Sarkar Akib
Abstract summary: We present a real-time multi-modal vision framework that integrates object detection, owner-specific face recognition, and emotion detection into a unified pipeline deployed on a Raspberry Pi 5 edge platform. Our work demonstrates that context-aware scheduling is the key to unlocking complex multi-modal AI on cost-effective edge hardware.
Score: 0.44219509596259216
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Intelligent surveillance systems often handle perceptual tasks such as object detection, facial recognition, and emotion analysis independently, but they lack a unified, adaptive runtime scheduler that dynamically allocates computational resources based on contextual triggers. This limits their holistic understanding and efficiency on low-power edge devices. To address this, we present a real-time multi-modal vision framework that integrates object detection, owner-specific face recognition, and emotion detection into a unified pipeline deployed on a Raspberry Pi 5 edge platform. The core of our system is an adaptive scheduling mechanism that reduces computational load by 65% compared to continuous processing by selectively activating modules such as YOLOv8n for object detection, a custom FaceNet-based embedding system for facial recognition, and DeepFace's CNN for emotion classification. Experimental results demonstrate the system's efficacy, with the object detection module achieving an Average Precision (AP) of 0.861, facial recognition attaining 88% accuracy, and emotion detection showing strong discriminatory power (AUC up to 0.97 for specific emotions), while operating at 5.6 frames per second. Our work demonstrates that context-aware scheduling is the key to unlocking complex multi-modal AI on cost-effective edge hardware, making intelligent perception more accessible and privacy-preserving.
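The abstract describes a scheduler that activates modules selectively rather than running all of them on every frame. The following is a minimal sketch of that idea, not the authors' implementation. The trigger rules (run face recognition only when a person is detected, run emotion classification only when the face matches the owner) are assumptions made for illustration, as are all names (`schedule_frame`, `FrameResult`, `invocation_savings`, and the stub detectors). Real code would wrap the actual models instead of the stubs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

# Stand-in for an image; a real pipeline would pass a pixel array instead.
Frame = Dict[str, object]


@dataclass
class FrameResult:
    """Hypothetical record of which modules ran on one frame."""
    modules_run: List[str] = field(default_factory=list)
    emotion: str = ""


def schedule_frame(
    frame: Frame,
    detect_objects: Callable[[Frame], List[str]],
    is_owner: Callable[[Frame], bool],
    classify_emotion: Callable[[Frame], str],
) -> FrameResult:
    """Run later modules only when earlier results trigger them.

    The trigger rules here are assumptions, not taken from the paper.
    """
    result = FrameResult()

    # Stage 1: object detection runs on every frame.
    labels = detect_objects(frame)
    result.modules_run.append("object_detection")
    if "person" not in labels:
        return result  # no person, so skip both face stages

    # Stage 2: face recognition runs only when a person is present.
    owner = is_owner(frame)
    result.modules_run.append("face_recognition")
    if not owner:
        return result  # not the owner, so skip emotion analysis

    # Stage 3: emotion classification runs only for the recognized owner.
    result.emotion = classify_emotion(frame)
    result.modules_run.append("emotion_detection")
    return result


def invocation_savings(results: Sequence[FrameResult], n_modules: int = 3) -> float:
    """Fraction of module invocations avoided versus running every module on every frame."""
    ran = sum(len(r.modules_run) for r in results)
    return 1.0 - ran / (n_modules * len(results))


# Stub detectors that read precomputed answers from the frame dictionary.
def stub_detect(frame: Frame) -> List[str]:
    return list(frame["labels"])


def stub_owner(frame: Frame) -> bool:
    return bool(frame["owner"])


def stub_emotion(frame: Frame) -> str:
    return "neutral"


frames = [
    {"labels": ["chair"], "owner": False},   # only stage 1 runs
    {"labels": ["person"], "owner": False},  # stages 1 and 2 run
    {"labels": ["person"], "owner": True},   # all three stages run
]
results = [schedule_frame(f, stub_detect, stub_owner, stub_emotion) for f in frames]
print(f"invocations avoided: {invocation_savings(results):.2f}")
```

The savings figure printed here counts skipped module invocations on three toy frames. It is not comparable to the 65% reduction in computational load reported in the abstract, which depends on the real models' costs and on how often each trigger fires in practice.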

Related papers

Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition [71.5328300638085]
Zero-shot human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. Existing methods, including two-stage methods, tightly couple interaction recognition with a specific detector. We propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR.
arXiv Detail & Related papers (2026-02-16T19:01:31Z)
Deep Learning-Based Real-Time Sequential Facial Expression Analysis Using Geometric Features [1.0742675209112622]
This study presents a novel approach to real-time sequential facial expression recognition using deep learning and geometric features. The proposed method utilizes MediaPipe FaceMesh for rapid and accurate facial landmark detection. The approach demonstrated real-time applicability, processing approximately 165 frames per second on consumer-grade hardware.
arXiv Detail & Related papers (2025-12-05T12:26:31Z)
Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection? [57.000348519630286]
Recent advances in mobile edge computing have made it possible to offload intensive object detection to edge servers equipped with high-accuracy neural networks. This hybrid approach offers a promising solution but introduces a new challenge: deciding when to perform edge detection versus local tracking. For the single-device setting, we propose LTED-Ada, a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection.
arXiv Detail & Related papers (2025-11-25T04:54:51Z)
AutoOEP -- A Multi-modal Framework for Online Exam Proctoring [1.6522310568442877]
This paper introduces AutoOEP (Automated Online Exam Proctoring), a comprehensive, multi-modal framework that leverages computer vision and machine learning to provide effective, automated proctoring. The system utilizes a dual-camera setup to capture both a frontal view of the examinee and a side view of the workspace, minimizing blind spots. The Hand Module employs a fine-tuned YOLOv11 model for detecting prohibited items (e.g., mobile phones, notes) and tracks hand proximity to these objects.
arXiv Detail & Related papers (2025-09-13T16:34:38Z)
Fast-COS: A Fast One-Stage Object Detector Based on Reparameterized Attention Vision Transformer for Autonomous Driving [3.617580194719686]
This paper introduces Fast-COS, a novel single-stage object detection framework crafted specifically for driving scenes. RAViT achieves 81.4% Top-1 accuracy on the ImageNet-1K dataset. It surpasses leading models in efficiency, delivering up to 75.9% faster GPU inference and 1.38 higher throughput on edge devices.
arXiv Detail & Related papers (2025-02-11T09:54:09Z)
Wandering around: A bioinspired approach to visual attention through object motion sensitivity [40.966228784674115]
Active vision enables dynamic visual perception, offering an alternative to static feedforward architectures in computer vision. Event-based cameras, inspired by the mammalian retina, enhance this capability by capturing asynchronous scene changes. To distinguish moving objects while the event-based camera is in motion, the agent requires an object motion segmentation mechanism. This work presents a Convolutional Neural Network bio-inspired attention system for selective attention through object motion sensitivity.
arXiv Detail & Related papers (2025-02-10T18:16:30Z)
Visual Agents as Fast and Slow Thinkers [88.1404921693082]
We introduce FaST, which incorporates the Fast and Slow Thinking mechanism into visual agents. FaST employs a switch adapter to dynamically select between System 1/2 modes. It tackles uncertain and unseen objects by adjusting model confidence and integrating new contextual data.
arXiv Detail & Related papers (2024-08-16T17:44:02Z)
Improving Facial Landmark Detection Accuracy and Efficiency with Knowledge Distillation [4.779050216649159]
This paper introduces a novel approach to address these challenges through the development of a knowledge distillation method. Our goal is to design models capable of accurately locating facial landmarks under varying conditions. This method was successfully implemented and finished 6th out of 165 participants in the IEEE ICME 2024 PAIR competition.
arXiv Detail & Related papers (2024-04-09T05:30:58Z)
Agile gesture recognition for capacitive sensing devices: adapting on-the-job [55.40855017016652]
We demonstrate a hand gesture recognition system that uses signals from capacitive sensors embedded into the etee hand controller. The controller generates real-time signals from each of the wearer's five fingers. We use a machine learning technique to analyse the time series signals and identify three features that can represent 5 fingers within 500 ms.
arXiv Detail & Related papers (2023-05-12T17:24:02Z)
Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities. We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
Identity-Aware Attribute Recognition via Real-Time Distributed Inference in Mobile Edge Clouds [53.07042574352251]
We design novel models for pedestrian attribute recognition with re-ID in an MEC-enabled camera monitoring system. We propose a novel inference framework with a set of distributed modules, by jointly considering the attribute recognition and person re-ID. We then devise a learning-based algorithm for the distributions of the modules of the proposed distributed inference framework.
arXiv Detail & Related papers (2020-08-12T12:03:27Z)
Continuous Emotion Recognition via Deep Convolutional Autoencoder and Support Vector Regressor [70.2226417364135]
It is crucial that the machine should be able to recognize the emotional state of the user with high accuracy. Deep neural networks have been used with great success in recognizing emotions. We present a new model for continuous emotion recognition based on facial expression recognition.
arXiv Detail & Related papers (2020-01-31T17:47:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.