Related papers: Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware

URL: http://arxiv.org/abs/2510.18036v1
Date: Mon, 20 Oct 2025 19:18:22 GMT
Title: Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware
Authors: Stavros Mitsis, Ermos Hadjikyriakos, Humaid Ibrahim, Savvas Neofytou, Shashwat Raman, James Myles, Eiman Kanjo
Abstract summary: Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. This paper presents a hardware-aware emotion recognition system that combines acoustic and linguistic features using a late-fusion architecture optimised for Edge TPU.
Score: 0.4104352271917982
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. Prior work typically relies on powerful hardware, lacks real-time performance, or uses unimodal input. This paper addresses that gap by presenting a hardware-aware emotion recognition system that combines acoustic and linguistic features using a late-fusion architecture optimised for Edge TPU. The design integrates a quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling real-time inference within a 1.8MB memory budget and 21-23ms latency. The pipeline ensures spectrogram alignment between training and deployment using MicroFrontend and MLTK. Evaluation on re-recorded, segmented IEMOCAP samples captured through the Coral Dev Board Micro microphone shows a 6.3% macro F1 improvement over unimodal baselines. This work demonstrates that accurate, real-time multimodal emotion inference is achievable on microcontroller-class edge platforms through task-specific fusion and hardware-guided model design.
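
As a rough illustration of the late-fusion idea described in the abstract, the sketch below concatenates an acoustic feature vector with a keyword embedding and passes the result through a single linear layer with softmax. It is not the authors' implementation: the dimensions, the random weights, the four-class output, and the names `late_fusion`, `AUDIO_DIM`, and `TEXT_DIM` are all assumptions made for illustration, and quantisation and the Edge TPU toolchain are omitted.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(audio_feat, text_feat, weights, bias):
    """Concatenate per-modality features, then apply one linear layer + softmax.

    Each modality is encoded separately (here the vectors are simply given);
    fusion happens only at this final classification step.
    """
    fused = list(audio_feat) + list(text_feat)
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical sizes; the paper's actual dimensions are not given in the abstract.
AUDIO_DIM, TEXT_DIM, NUM_CLASSES = 8, 4, 4

rng = random.Random(0)  # fixed seed so the sketch is reproducible
audio_feat = [rng.uniform(-1.0, 1.0) for _ in range(AUDIO_DIM)]  # stand-in acoustic features
text_feat = [rng.uniform(-1.0, 1.0) for _ in range(TEXT_DIM)]    # stand-in keyword embedding
weights = [
    [rng.uniform(-0.5, 0.5) for _ in range(AUDIO_DIM + TEXT_DIM)]
    for _ in range(NUM_CLASSES)
]
bias = [0.0] * NUM_CLASSES

probs = late_fusion(audio_feat, text_feat, weights, bias)
print(probs)  # one probability per (hypothetical) emotion class
```

Because the two branches only meet at the final layer, either encoder can be frozen or swapped independently, which is consistent with the abstract's mention of frozen keyword embeddings.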

Related papers

LiteVLA-Edge: Quantized On-Device Multimodal Control for Embedded Robotics [0.6119773373677944]
We present LiteVLA-Edge, a deployment-oriented VLA pipeline for fully on-device inference on Jetson Orin-class hardware. Our approach combines supervised image-to-action fine-tuning in FP32 with post-training 4-bit GGUF quantization and GPU-accelerated inference. Under our configuration, LiteVLA-Edge achieves a mean end-to-end runtime of 150.5 ms (approximately 6.6 Hz) while operating entirely offline.
arXiv Detail & Related papers (2026-03-03T03:20:52Z)
Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping [61.459927600301654]
Multi-condition control is bottlenecked by the conventional "concatenate-and-attend" strategy. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. We propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies.
arXiv Detail & Related papers (2026-02-06T16:39:10Z)
Vision-Language Models on the Edge for Real-Time Robotic Perception [0.22940141855172028]
Edge intelligence within 6G, particularly Open RAN and Multi-access Edge Computing, offers a pathway to address these challenges. This work investigates the deployment of Vision-Language Models on ORAN/MEC infrastructure using the Unitree G1 humanoid robot as an embodied testbed. Our results show that edge deployment preserves near-cloud accuracy while reducing end-to-end latency by 5%.
arXiv Detail & Related papers (2026-01-21T12:09:48Z)
Generative AI for Video Translation: A Scalable Architecture for Multilingual Video Conferencing [0.21748200848556343]
Real-time deployment of cascaded generative AI pipelines for applications like video translation is constrained by significant system-level challenges. This paper proposes and evaluates a practical system-level framework designed to mitigate these critical bottlenecks. The proposed architecture incorporates a turn-taking mechanism to reduce computational complexity from quadratic to linear in multi-user scenarios.
arXiv Detail & Related papers (2025-12-15T21:21:09Z)
Semantics and Content Matter: Towards Multi-Prior Hierarchical Mamba for Image Deraining [95.00432497331583]
Multi-Prior Hierarchical Mamba (MPHM) network for image deraining. MPHM integrates macro-semantic textual priors (CLIP) for task-level semantic guidance and micro-structural visual priors (DINOv2) for scene-aware structural information. Experiments demonstrate MPHM's state-of-the-art performance, achieving a 0.57 dB PSNR gain on the Rain200H dataset.
arXiv Detail & Related papers (2025-11-17T08:08:59Z)
Joint Learning using Mixture-of-Expert-Based Representation for Enhanced Speech Generation and Robust Emotion Recognition [54.44798086835314]
Speech emotion recognition (SER) plays a critical role in building emotion-aware speech systems, but its performance degrades significantly under noisy conditions. We propose the Sparse Mixture-of-Experts Representation Integration Technique (Sparse MERIT), a flexible MTL framework that applies frame-wise expert routing over self-supervised speech representations. Experiments on the MSP-Podcast corpus show that Sparse MERIT consistently outperforms baseline models on both SER and SE tasks.
arXiv Detail & Related papers (2025-09-10T10:18:56Z)
Designing Practical Models for Isolated Word Visual Speech Recognition [9.502316537342372]
Visual speech recognition (VSR) systems decode spoken words from an input sequence using only the video data. Practical applications of such systems include medical assistance as well as human-machine interactions. We develop lightweight end-to-end architectures by first using efficient models from the image classification literature, and then adopting lightweight block designs in a temporal convolution network backbone.
arXiv Detail & Related papers (2025-08-25T11:04:36Z)
Real-Time Emergency Vehicle Siren Detection with Efficient CNNs on Embedded Hardware [0.26249027950824516]
We present a full-stack emergency vehicle siren detection system designed for real-time deployment on embedded hardware. The proposed approach is based on E2PANNs, a fine-tuned convolutional neural network derived from EPANNs. A remote WebSocket interface provides real-time monitoring and facilitates live demonstration capabilities.
arXiv Detail & Related papers (2025-07-02T10:27:41Z)
SigWavNet: Learning Multiresolution Signal Wavelet Network for Speech Emotion Recognition [17.568724398229232]
Speech emotion recognition (SER) plays an important role in deciphering emotional states from speech signals. This paper introduces a new end-to-end (E2E) deep learning multi-resolution framework for SER. It exploits the capabilities of wavelets for effective localization in both time and frequency domains.
arXiv Detail & Related papers (2025-02-01T04:18:06Z)
Turbocharge Speech Understanding with Pilot Inference [0.9699101045941684]
This paper sets out to accelerate modern speech understanding on resource-constrained edge devices. It takes a hybrid approach: to speed up on-device execution; to offload inputs that are beyond the device's capacity. Our prototype, called PASU, is tested on Arm platforms with 6-8 cores: it delivers SOTA accuracy; it reduces the end-to-end latency by 2x and reduces the offloading needs by 2x.
arXiv Detail & Related papers (2023-11-22T17:14:18Z)
MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge [87.41163540910854]
Deep neural network (DNN) latency characterization is a time-consuming process. We propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z)
A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information. We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF). The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for 5G and Beyond [70.81551587109833]
Nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity. One of the main challenges comes from the real-time implementation of these algorithms. This paper explores the acceleration of APSM-based algorithms through massive parallelization.
arXiv Detail & Related papers (2022-01-13T15:20:45Z)
An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices. We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations. Based on the latency and accuracy aware reward design, such a computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
Dissecting User-Perceived Latency of On-Device E2E Speech Recognition [34.645194215436966]
We show that factors affecting token emission latency and endpointing behavior significantly impact user-perceived latency (UPL). We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing, and using the recently proposed alignment regularization.
arXiv Detail & Related papers (2021-04-06T00:55:11Z)
