Agent-Based Modular Learning for Multimodal Emotion Recognition in Human-Agent Systems
- URL: http://arxiv.org/abs/2512.10975v1
- Date: Tue, 02 Dec 2025 21:47:00 GMT
- Title: Agent-Based Modular Learning for Multimodal Emotion Recognition in Human-Agent Systems
- Authors: Matvey Nepomnyaschiy, Oleg Pereziabov, Anvar Tliamov, Stanislav Mikhailov, Ilya Afanasyev
- Abstract summary: We propose a novel multi-agent framework for training multimodal emotion recognition systems. This architecture enables modular integration of new modalities. We demonstrate the feasibility of our approach through a proof-of-concept implementation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Effective human-agent interaction (HAI) relies on accurate and adaptive perception of human emotional states. While multimodal deep learning models - leveraging facial expressions, speech, and textual cues - offer high accuracy in emotion recognition, their training and maintenance are often computationally intensive and inflexible to modality changes. In this work, we propose a novel multi-agent framework for training multimodal emotion recognition systems, where each modality encoder and the fusion classifier operate as autonomous agents coordinated by a central supervisor. This architecture enables modular integration of new modalities (e.g., audio features via emotion2vec), seamless replacement of outdated components, and reduced computational overhead during training. We demonstrate the feasibility of our approach through a proof-of-concept implementation supporting vision, audio, and text modalities, with the classifier serving as a shared decision-making agent. Our framework not only improves training efficiency but also contributes to the design of more flexible, scalable, and maintainable perception modules for embodied and virtual agents in HAI scenarios.
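The central architectural idea in the abstract is that each modality encoder and the fusion classifier run as autonomous agents coordinated by a central supervisor. Below is a minimal, illustrative Python sketch of that pattern; all class names, method signatures, and the placeholder encoder and classifier logic are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed names, not the paper's code) of a supervisor
# coordinating modality-encoder agents and a shared fusion-classifier agent.
from abc import ABC, abstractmethod
import numpy as np


class ModalityAgent(ABC):
    """Autonomous encoder agent for a single modality (vision, audio, or text)."""

    def __init__(self, name: str, dim: int):
        self.name = name
        self.dim = dim

    @abstractmethod
    def encode(self, raw_input) -> np.ndarray:
        """Map raw modality input to a fixed-size embedding."""


class DummyVisionAgent(ModalityAgent):
    """Stand-in for a facial-expression encoder; returns a fixed random embedding."""

    def encode(self, raw_input) -> np.ndarray:
        return np.random.default_rng(0).normal(size=self.dim)


class FusionClassifierAgent:
    """Shared decision-making agent: fuses embeddings and predicts an emotion label."""

    def __init__(self, labels):
        self.labels = labels

    def predict(self, embeddings: dict) -> str:
        fused = np.concatenate(list(embeddings.values()))
        # Placeholder for a trained classifier head over the fused features.
        return self.labels[int(abs(fused.sum())) % len(self.labels)]


class Supervisor:
    """Central coordinator: registers, replaces, and queries modality agents."""

    def __init__(self, classifier: FusionClassifierAgent):
        self.agents = {}
        self.classifier = classifier

    def register(self, agent: ModalityAgent) -> None:
        # Adding or swapping a modality touches only this registry,
        # which is what makes the architecture modular.
        self.agents[agent.name] = agent

    def recognize(self, inputs: dict) -> str:
        embeddings = {name: agent.encode(inputs[name])
                      for name, agent in self.agents.items() if name in inputs}
        return self.classifier.predict(embeddings)


if __name__ == "__main__":
    supervisor = Supervisor(FusionClassifierAgent(["neutral", "happy", "sad", "angry"]))
    supervisor.register(DummyVisionAgent("vision", dim=128))
    print(supervisor.recognize({"vision": "frame_0001.png"}))  # illustrative input
```

Because agents are registered with the supervisor rather than hard-wired into a monolithic model, adding a new modality (e.g., an emotion2vec-based audio agent) or replacing an outdated encoder would only require another registration call, which is the modularity the abstract emphasizes.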
Related papers
- MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning [53.37068897861388]
MedSAM-Agent is a framework that reformulates interactive segmentation as a multi-step autonomous decision-making process. We develop a two-stage training pipeline that integrates multi-turn, end-to-end outcome verification. Experiments across 6 medical modalities and 21 datasets demonstrate that MedSAM-Agent achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-03T09:47:49Z)
- Multiagent Reinforcement Learning with Neighbor Action Estimation [5.226225544973531]
This paper proposes an enhanced multiagent reinforcement learning framework that employs action estimation neural networks to infer agent behaviors. At the engineering application level, this framework has been implemented and validated in dual-arm robotic manipulation tasks.
arXiv Detail & Related papers (2026-01-08T02:26:57Z)
- Feature Aggregation for Efficient Continual Learning of Complex Facial Expressions [0.0]
We propose a hybrid framework for facial expression recognition (FER). We show that our model can first learn basic expressions and then progressively recognize compound expressions. Experiments demonstrate improved accuracy, stronger knowledge retention, and reduced forgetting.
arXiv Detail & Related papers (2025-12-13T10:39:17Z)
- Multi-Modal Manipulation via Multi-Modal Policy Consensus [62.49978559936122]
We propose a new approach to integrate diverse sensory modalities for robotic manipulation. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion.
arXiv Detail & Related papers (2025-09-27T19:43:04Z)
- Foundation Model for Skeleton-Based Human Action Understanding [56.89025287217221]
This paper presents a Unified Skeleton-based Dense Representation Learning framework. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT).
arXiv Detail & Related papers (2025-08-18T02:42:16Z)
- Emotion Recognition with CLIP and Sequential Learning [5.66758879852618]
We present our innovative methodology for tackling the Valence-Arousal (VA) Estimation Challenge, the Expression Recognition Challenge, and the Action Unit (AU) Detection Challenge. Our approach introduces a novel framework aimed at enhancing continuous emotion recognition.
arXiv Detail & Related papers (2025-03-13T01:02:06Z)
- An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
- Versatile audio-visual learning for emotion recognition [28.26077129002198]
This study proposes a versatile audio-visual learning framework for handling unimodal and multimodal systems.
We achieve this effective representation learning with audio-visual shared layers, residual connections over shared layers, and a unimodal reconstruction task.
Notably, VAVL attains a new state-of-the-art performance in the emotional prediction task on the MSP-IMPROV corpus.
arXiv Detail & Related papers (2023-05-12T03:13:37Z)
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities (a minimal late-fusion sketch follows this list).
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
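The last entry above relies on late fusion: each unimodal model makes its own prediction, and only the per-modality class scores are combined. The sketch below is a hypothetical illustration of that idea, not the cited paper's code; the weighting scheme and the four-class setup are assumptions.

```python
# Minimal late-fusion sketch: combine class scores from separately trained
# speech and text emotion models. All values and weights are illustrative.
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()


def late_fusion(speech_logits: np.ndarray,
                text_logits: np.ndarray,
                speech_weight: float = 0.5) -> int:
    """Fuse per-modality class scores after each model has made its own prediction."""
    fused = (speech_weight * softmax(speech_logits)
             + (1.0 - speech_weight) * softmax(text_logits))
    return int(fused.argmax())


# Example with 4 emotion classes (e.g., neutral, happy, sad, angry).
speech_scores = np.array([0.2, 1.5, 0.1, 0.3])  # from a speech model
text_scores = np.array([0.1, 0.4, 2.0, 0.2])    # from a BERT-style text model
print(late_fusion(speech_scores, text_scores))  # index of the fused prediction
```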
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.