HARMONI: Multimodal Personalization of Multi-User Human-Robot Interactions with LLMs
- URL: http://arxiv.org/abs/2601.19839v1
- Date: Tue, 27 Jan 2026 17:45:04 GMT
- Title: HARMONI: Multimodal Personalization of Multi-User Human-Robot Interactions with LLMs
- Authors: Jeanne Malécot, Hamed Rahimi, Jeanne Cattoni, Marie Samson, Mouad Abrini, Mahdi Khoramshahi, Maribel Pino, Mohamed Chetouani
- Abstract summary: We present HARMONI, a multimodal personalization framework that enables socially assistive robots to manage long-term multi-user interactions. The framework integrates four key modules: (i) a perception module that identifies active speakers and extracts multimodal input; (ii) a world modeling module that maintains representations of the environment and short-term conversational context; (iii) a user modeling module that updates long-term speaker-specific profiles; and (iv) a generation module that produces contextually grounded and ethically informed responses.
- Score: 1.4755786263360526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing human-robot interaction systems often lack mechanisms for sustained personalization and dynamic adaptation in multi-user environments, limiting their effectiveness in real-world deployments. We present HARMONI, a multimodal personalization framework that leverages large language models to enable socially assistive robots to manage long-term multi-user interactions. The framework integrates four key modules: (i) a perception module that identifies active speakers and extracts multimodal input; (ii) a world modeling module that maintains representations of the environment and short-term conversational context; (iii) a user modeling module that updates long-term speaker-specific profiles; and (iv) a generation module that produces contextually grounded and ethically informed responses. Through extensive evaluation and ablation studies on four datasets, as well as a real-world scenario-driven user-study in a nursing home environment, we demonstrate that HARMONI supports robust speaker identification, online memory updating, and ethically aligned personalization, outperforming baseline LLM-driven approaches in user modeling accuracy, personalization quality, and user satisfaction.
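To make the four-module pipeline described in the abstract concrete, here is a minimal Python sketch of how perception, world modeling, user modeling, and generation could be composed around an LLM. Every class, method, and field name below is an illustrative assumption, not the authors' published interface.

```python
# Hypothetical sketch of a HARMONI-style four-module loop; names and data
# structures are illustrative assumptions, not the authors' released API.
from dataclasses import dataclass, field


@dataclass
class Utterance:
    speaker_id: str          # identity resolved by the perception module
    text: str                # transcribed speech
    visual_context: dict     # e.g. detected objects, gaze, facial cues


@dataclass
class UserProfile:
    speaker_id: str
    facts: list = field(default_factory=list)   # long-term, speaker-specific memories


class HarmoniLikePipeline:
    """Toy composition of perception / world model / user model / generation."""

    def __init__(self, llm):
        self.llm = llm            # any chat-completion style callable: str -> str
        self.world_state = []     # short-term conversational context
        self.profiles = {}        # speaker_id -> UserProfile

    def perceive(self, audio, video) -> Utterance:
        # Placeholder: a real system would run speaker identification, ASR,
        # and visual feature extraction here.
        ...

    def step(self, utterance: Utterance) -> str:
        # World modeling: keep a rolling short-term context window.
        self.world_state.append((utterance.speaker_id, utterance.text))
        self.world_state = self.world_state[-20:]

        # User modeling: update the long-term profile of the active speaker.
        profile = self.profiles.setdefault(
            utterance.speaker_id, UserProfile(utterance.speaker_id)
        )
        profile.facts.append(utterance.text)   # a real system would extract and filter facts

        # Generation: condition the LLM on short-term context and the profile.
        prompt = (
            f"Known facts about {utterance.speaker_id}: {profile.facts[-5:]}\n"
            f"Recent dialogue: {self.world_state}\n"
            f"Reply helpfully and appropriately to: {utterance.text}"
        )
        return self.llm(prompt)
```

In the actual system, the profile update and response generation would additionally be constrained by the ethical-alignment considerations the paper describes; the sketch only shows the data flow between the four modules.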
Related papers
- Dynamic Personality Adaptation in Large Language Models via State Machines [1.6986898305640261]
We propose a model-agnostic framework for dynamic personality simulation that employs state machines to represent latent personality states. Part of our architecture is a modular pipeline for continuous personality scoring that evaluates dialogues along latent axes. Results demonstrate that the system not only adapts its personality state to user inputs but also influences user behavior.
arXiv Detail & Related papers (2026-02-25T18:05:11Z) - AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding [73.05946667683259]
Recent multimodal large language models (MLLMs) show strong perception but struggle in multi-speaker, dialogue-centric settings. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic. We propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self-evaluation.
arXiv Detail & Related papers (2025-12-18T07:01:47Z) - InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue [35.99134148462425]
We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction. To achieve this, we integrate the vision encoder, audio encoder, large language model, and speech decoder into a unified model for understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities, including pre-training for omni-modal understanding.
arXiv Detail & Related papers (2025-10-15T16:52:48Z) - FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction [49.83226596963294]
Full-duplex speech interaction enables real-time spoken dialogue systems. Modeling and benchmarking these systems remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex human-LLM spoken interaction.
arXiv Detail & Related papers (2025-09-26T11:57:42Z) - ASMR: Augmenting Life Scenario using Large Generative Models for Robotic Action Reflection [21.75681306780917]
This paper introduces a novel framework focusing on data augmentation in robotic assistance scenarios. It involves leveraging a sophisticated large language model to simulate potential conversations and environmental contexts. The additionally generated data serves to refine the latest multimodal models, enabling them to more accurately determine appropriate actions.
arXiv Detail & Related papers (2025-06-16T19:58:54Z) - Hello Again! LLM-powered Personalized Agent for Long-term Dialogue [63.65128176360345]
We introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent). It incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation. The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated.
arXiv Detail & Related papers (2024-06-09T21:58:32Z) - LEGENT: Open Platform for Embodied Agents [60.71847900126832]
We introduce LEGENT, an open, scalable platform for developing embodied agents using Large Language Models (LLMs) and Large Multimodal Models (LMMs).
LEGENT offers a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface.
In experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks.
arXiv Detail & Related papers (2024-04-28T16:50:12Z) - MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments [82.67236400004826]
We introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.
Its MEM module enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities.
arXiv Detail & Related papers (2024-02-01T02:43:20Z) - Chat with the Environment: Interactive Multimodal Perception Using Large Language Models [19.623070762485494]
Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning.
Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment.
arXiv Detail & Related papers (2023-03-14T23:01:27Z) - PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z) - Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions [33.67477398036821]
We present Dyadformer, a novel multi-modal multi-subject Transformer architecture to model individual and interpersonal features in dyadic interactions.
Our proposed cross-subject layer allows the network to explicitly model interactions among subjects through attentional operations; a minimal sketch of this idea appears after this list.
This proof-of-concept approach shows how multi-modality and joint modeling of both interactants over longer periods of time help to predict individual attributes.
arXiv Detail & Related papers (2021-09-20T12:45:04Z)
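As referenced in the Dyadformer entry above, the following Python sketch illustrates one plausible form of a "cross-subject" attention layer, in which each interactant's feature sequence attends to the other's. The layer names, dimensions, and wiring are assumptions for illustration, not the authors' released implementation.

```python
# Illustrative cross-subject attention layer in the spirit of Dyadformer;
# all hyperparameters and module names are assumptions for this sketch.
import torch
import torch.nn as nn


class CrossSubjectLayer(nn.Module):
    """Lets each interactant's token sequence attend to the other's."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn_a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor):
        # feats_a, feats_b: (batch, time, dim) per-subject multimodal features.
        # Subject A queries subject B's features and vice versa, so the model
        # captures interpersonal dynamics rather than each person in isolation.
        a_ctx, _ = self.attn_a_to_b(feats_a, feats_b, feats_b)
        b_ctx, _ = self.attn_b_to_a(feats_b, feats_a, feats_a)
        return self.norm_a(feats_a + a_ctx), self.norm_b(feats_b + b_ctx)


if __name__ == "__main__":
    a = torch.randn(2, 50, 256)   # subject A: 50 time steps of fused features
    b = torch.randn(2, 50, 256)   # subject B
    out_a, out_b = CrossSubjectLayer()(a, b)
    print(out_a.shape, out_b.shape)  # torch.Size([2, 50, 256]) twice
```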
This list is automatically generated from the titles and abstracts of the papers on this site.