HARMONI: Multimodal Personalization of Multi-User Human-Robot Interactions with LLMs
- URL: http://arxiv.org/abs/2601.19839v1
- Date: Tue, 27 Jan 2026 17:45:04 GMT
- Title: HARMONI: Multimodal Personalization of Multi-User Human-Robot Interactions with LLMs
- Authors: Jeanne Malécot, Hamed Rahimi, Jeanne Cattoni, Marie Samson, Mouad Abrini, Mahdi Khoramshahi, Maribel Pino, Mohamed Chetouani
- Abstract summary: We present HARMONI, a multimodal personalization framework that enables socially assistive robots to manage long-term multi-user interactions. The framework integrates four key modules: (i) a perception module that identifies active speakers and extracts multimodal input; (ii) a world modeling module that maintains representations of the environment and short-term conversational context; (iii) a user modeling module that updates long-term speaker-specific profiles; and (iv) a generation module that produces contextually grounded and ethically informed responses.
- Score: 1.4755786263360526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing human-robot interaction systems often lack mechanisms for sustained personalization and dynamic adaptation in multi-user environments, limiting their effectiveness in real-world deployments. We present HARMONI, a multimodal personalization framework that leverages large language models to enable socially assistive robots to manage long-term multi-user interactions. The framework integrates four key modules: (i) a perception module that identifies active speakers and extracts multimodal input; (ii) a world modeling module that maintains representations of the environment and short-term conversational context; (iii) a user modeling module that updates long-term speaker-specific profiles; and (iv) a generation module that produces contextually grounded and ethically informed responses. Through extensive evaluation and ablation studies on four datasets, as well as a real-world scenario-driven user-study in a nursing home environment, we demonstrate that HARMONI supports robust speaker identification, online memory updating, and ethically aligned personalization, outperforming baseline LLM-driven approaches in user modeling accuracy, personalization quality, and user satisfaction.
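To make the four-module pipeline described in the abstract concrete, here is a minimal Python sketch of how perception, world modeling, user modeling, and generation could be composed around an LLM. Every class, method, and field name below is an illustrative assumption, not the authors' published interface.

```python
# Hypothetical sketch of a HARMONI-style four-module loop; names and data
# structures are illustrative assumptions, not the authors' released API.
from dataclasses import dataclass, field


@dataclass
class Utterance:
    speaker_id: str          # identity resolved by the perception module
    text: str                # transcribed speech
    visual_context: dict     # e.g. detected objects, gaze, facial cues


@dataclass
class UserProfile:
    speaker_id: str
    facts: list = field(default_factory=list)   # long-term, speaker-specific memories


class HarmoniLikePipeline:
    """Toy composition of perception / world model / user model / generation."""

    def __init__(self, llm):
        self.llm = llm            # any chat-completion style callable: str -> str
        self.world_state = []     # short-term conversational context
        self.profiles = {}        # speaker_id -> UserProfile

    def perceive(self, audio, video) -> Utterance:
        # Placeholder: a real system would run speaker identification, ASR,
        # and visual feature extraction here.
        ...

    def step(self, utterance: Utterance) -> str:
        # World modeling: keep a rolling short-term context window.
        self.world_state.append((utterance.speaker_id, utterance.text))
        self.world_state = self.world_state[-20:]

        # User modeling: update the long-term profile of the active speaker.
        profile = self.profiles.setdefault(
            utterance.speaker_id, UserProfile(utterance.speaker_id)
        )
        profile.facts.append(utterance.text)   # a real system would extract and filter facts

        # Generation: condition the LLM on short-term context and the profile.
        prompt = (
            f"Known facts about {utterance.speaker_id}: {profile.facts[-5:]}\n"
            f"Recent dialogue: {self.world_state}\n"
            f"Reply helpfully and appropriately to: {utterance.text}"
        )
        return self.llm(prompt)
```

In the actual system, the profile update and response generation would additionally be constrained by the ethical-alignment considerations the paper describes; the sketch only shows the data flow between the four modules.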
Related papers
- Dynamic Personality Adaptation in Large Language Models via State Machines [1.6986898305640261]
We propose a model-agnostic framework for dynamic personality simulation that employs state machines to represent latent personality states. Part of our architecture is a modular pipeline for continuous personality scoring that evaluates dialogues along latent axes. Results demonstrate that the system not only adapts its personality state to user inputs but also influences user behavior.
arXiv Detail & Related papers (2026-02-25T18:05:11Z) - AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding [73.05946667683259]
Recent multimodal large language models (MLLMs) show strong perception but struggle in multi-speaker, dialogue-centric settings. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic. We propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self-evaluation.
arXiv Detail & Related papers (2025-12-18T07:01:47Z) - InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue [35.99134148462425]
We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction. To achieve this, we integrate the vision encoder, audio encoder, large language model, and speech decoder into a unified model for understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities, including pre-training for omni-modal understanding.
arXiv Detail & Related papers (2025-10-15T16:52:48Z) - FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction [49.83226596963294]
Full-duplex speech interaction enables real-time spoken dialogue systems. Modeling and benchmarking these systems remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex human-LLM spoken interaction.
arXiv Detail & Related papers (2025-09-26T11:57:42Z) - ASMR: Augmenting Life Scenario using Large Generative Models for Robotic Action Reflection [21.75681306780917]
This paper introduces a novel framework focusing on data augmentation in robotic assistance scenarios. It involves leveraging a sophisticated large language model to simulate potential conversations and environmental contexts. The additionally generated data serves to refine the latest multimodal models, enabling them to more accurately determine appropriate actions.
arXiv Detail & Related papers (2025-06-16T19:58:54Z) - Hello Again! LLM-powered Personalized Agent for Long-term Dialogue [63.65128176360345]
We introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent). It incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation. The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated.
arXiv Detail & Related papers (2024-06-09T21:58:32Z) - LEGENT: Open Platform for Embodied Agents [60.71847900126832]
We introduce LEGENT, an open, scalable platform for developing embodied agents using Large Language Models (LLMs) and Large Multimodal Models (LMMs).
LEGENT offers a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface.
In experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks.
arXiv Detail & Related papers (2024-04-28T16:50:12Z) - MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments [82.67236400004826]
We introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.
Its MEM module enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities.
arXiv Detail & Related papers (2024-02-01T02:43:20Z) - Chat with the Environment: Interactive Multimodal Perception Using Large Language Models [19.623070762485494]
Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning.
Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment.
arXiv Detail & Related papers (2023-03-14T23:01:27Z) - PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z) - Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions [33.67477398036821]
We present Dyadformer, a novel multi-modal multi-subject Transformer architecture to model individual and interpersonal features in dyadic interactions.
Our proposed cross-subject layer allows the network to explicitly model interactions among subjects through attentional operations; a minimal sketch of this idea appears after this list.
This proof-of-concept approach shows how multi-modality and joint modeling of both interactants over longer periods of time help to predict individual attributes.
arXiv Detail & Related papers (2021-09-20T12:45:04Z)
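As referenced in the Dyadformer entry above, the following Python sketch illustrates one plausible form of a "cross-subject" attention layer, in which each interactant's feature sequence attends to the other's. The layer names, dimensions, and wiring are assumptions for illustration, not the authors' released implementation.

```python
# Illustrative cross-subject attention layer in the spirit of Dyadformer;
# all hyperparameters and module names are assumptions for this sketch.
import torch
import torch.nn as nn


class CrossSubjectLayer(nn.Module):
    """Lets each interactant's token sequence attend to the other's."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn_a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor):
        # feats_a, feats_b: (batch, time, dim) per-subject multimodal features.
        # Subject A queries subject B's features and vice versa, so the model
        # captures interpersonal dynamics rather than each person in isolation.
        a_ctx, _ = self.attn_a_to_b(feats_a, feats_b, feats_b)
        b_ctx, _ = self.attn_b_to_a(feats_b, feats_a, feats_a)
        return self.norm_a(feats_a + a_ctx), self.norm_b(feats_b + b_ctx)


if __name__ == "__main__":
    a = torch.randn(2, 50, 256)   # subject A: 50 time steps of fused features
    b = torch.randn(2, 50, 256)   # subject B
    out_a, out_b = CrossSubjectLayer()(a, b)
    print(out_a.shape, out_b.shape)  # torch.Size([2, 50, 256]) twice
```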
This list is automatically generated from the titles and abstracts of the papers on this site.