Towards Multimodal Social Conversations with Robots: Using Vision-Language Models
- URL: http://arxiv.org/abs/2507.19196v1
- Date: Fri, 25 Jul 2025 12:06:53 GMT
- Title: Towards Multimodal Social Conversations with Robots: Using Vision-Language Models
- Authors: Ruben Janssens, Tony Belpaeme
- Abstract summary: We argue that vision-language models are able to process this wide range of visual information in a sufficiently general manner for autonomous social robots. We describe how to adapt them to this setting, which technical challenges remain, and briefly discuss evaluation practices.
- Score: 0.034530027457861996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have given social robots the ability to autonomously engage in open-domain conversations. However, they are still missing a fundamental social skill: making use of the multiple modalities that carry social interactions. While previous work has focused on task-oriented interactions that require referencing the environment or specific phenomena in social interactions such as dialogue breakdowns, we outline the overall needs of a multimodal system for social conversations with robots. We then argue that vision-language models are able to process this wide range of visual information in a sufficiently general manner for autonomous social robots. We describe how to adapt them to this setting, which technical challenges remain, and briefly discuss evaluation practices.
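As a rough illustration of the kind of adaptation the abstract refers to, the sketch below shows one way a robot's dialogue loop might hand a camera frame plus the running conversation history to an off-the-shelf vision-language model. This is a minimal sketch, not the authors' method: the `DialogueState`, `build_prompt`, and `query_vlm` names, the prompt wording, and the placeholder backend are assumptions made here for illustration only.

```python
# Minimal sketch (assumed, not from the paper): packaging multimodal context
# for a social conversation into a single VLM query.

from dataclasses import dataclass, field
from typing import List


@dataclass
class DialogueState:
    """Running context for an open-domain social conversation."""
    history: List[str] = field(default_factory=list)  # alternating user/robot turns

    def add_turn(self, speaker: str, text: str) -> None:
        self.history.append(f"{speaker}: {text}")


def build_prompt(state: DialogueState) -> str:
    """Fold the dialogue history and a social-conversation instruction into one
    text prompt; the attached camera frame supplies the visual context (scene,
    people, expressions) that a text-only LLM cannot see."""
    instruction = (
        "You are a social robot having a casual conversation. "
        "Use what you see in the image (people, expressions, surroundings) "
        "to keep the conversation natural and socially appropriate."
    )
    return instruction + "\n" + "\n".join(state.history) + "\nRobot:"


def query_vlm(image_bytes: bytes, prompt: str) -> str:
    """Hypothetical placeholder for the actual vision-language model call
    (e.g. a locally hosted open VLM or a hosted API); swap in whichever
    backend is available."""
    raise NotImplementedError("plug in a concrete VLM backend here")


def robot_turn(state: DialogueState, camera_frame: bytes, user_utterance: str) -> str:
    """One conversational turn: record the user's utterance, query the VLM with
    the current frame and history, and record the robot's reply."""
    state.add_turn("User", user_utterance)
    reply = query_vlm(camera_frame, build_prompt(state))
    state.add_turn("Robot", reply)
    return reply
```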
Related papers
- Whom to Respond To? A Transformer-Based Model for Multi-Party Social Robot Interaction [4.276453870301421]
We propose a Transformer-based multi-task learning framework to improve the decision-making process of social robots. We construct a novel multi-party HRI dataset that captures real-world complexities such as gaze misalignment. Our findings contribute to the development of socially intelligent robots capable of engaging in natural and context-aware multi-party interactions.
arXiv Detail & Related papers (2025-07-15T03:42:14Z)
- Enhancing Explainability with Multimodal Context Representations for Smarter Robots [0.0]
A key issue in Human-Robot Interaction is enabling robots to effectively perceive and reason over multimodal inputs, such as audio and vision. We propose a generalized and explainable multimodal framework for context representation, designed to improve the fusion of speech and vision modalities.
arXiv Detail & Related papers (2025-02-28T13:36:47Z)
- Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding [85.63710017456792]
FuSe is a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities. We show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound. Experiments in the real world show that FuSe is able to increase success rates by over 20% compared to all considered baselines.
arXiv Detail & Related papers (2025-01-08T18:57:33Z)
- Advancing Social Intelligence in AI Agents: Technical Challenges and Open Questions [67.60397632819202]
Building socially-intelligent AI agents (Social-AI) is a multidisciplinary, multimodal research goal.
We identify a set of underlying technical challenges and open questions for researchers across computing communities to advance Social-AI.
arXiv Detail & Related papers (2024-04-17T02:57:42Z)
- Socially Pertinent Robots in Gerontological Healthcare [78.35311825198136]
This paper attempts to partially answer that question via two waves of experiments with patients and companions in a day-care gerontological facility in Paris, using a full-sized humanoid robot endowed with social and conversational interaction capabilities. Overall, users are receptive to this technology, especially when the robot's perception and action skills are robust to environmental clutter and flexible enough to handle a plethora of different interactions.
arXiv Detail & Related papers (2024-04-11T08:43:37Z)
- SoMeLVLM: A Large Vision Language Model for Social Media Processing [78.47310657638567]
We introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM).
SoMeLVLM is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation.
Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks.
arXiv Detail & Related papers (2024-02-20T14:02:45Z)
- Developing Social Robots with Empathetic Non-Verbal Cues Using Large Language Models [2.5489046505746704]
We design and label four types of empathetic non-verbal cues, abbreviated as SAFE: Speech, Action (gesture), Facial expression, and Emotion, in a social robot.
Preliminary results show distinct patterns in the robot's responses, such as a preference for calm and positive social emotions like 'joy' and 'lively', and frequent nodding gestures.
Our work lays the groundwork for future studies on human-robot interactions, emphasizing the essential role of both verbal and non-verbal cues in creating social and empathetic robots.
arXiv Detail & Related papers (2023-08-31T08:20:04Z)
- Proceeding of the 1st Workshop on Social Robots Personalisation At the crossroads between engineering and humanities (CONCATENATE) [37.838596863193565]
This workshop aims to raise an interdisciplinary discussion on personalisation in robotics, bringing researchers from different fields together to propose guidelines for personalisation.
arXiv Detail & Related papers (2023-07-10T11:11:24Z)
- Face-to-Face Contrastive Learning for Social Intelligence Question-Answering [55.90243361923828]
Multimodal methods have set the state of the art on many tasks, but have difficulty modeling complex face-to-face conversational dynamics.
We propose Face-to-Face Contrastive Learning (F2F-CL), a graph neural network designed to model social interactions.
We experimentally evaluate on the challenging Social-IQ dataset and show state-of-the-art results.
arXiv Detail & Related papers (2022-07-29T20:39:44Z)
- SocialAI: Benchmarking Socio-Cognitive Abilities in Deep Reinforcement Learning Agents [23.719833581321033]
Building embodied autonomous agents capable of participating in social interactions with humans is one of the main challenges in AI.
We argue that aiming towards human-level AI requires a broader set of key social skills.
We present SocialAI, a benchmark to assess the acquisition of social skills of DRL agents.
arXiv Detail & Related papers (2021-07-02T10:39:18Z)
- Can You be More Social? Injecting Politeness and Positivity into Task-Oriented Conversational Agents [60.27066549589362]
Social language used by human agents is associated with greater user responsiveness and task completion.
The model uses a sequence-to-sequence deep learning architecture, extended with a social language understanding element.
Evaluation in terms of content preservation and social language level using both human judgment and automatic linguistic measures shows that the model can generate responses that enable agents to address users' issues in a more socially appropriate way.
arXiv Detail & Related papers (2020-12-29T08:22:48Z)
- Spoken Language Interaction with Robots: Research Issues and Recommendations, Report from the NSF Future Directions Workshop [0.819605661841562]
Meeting human needs requires addressing new challenges in speech technology and user experience design.
More powerful adaptation methods are needed, without extensive re-engineering or the collection of massive training data.
Since robots operate in real time, their speech and language processing components must do so as well.
arXiv Detail & Related papers (2020-11-11T03:45:34Z)