I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in
Social Robots
- URL: http://arxiv.org/abs/2311.08957v1
- Date: Wed, 15 Nov 2023 13:47:00 GMT
- Title: I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in
Social Robots
- Authors: Giulio Antonio Abbo and Tony Belpaeme
- Abstract summary: This paper presents an initial implementation of a dialogue manager that enhances the traditional text-based prompts with real-time visual input.
The system's prompt engineering, which combines the dialogue history with summaries of the images, balances context preservation and computational efficiency.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the rapidly evolving landscape of human-computer interaction, the
integration of vision capabilities into conversational agents stands as a
crucial advancement. This paper presents an initial implementation of a
dialogue manager that leverages the latest progress in Large Language Models
(e.g., GPT-4, IDEFICS) to enhance the traditional text-based prompts with
real-time visual input. LLMs are used to interpret both textual prompts and
visual stimuli, creating a more contextually aware conversational agent. The
system's prompt engineering, incorporating dialogue with summarisation of the
images, ensures a balance between context preservation and computational
efficiency. Six interactions with a Furhat robot powered by this system are
reported, illustrating and discussing the results obtained. By implementing
this vision-enabled dialogue system, the paper envisions a future where
conversational agents seamlessly blend textual and visual modalities, enabling
richer, more context-aware dialogues.
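The abstract's core idea, keeping only text summaries of camera frames in the dialogue history rather than the raw images, can be sketched as follows. This is a minimal illustration of that prompt-engineering pattern, not the paper's actual implementation; the function names `summarise_image` and `chat_completion` are hypothetical stand-ins for a vision-capable model (e.g. IDEFICS) and a text LLM (e.g. GPT-4).

```python
# Minimal sketch of the prompt-engineering idea from the abstract:
# each new camera frame is summarised into a short caption, and only the
# captions (not raw images) are kept in the dialogue history, trading full
# visual detail for a compact, context-preserving prompt.
# summarise_image and chat_completion are hypothetical callables, injected
# here so the sketch stays model-agnostic.

from dataclasses import dataclass, field


@dataclass
class VisionDialogueManager:
    # Alternating (role, text) turns: "scene", "user", or "robot".
    history: list = field(default_factory=list)

    def observe(self, image_bytes: bytes, summarise_image) -> None:
        # Replace the raw image with a one-line caption to bound prompt size.
        caption = summarise_image(image_bytes)
        self.history.append(("scene", caption))

    def respond(self, user_utterance: str, chat_completion) -> str:
        self.history.append(("user", user_utterance))
        # Linearise the mixed text/visual history into a single text prompt.
        prompt = "\n".join(f"[{role}] {text}" for role, text in self.history)
        reply = chat_completion(prompt)
        self.history.append(("robot", reply))
        return reply
```

Because the history holds captions instead of images, prompt length grows linearly with the number of turns rather than with raw image payloads, which is the context-versus-efficiency trade-off the abstract describes.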
Related papers
- WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z)
- A Graph-to-Text Approach to Knowledge-Grounded Response Generation in Human-Robot Interaction [2.3590037806133024]
This paper presents a novel conversational model for human-robot interaction that rests upon a graph-based representation of the dialogue state.
The neural conversational model employed to respond to user utterances relies on a simple but effective graph-to-text mechanism.
The proposed approach is empirically evaluated through a user study with a humanoid robot.
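The "graph-to-text" step this summary mentions can be illustrated with a small sketch: the dialogue state is held as (subject, relation, object) triples and linearised into a flat string that a text-based response generator can condition on. The triple schema and the `" | "` separator are illustrative assumptions, not the cited paper's actual format.

```python
# Hedged sketch of a graph-to-text linearisation: a dialogue-state graph,
# stored as (subject, relation, object) triples, is flattened into one
# string for a neural text-to-text response generator to consume.

def graph_to_text(triples):
    """Linearise a dialogue-state graph into a single prompt string."""
    return " | ".join(f"{s} {r} {o}" for s, r, o in triples)

# Example dialogue state (illustrative content only).
state = [
    ("user", "asked_about", "museum hours"),
    ("museum", "opens_at", "09:00"),
]
prompt = graph_to_text(state)
# prompt can then be prepended to the user utterance and fed to the model
```

The appeal of this pattern is that the generator needs no graph-specific architecture; any sequence model can consume the linearised state.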
arXiv Detail & Related papers (2023-11-03T15:44:28Z)
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review [40.49926141538684]
Visual Context Augmented Dialogue System (VAD) has the potential to communicate with humans by perceiving and understanding multimodal information.
VAD possesses the potential to generate engaging and context-aware responses.
arXiv Detail & Related papers (2022-07-02T09:31:37Z)
- A Review of Dialogue Systems: From Trained Monkeys to Stochastic Parrots [0.0]
We aim to deploy artificial intelligence to build automated dialogue agents that can converse with humans.
We present a broad overview of methods developed to build dialogue systems over the years.
arXiv Detail & Related papers (2021-11-02T08:07:55Z)
- Advances in Multi-turn Dialogue Comprehension: A Survey [51.215629336320305]
Training machines to understand natural language and interact with humans is an elusive and essential task of artificial intelligence.
This paper reviews the previous methods from the technical perspective of dialogue modeling for the dialogue comprehension task.
In addition, we categorize dialogue-related pre-training techniques which are employed to enhance PrLMs in dialogue scenarios.
arXiv Detail & Related papers (2021-10-11T03:52:37Z)
- "How Robust r u?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations [87.95711406978157]
This work presents a new benchmark on spoken task-oriented conversations.
We study multi-domain dialogue state tracking and knowledge-grounded dialogue modeling.
Our data set enables speech-based benchmarking of task-oriented dialogue systems.
arXiv Detail & Related papers (2021-09-28T04:51:04Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.