I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in
Social Robots
- URL: http://arxiv.org/abs/2311.08957v1
- Date: Wed, 15 Nov 2023 13:47:00 GMT
- Title: I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in
Social Robots
- Authors: Giulio Antonio Abbo and Tony Belpaeme
- Abstract summary: This paper presents an initial implementation of a dialogue manager that enhances the traditional text-based prompts with real-time visual input.
The system's prompt engineering, incorporating dialogue with summarisation of the images, ensures a balance between context preservation and computational efficiency.
- Score: 0.040792653193642496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the rapidly evolving landscape of human-computer interaction, the
integration of vision capabilities into conversational agents stands as a
crucial advancement. This paper presents an initial implementation of a
dialogue manager that leverages the latest progress in Large Language Models
(e.g., GPT-4, IDEFICS) to enhance the traditional text-based prompts with
real-time visual input. LLMs are used to interpret both textual prompts and
visual stimuli, creating a more contextually aware conversational agent. The
system's prompt engineering, incorporating dialogue with summarisation of the
images, ensures a balance between context preservation and computational
efficiency. Six interactions with a Furhat robot powered by this system are
reported, illustrating and discussing the results obtained. By implementing
this vision-enabled dialogue system, the paper envisions a future where
conversational agents seamlessly blend textual and visual modalities, enabling
richer, more context-aware dialogues.
Related papers
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction [105.88658935310605]
We propose a multi-stage training methodology that progressively trains LLM to understand both visual and speech information.
Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities.
By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities.
arXiv Detail & Related papers (2025-01-03T18:59:52Z) - Incremental Dialogue Management: Survey, Discussion, and Implications for HRI [16.34485107181007]
We review the literature on interactive systems that operate incrementally (i.e., at the word level or below it).
We motivate the need for incremental systems, survey incremental modeling of important aspects of dialogue like speech recognition and language generation.
We find that there is very little research on incremental dialogue management, offer some requirements for practical incremental dialogue management, and the implications of incremental dialogue for embodied, robotic platforms.
arXiv Detail & Related papers (2025-01-01T20:58:03Z) - WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z) - A Graph-to-Text Approach to Knowledge-Grounded Response Generation in
Human-Robot Interaction [2.3590037806133024]
This paper presents a novel conversational model for human--robot interaction that rests upon a graph-based representation of the dialogue state.
The neural conversational model employed to respond to user utterances relies on a simple but effective graph-to-text mechanism.
The proposed approach is empirically evaluated through a user study with a humanoid robot.
arXiv Detail & Related papers (2023-11-03T15:44:28Z) - Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z) - Enabling Harmonious Human-Machine Interaction with Visual-Context
Augmented Dialogue System: A Review [40.49926141538684]
Visual Context Augmented Dialogue System (VAD) has the potential to communicate with humans by perceiving and understanding multimodal information.
VAD possesses the potential to generate engaging and context-aware responses.
arXiv Detail & Related papers (2022-07-02T09:31:37Z) - Advances in Multi-turn Dialogue Comprehension: A Survey [51.215629336320305]
Training machines to understand natural language and interact with humans is an elusive and essential task of artificial intelligence.
This paper reviews the previous methods from the technical perspective of dialogue modeling for the dialogue comprehension task.
In addition, we categorize dialogue-related pre-training techniques which are employed to enhance PrLMs in dialogue scenarios.
arXiv Detail & Related papers (2021-10-11T03:52:37Z) - Advances in Multi-turn Dialogue Comprehension: A Survey [51.215629336320305]
We review the previous methods from the perspective of dialogue modeling.
We discuss three typical patterns of dialogue modeling that are widely-used in dialogue comprehension tasks.
arXiv Detail & Related papers (2021-03-04T15:50:17Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.