Related papers: Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents

URL: http://arxiv.org/abs/2410.14141v1
Date: Fri, 18 Oct 2024 03:26:06 GMT
Title: Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents
Authors: Sabit Hassan, Hye-Young Chung, Xiang Zhi Tan, Malihe Alikhani
Abstract summary: M-CoDAL is a multimodal-dialogue system specifically designed for embodied agents to better understand and communicate in safety-critical situations. Our approach is evaluated using a newly created multimodal dataset comprising 1K safety violations extracted from 2K Reddit images. Results with this dataset demonstrate that our approach improves resolution of safety situations, user sentiment, as well as safety of the conversation.
Score: 23.960719833886984
License: http://creativecommons.org/licenses/by/4.0/
Abstract: When assisting people in daily tasks, robots need to accurately interpret visual cues and respond effectively in diverse safety-critical situations, such as sharp objects on the floor. In this context, we present M-CoDAL, a multimodal-dialogue system specifically designed for embodied agents to better understand and communicate in safety-critical situations. The system leverages discourse coherence relations to enhance its contextual understanding and communication abilities. To train this system, we introduce a novel clustering-based active learning mechanism that utilizes an external Large Language Model (LLM) to identify informative instances. Our approach is evaluated using a newly created multimodal dataset comprising 1K safety violations extracted from 2K Reddit images. These violations are annotated using a Large Multimodal Model (LMM) and verified by human annotators. Results with this dataset demonstrate that our approach improves resolution of safety situations, user sentiment, as well as safety of the conversation. Next, we deploy our dialogue system on a Hello Robot Stretch robot and conduct a within-subject user study with real-world participants. In the study, participants role-play two safety scenarios with different levels of severity with the robot and receive interventions from our model and a baseline system powered by OpenAI's ChatGPT. The study results corroborate and extend the findings from automated evaluation, showing that our proposed system is more persuasive and competent in a real-world embodied agent setting.

Related papers

UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations [71.79210031338464]
We show how to unify dense retrieval and response generation for large language models in conversation. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce the inconsistency risks. The evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform the existing baselines.
arXiv Detail & Related papers (2025-07-09T17:02:40Z)
An Outlook on the Opportunities and Challenges of Multi-Agent AI Systems [40.53603737069306]
A multi-agent AI system (MAS) is composed of multiple autonomous agents that interact, exchange information, and make decisions based on internal generative models. This paper outlines a formal framework for analyzing MAS, focusing on two core aspects: effectiveness and safety.
arXiv Detail & Related papers (2025-05-23T22:05:19Z)
Understanding Learner-LLM Chatbot Interactions and the Impact of Prompting Guidelines [9.834055425277874]
This study investigates learner-AI interactions through an educational experiment in which participants receive structured guidance on effective prompting. To assess user behavior and prompting efficacy, we analyze a dataset of 642 interactions from 107 users. Our findings provide a deeper understanding of how users engage with Large Language Models and the role of structured prompting guidance in enhancing AI-assisted communication.
arXiv Detail & Related papers (2025-04-10T15:20:43Z)
Enhancing Explainability with Multimodal Context Representations for Smarter Robots [0.0]
A key issue in Human-Robot Interaction is enabling robots to effectively perceive and reason over multimodal inputs, such as audio and vision. We propose a generalized and explainable multimodal framework for context representation, designed to improve the fusion of speech and vision modalities.
arXiv Detail & Related papers (2025-02-28T13:36:47Z)
Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment [2.9775785740619254]
Large Language Models (LLMs) have demonstrated powerful capabilities that render them valuable in different applications, including conversational AI products. It is paramount to ensure the security and reliability of these products by mitigating their vulnerabilities towards malicious user interactions. We present a study on the efficacy of fine-tuning and aligning Chain-of-Thought (CoT) responses of different LLMs that serve as input moderation guardrails.
arXiv Detail & Related papers (2025-01-22T18:40:57Z)
In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models [97.82118821263825]
Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community. We propose ICER, a novel red-teaming framework that generates interpretable and semantically meaningful problematic prompts. Our work provides crucial insights for developing more robust safety mechanisms in T2I systems.
arXiv Detail & Related papers (2024-11-25T04:17:24Z)
Context-Aware Command Understanding for Tabletop Scenarios [1.7082212774297747]
This paper presents a novel hybrid algorithm designed to interpret natural human commands in tabletop scenarios. By integrating multiple sources of information, including speech, gestures, and scene context, the system extracts actionable instructions for a robot. We discuss the strengths and limitations of the system, with particular focus on how it handles multimodal command interpretation.
arXiv Detail & Related papers (2024-10-08T20:46:39Z)
HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions [76.42274173122328]
We present HAICOSYSTEM, a framework examining AI agent safety within diverse and complex social interactions. We run 1840 simulations based on 92 scenarios across seven domains (e.g., healthcare, finance, education). Our experiments show that state-of-the-art LLMs, both proprietary and open-sourced, exhibit safety risks in over 50% of cases.
arXiv Detail & Related papers (2024-09-24T19:47:21Z)
A Study on Prompt Injection Attack Against LLM-Integrated Mobile Robotic Systems [4.71242457111104]
Large Language Models (LLMs) can process multi-modal prompts, enabling them to generate more context-aware responses. One of the primary concerns is the potential security risks associated with using LLMs in robotic navigation tasks. This study investigates the impact of prompt injections on mobile robot performance in LLM-integrated systems.
arXiv Detail & Related papers (2024-08-07T02:48:22Z)
An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction. We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
A Survey on Context-Aware Multi-Agent Systems: Techniques, Challenges and Future Directions [1.1458366773578277]
Research interest in autonomous agents is on the rise as an emerging topic. The challenge lies in enabling these agents to learn, reason, and navigate uncertainties in dynamic environments. Context awareness emerges as a pivotal element in fortifying multi-agent systems.
arXiv Detail & Related papers (2024-02-03T00:27:22Z)
Empowering Autonomous Driving with Large Language Models: A Safety Perspective [82.90376711290808]
This paper explores the integration of Large Language Models (LLMs) into Autonomous Driving systems. LLMs are intelligent decision-makers in behavioral planning, augmented with a safety verifier shield for contextual safety learning. We present two key studies in a simulated environment: an adaptive LLM-conditioned Model Predictive Control (MPC) and an LLM-enabled interactive behavior planning scheme with a state machine.
arXiv Detail & Related papers (2023-11-28T03:13:09Z)
Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts. This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals. We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
Realistic simulation of users for IT systems in cyber ranges [63.20765930558542]
We instrument each machine by means of an external agent to generate user activity. This agent combines both deterministic and deep-learning-based methods to adapt to different environments. We also propose conditional text generation models to facilitate the creation of conversations and documents.
arXiv Detail & Related papers (2021-11-23T10:53:29Z)
Open-Ended Multi-Modal Relational Reasoning for Video Question Answering [1.8699569122464073]
The primary focus of this agent is to assist individuals using language-based interactions within video-based scenes. Our proposed method integrates video recognition technology and natural language processing models within the robotic agent.
arXiv Detail & Related papers (2020-12-01T20:49:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.