A Survey on Improving Human Robot Collaboration through Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2512.00027v1
- Date: Thu, 06 Nov 2025 07:52:56 GMT
- Title: A Survey on Improving Human Robot Collaboration through Vision-and-Language Navigation
- Authors: Nivedan Yakolli, Avinash Gautam, Abhijit Das, Yuankai Qi, Virendra Singh Shekhawat
- Abstract summary: Vision-and-Language Navigation (VLN) is a multi-modal, cooperative task requiring agents to interpret human instructions, navigate 3D environments, and communicate effectively under ambiguity. Despite progress, current models struggle with bidirectional communication, ambiguity resolution, and collaborative decision-making in multi-agent systems. We advocate that future VLN systems should support proactive clarification, real-time feedback, and contextual reasoning through advanced natural language understanding (NLU) techniques.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-and-Language Navigation (VLN) is a multi-modal, cooperative task requiring agents to interpret human instructions, navigate 3D environments, and communicate effectively under ambiguity. This paper presents a comprehensive review of recent VLN advancements in robotics and outlines promising directions to improve multi-robot coordination. Despite progress, current models struggle with bidirectional communication, ambiguity resolution, and collaborative decision-making in multi-agent systems. We review approximately 200 relevant articles to provide an in-depth understanding of the current landscape. Through this survey, we aim to provide a thorough resource that inspires further research at the intersection of VLN and robotics. We advocate that future VLN systems should support proactive clarification, real-time feedback, and contextual reasoning through advanced natural language understanding (NLU) techniques. Additionally, decentralized decision-making frameworks with dynamic role assignment are essential for scalable, efficient multi-robot collaboration. These innovations can significantly enhance human-robot interaction (HRI) and enable real-world deployment in domains such as healthcare, logistics, and disaster response.
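As a concrete illustration of the proactive-clarification behavior advocated above, the sketch below shows an agent that asks a question instead of acting when its instruction grounding is ambiguous. This is a minimal sketch under stated assumptions: the toy keyword matcher and the names `ground`, `ask_user`, and `step` are illustrative placeholders, not components of any surveyed system.

```python
# Minimal sketch of a proactive-clarification loop (hypothetical; the toy
# keyword matcher stands in for a trained cross-modal grounding model).
from dataclasses import dataclass

ACTIONS = {"left": "turn-left", "right": "turn-right", "forward": "move-forward"}

@dataclass
class Grounding:
    action: str        # proposed low-level action
    confidence: float  # agent's belief that the action matches the instruction

def ground(instruction: str) -> Grounding:
    """Toy stand-in for a cross-modal grounding model."""
    hits = [a for k, a in ACTIONS.items() if k in instruction.lower()]
    if len(hits) == 1:
        return Grounding(hits[0], 0.9)                   # unambiguous
    return Grounding(hits[0] if hits else "stop", 0.3)   # ambiguous or unknown

def ask_user(question: str) -> str:
    return input(question + " ")  # a real system would use a dialogue manager

def step(instruction: str, threshold: float = 0.7) -> Grounding:
    g = ground(instruction)
    if g.confidence < threshold:
        # Bidirectional communication: ask rather than guess, then re-ground.
        answer = ask_user(f"I'm unsure; should I '{g.action}'? Please rephrase:")
        g = ground(answer)
    return g

if __name__ == "__main__":
    print(step("go down the hall and then left at the plant"))
```

The confidence threshold is the design lever here: lowering it reduces interruptions to the human, raising it trades user effort for fewer grounding errors.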
Related papers
- Advances and Innovations in the Multi-Agent Robotic System (MARS) Challenge [170.47383225329915]
Multi-agent system frameworks are becoming essential for achieving scalable, efficient, and collaborative solutions. This shift is fueled by three primary factors: increasing agent capabilities, enhancing system efficiency through task delegation, and enabling advanced human-agent interactions. We propose the Multi-Agent Robotic System (MARS) Challenge, held at the NeurIPS 2025 Workshop on SpaVLE.
arXiv Detail & Related papers (2026-01-26T17:56:19Z)
- Intuitive Programming, Adaptive Task Planning, and Dynamic Role Allocation in Human-Robot Collaboration [28.96294944351617]
This review identifies and connects key components enabling intuitive information exchange and skill transfer between humans and robots. We examine the full interaction pipeline, beginning with the human-to-robot communication bridge that translates multimodal inputs into robot-understandable representations.
arXiv Detail & Related papers (2025-11-11T19:49:30Z)
- Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision [49.073964142139495]
We systematically review the applications and advancements of multimodal fusion methods and vision-language models. For semantic scene understanding tasks, we categorize fusion approaches into encoder-decoder frameworks, attention-based architectures, and graph neural networks. We identify key challenges in current research, including cross-modal alignment, efficient fusion, real-time deployment, and domain adaptation.
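To make the attention-based category concrete, here is a minimal cross-attention fusion sketch in PyTorch: instruction tokens (queries) attend over visual region features (keys/values). All dimensions and module choices are illustrative assumptions, not taken from any specific surveyed method.

```python
# Hypothetical cross-attention fusion block: language queries attend to vision.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # text:   (batch, n_tokens,  dim) -- instruction token embeddings
        # vision: (batch, n_regions, dim) -- image region features
        fused, _ = self.attn(query=text, key=vision, value=vision)
        return self.norm(text + fused)  # residual connection, then normalize

fusion = CrossModalFusion()
out = fusion(torch.randn(2, 12, 256), torch.randn(2, 36, 256))
print(out.shape)  # torch.Size([2, 12, 256])
```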
arXiv Detail & Related papers (2025-04-03T10:53:07Z)
- Enhancing Explainability with Multimodal Context Representations for Smarter Robots [0.0]
A key issue in Human-Robot Interaction is enabling robots to effectively perceive and reason over multimodal inputs, such as audio and vision. We propose a generalized and explainable multimodal framework for context representation, designed to improve the fusion of speech and vision modalities.
arXiv Detail & Related papers (2025-02-28T13:36:47Z)
- TalkWithMachines: Enhancing Human-Robot Interaction for Interpretable Industrial Robotics Through Large/Vision Language Models [1.534667887016089]
This paper investigates recent advancements in Large Language Models (LLMs) and Vision Language Models (VLMs) and their integration into robot control. This integration allows robots to understand and execute commands given in natural language and to perceive their environment through visual and/or descriptive inputs. The paper outlines four LLM-assisted simulated robotic control workflows, which explore (i) low-level control, (ii) the generation of language-based feedback that describes the robot's internal states, (iii) the use of visual information as additional input, and (iv) the use of robot structure information for generating task plans and feedback.
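A minimal, hypothetical sketch of this pattern, combining (i) command-to-action mapping with (ii) language-based feedback, is shown below; `call_llm` is a canned stub rather than a real model API, and the JSON action schema is an assumption for illustration.

```python
# Hypothetical LLM-in-the-loop control step: prompt in, structured action out,
# plus a natural-language feedback string for the operator.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned plan for illustration."""
    return json.dumps({"action": "move", "dx": 0.5, "dy": 0.0})

def control_step(command: str, state: dict) -> tuple[dict, str]:
    # The prompt packs the robot's internal state and the operator's command;
    # a real system would also attach visual or structural descriptions.
    prompt = (
        "Robot state: " + json.dumps(state) + "\n"
        "Operator command: " + command + "\n"
        'Reply with JSON {"action": ..., "dx": ..., "dy": ...}.'
    )
    plan = json.loads(call_llm(prompt))
    # Language-based feedback describing what the robot is about to do.
    feedback = f"Moving by ({plan['dx']}, {plan['dy']}) m in response to {command!r}."
    return plan, feedback

if __name__ == "__main__":
    plan, feedback = control_step("go half a meter forward",
                                  {"pose": [0.0, 0.0, 0.0], "battery": 0.8})
    print(plan)
    print(feedback)
```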
arXiv Detail & Related papers (2024-12-19T23:43:40Z)
- Commonsense Reasoning for Legged Robot Adaptation with Vision-Language Models [81.55156507635286]
Legged robots are physically capable of navigating a diverse range of environments and overcoming a wide range of obstructions.
Current learning methods often struggle with generalization to the long tail of unexpected situations without heavy human supervision.
We propose a system, VLM-Predictive Control (VLM-PC), combining two key components that we find to be crucial for eliciting on-the-fly, adaptive behavior selection.
arXiv Detail & Related papers (2024-07-02T21:00:30Z)
- Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions [69.9980759344628]
Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions.
We introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities.
We present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies.
arXiv Detail & Related papers (2024-06-27T15:01:42Z)
- Position: Towards Bidirectional Human-AI Alignment [109.57781720848669]
We argue that the research community should explicitly define and critically reflect on "alignment" to account for the bidirectional and dynamic relationship between humans and AI. We introduce the Bidirectional Human-AI Alignment framework, which not only incorporates traditional efforts to align AI with human values but also introduces the critical, underexplored dimension of aligning humans with AI.
arXiv Detail & Related papers (2024-06-13T16:03:25Z)
- Interactive Natural Language Processing [67.87925315773924]
Interactive Natural Language Processing (iNLP) has emerged as a novel paradigm within the field of NLP.
This paper offers a comprehensive survey of iNLP, starting by proposing a unified definition and framework of the concept.
arXiv Detail & Related papers (2023-05-22T17:18:29Z)
- Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions [23.389491536958772]
Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic toward building embodied agents that navigate based on human instructions.
VLN receives increasing attention from the natural language processing, computer vision, robotics, and machine learning communities.
This paper serves as a thorough reference for the VLN research community.
arXiv Detail & Related papers (2022-03-22T16:58:10Z)