doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation
- URL: http://arxiv.org/abs/2412.05893v1
- Date: Sun, 08 Dec 2024 11:16:47 GMT
- Title: doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation
- Authors: Parthib Roy, Srinivasa Perisetla, Shashank Shriram, Harsha Krishnaswamy, Aryan Keskar, Ross Greer,
- Abstract summary: doScenes is a novel dataset designed to facilitate research on human-vehicle instruction interactions.
DoScenes bridges the gap between instruction and driving response, enabling context-aware and adaptive planning.
- Score: 0.0
- License:
- Abstract: Human-interactive robotic systems, particularly autonomous vehicles (AVs), must effectively integrate human instructions into their motion planning. This paper introduces doScenes, a novel dataset designed to facilitate research on human-vehicle instruction interactions, focusing on short-term directives that directly influence vehicle motion. By annotating multimodal sensor data with natural language instructions and referentiality tags, doScenes bridges the gap between instruction and driving response, enabling context-aware and adaptive planning. Unlike existing datasets that focus on ranking or scene-level reasoning, doScenes emphasizes actionable directives tied to static and dynamic scene objects. This framework addresses limitations in prior research, such as reliance on simulated data or predefined action sets, by supporting nuanced and flexible responses in real-world scenarios. This work lays the foundation for developing learning strategies that seamlessly integrate human instructions into autonomous systems, advancing safe and effective human-vehicle collaboration for vision-language navigation. We make our data publicly available at https://www.github.com/rossgreer/doScenes
Related papers
- Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration [28.825612240280822]
We propose a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control.
Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions.
We then incorporate egocentric visual context through a parameter efficient video-conditioned fine-tuning, enabling context-aware motion generation.
arXiv Detail & Related papers (2025-02-20T18:17:11Z) - Context-Aware Command Understanding for Tabletop Scenarios [1.7082212774297747]
This paper presents a novel hybrid algorithm designed to interpret natural human commands in tabletop scenarios.
By integrating multiple sources of information, including speech, gestures, and scene context, the system extracts actionable instructions for a robot.
We discuss the strengths and limitations of the system, with particular focus on how it handles multimodal command interpretation.
arXiv Detail & Related papers (2024-10-08T20:46:39Z) - DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled in diverse tasks by human commands is a long-term blueprint of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e. how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z) - LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z) - Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z) - NatSGD: A Dataset with Speech, Gestures, and Demonstrations for Robot
Learning in Natural Human-Robot Interaction [19.65778558341053]
Speech-gesture HRI datasets often focus on elementary tasks, like object pointing and pushing.
We introduce NatSGD, a multimodal HRI dataset encompassing human commands through speech and gestures.
We demonstrate its effectiveness in training robots to understand tasks through multimodal human commands.
arXiv Detail & Related papers (2024-03-04T18:02:41Z) - A New Path: Scaling Vision-and-Language Navigation with Synthetic
Instructions and Imitation Learning [70.14372215250535]
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments.
Given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding.
We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory.
The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets.
arXiv Detail & Related papers (2022-10-06T17:59:08Z) - LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language,
Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z) - Reshaping Robot Trajectories Using Natural Language Commands: A Study of
Multi-Modal Data Alignment Using Transformers [33.7939079214046]
We provide a flexible language-based interface for human-robot collaboration.
We take advantage of recent advancements in the field of large language models to encode the user command.
We train the model using imitation learning over a dataset containing robot trajectories modified by language commands.
arXiv Detail & Related papers (2022-03-25T01:36:56Z) - Learning Language-Conditioned Robot Behavior from Offline Data and
Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.