Language-Conditioned Change-point Detection to Identify Sub-Tasks in
Robotics Domains
- URL: http://arxiv.org/abs/2309.00743v1
- Date: Fri, 1 Sep 2023 21:40:34 GMT
- Title: Language-Conditioned Change-point Detection to Identify Sub-Tasks in
Robotics Domains
- Authors: Divyanshu Raj, Chitta Baral, Nakul Gopalan
- Abstract summary: We identify sub-tasks within a demonstrated robot trajectory using language instructions.
We propose a language-conditioned change-point detection method to identify sub-tasks within a demonstrated trajectory.
- Score: 43.96051384180866
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this work, we present an approach to identify sub-tasks within a
demonstrated robot trajectory using language instructions. We use the language
provided during the demonstrations as guidance to identify these sub-tasks as
sub-segments of a longer robot trajectory. Given a sequence of natural language
instructions and a long trajectory consisting of image frames and discrete
actions, we want to map an instruction to a smaller fragment of the trajectory.
Unlike previous instruction-following works, which directly learn the mapping
from language to a policy, we propose a language-conditioned change-point
detection method to identify sub-tasks within a demonstrated trajectory. Our approach learns the
relationship between constituent segments of a long language command and
corresponding constituent segments of a trajectory. These constituent
trajectory segments can be used to learn sub-tasks or sub-goals for planning, or
options, as demonstrated by previous related work. Our insight in this work is
that the language-conditioned robot change-point detection problem is similar
to the existing video moment retrieval works used to identify sub-segments
within online videos. Through extensive experimentation, we demonstrate a
$1.78_{\pm 0.82}\%$ improvement over a baseline approach in accurately
identifying sub-tasks within a trajectory using our proposed method. Moreover,
we present a comprehensive study of the sample complexity required to learn this
mapping between language and trajectory sub-segments, to understand whether
video retrieval-based methods are realistic in real robot scenarios.
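To make the problem setup concrete, below is a minimal sketch (not the authors' implementation) of how language-conditioned change-point detection can be framed like video moment retrieval: every trajectory timestep is scored as a candidate start or end boundary for a given instruction. The module names, feature dimensions, and toy tensors are illustrative assumptions.

```python
# Hypothetical sketch: score each trajectory timestep as a sub-task boundary
# conditioned on a language instruction. Architecture and sizes are assumptions.
import torch
import torch.nn as nn

class InstructionToSegment(nn.Module):
    """Predicts start/end boundary scores for one instruction over a trajectory."""
    def __init__(self, obs_dim=128, lang_dim=64, hidden=128):
        super().__init__()
        self.traj_enc = nn.GRU(obs_dim, hidden, batch_first=True)   # encode frame/action features
        self.lang_enc = nn.GRU(lang_dim, hidden, batch_first=True)  # encode instruction tokens
        self.start_head = nn.Linear(2 * hidden, 1)  # score: segment starts here
        self.end_head = nn.Linear(2 * hidden, 1)    # score: segment ends here

    def forward(self, traj, instr):
        # traj: (B, T, obs_dim) trajectory features; instr: (B, L, lang_dim) token embeddings
        traj_feats, _ = self.traj_enc(traj)            # (B, T, hidden)
        _, lang_h = self.lang_enc(instr)               # (1, B, hidden)
        lang_h = lang_h[-1].unsqueeze(1).expand(-1, traj_feats.size(1), -1)
        fused = torch.cat([traj_feats, lang_h], dim=-1)  # condition each timestep on language
        start_logits = self.start_head(fused).squeeze(-1)  # (B, T)
        end_logits = self.end_head(fused).squeeze(-1)      # (B, T)
        return start_logits, end_logits

# Toy usage: pick the most likely sub-segment for one instruction.
model = InstructionToSegment()
traj = torch.randn(1, 200, 128)   # 200-step trajectory (hypothetical features)
instr = torch.randn(1, 12, 64)    # 12-token instruction (hypothetical embeddings)
start_logits, end_logits = model(traj, instr)
start = start_logits.argmax(dim=-1).item()
end = end_logits.argmax(dim=-1).item()
print(f"predicted sub-task span: frames {start}..{end}")
```

At training time, such boundary logits could be supervised with ground-truth sub-segment indices from annotated demonstrations, mirroring how moment retrieval models are trained on video-query pairs.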
Related papers
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- Language-driven Grasp Detection with Mask-guided Attention [10.231956034184265]
We propose a new method for language-driven grasp detection with mask-guided attention.
Our approach integrates visual data, segmentation mask features, and natural language instructions.
Our work introduces a new framework for language-driven grasp detection, paving the way for language-driven robotic applications.
arXiv Detail & Related papers (2024-07-29T10:55:17Z)
- Visual Prompt Selection for In-Context Learning Segmentation [77.15684360470152]
In this paper, we focus on rethinking and improving the example selection strategy.
We first demonstrate that ICL-based segmentation models are sensitive to different contexts.
Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation.
arXiv Detail & Related papers (2024-07-14T15:02:54Z)
- MENTOR: Multilingual tExt detectioN TOward leaRning by analogy [59.37382045577384]
We propose a framework to detect and identify both seen and unseen language regions inside scene images.
"MENTOR" is the first work to realize a learning strategy between zero-shot learning and few-shot learning for multilingual scene text detection.
arXiv Detail & Related papers (2024-03-12T03:35:17Z)
- Remote Task-oriented Grasp Area Teaching By Non-Experts through Interactive Segmentation and Few-Shot Learning [0.0]
A robot must be able to discriminate between different grasping styles depending on the prospective manipulation task.
We propose a novel two-step framework towards this aim.
We receive grasp area demonstrations for a new task via interactive segmentation.
We learn from these few demonstrations to estimate the required grasp area on an unseen scene for the given task.
arXiv Detail & Related papers (2023-03-17T18:09:01Z)
- Find a Way Forward: a Language-Guided Semantic Map Navigator [53.69229615952205]
This paper attacks the problem of language-guided navigation from a new perspective.
We use novel semantic navigation maps, which enable robots to carry out natural language instructions and move to a target position based on the map observations.
The proposed approach has noticeable performance gains, especially in long-distance navigation cases.
arXiv Detail & Related papers (2022-03-07T07:40:33Z)
- A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution [54.385344986265714]
We propose a persistent spatial semantic representation method to bridge the gap between language and robot actions.
We evaluate our approach on the ALFRED benchmark and achieve state-of-the-art results, despite completely avoiding the commonly used step-by-step instructions.
arXiv Detail & Related papers (2021-07-12T17:47:19Z)
- Are We There Yet? Learning to Localize in Embodied Instruction Following [1.7300690315775575]
Action Learning From Realistic Environments and Directives (ALFRED) is a recently proposed benchmark for this problem.
Key challenges for this task include localizing target locations and navigating to them through visual inputs.
We augment the agent's field of view during navigation subgoals with multiple viewing angles, and train the agent to predict its relative spatial relation to the target location at each timestep.
arXiv Detail & Related papers (2021-01-09T21:49:41Z)
- DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z)