Language-Conditioned Observation Models for Visual Object Search
- URL: http://arxiv.org/abs/2309.07276v1
- Date: Wed, 13 Sep 2023 19:30:53 GMT
- Title: Language-Conditioned Observation Models for Visual Object Search
- Authors: Thao Nguyen, Vladislav Hrosinkov, Eric Rosen, Stefanie Tellex
- Abstract summary: We bridge the gap in realistic object search by posing the problem as a partially observable Markov decision process (POMDP).
We incorporate the neural network's outputs into our language-conditioned observation model (LCOM) to represent dynamically changing sensor noise.
We demonstrate our method on a Boston Dynamics Spot robot, enabling it to handle complex natural language object descriptions and efficiently find objects in a room-scale environment.
- Score: 12.498575839909334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object search is a challenging task because when given complex language
descriptions (e.g., "find the white cup on the table"), the robot must move its
camera through the environment and recognize the described object. Previous
works map language descriptions to a set of fixed object detectors with
predetermined noise models, but these approaches are challenging to scale
because new detectors need to be made for each object. In this work, we bridge
the gap in realistic object search by posing the search problem as a partially
observable Markov decision process (POMDP) where the object detector and visual
sensor noise in the observation model are determined by a single deep neural
network conditioned on complex language descriptions. We incorporate the neural
network's outputs into our language-conditioned observation model (LCOM) to
represent dynamically changing sensor noise. With an LCOM, any language
description of an object can be used to generate an appropriate object detector
and noise model, and training an LCOM only requires readily available
supervised image-caption datasets. We empirically evaluate our method by
comparing against a state-of-the-art object search algorithm in simulation, and
demonstrate that planning with our observation model yields a significantly
higher average task completion rate (from 0.46 to 0.66) and faster, more
efficient object search than a fixed-noise model. We demonstrate our method
on a Boston Dynamics Spot robot, enabling it to handle complex natural language
object descriptions and efficiently find objects in a room-scale environment.
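To make the formulation concrete, here is a minimal sketch, assuming a discrete set of candidate locations and a stubbed detector network, of how a language-conditioned observation model could feed a POMDP belief update. The function names and the noise-rate heuristic are illustrative, not the paper's implementation.
```python
# Hypothetical sketch of a language-conditioned observation model (LCOM)
# inside a POMDP belief update. The detector network is a stub; in the paper
# it is a single deep neural network conditioned on the language description.
from dataclasses import dataclass

@dataclass
class DetectorParams:
    true_positive_rate: float   # P(detect | object in view)
    false_positive_rate: float  # P(detect | object not in view)

def language_conditioned_detector(description: str) -> DetectorParams:
    """Stub for the neural network: maps a free-form description to
    detection noise parameters. A real LCOM would run a vision-language
    model on camera frames instead of returning fixed rates."""
    # Made-up heuristic: longer, more specific descriptions are harder.
    specificity = min(len(description.split()) / 10.0, 1.0)
    return DetectorParams(true_positive_rate=0.9 - 0.2 * specificity,
                          false_positive_rate=0.05 + 0.05 * specificity)

def observation_likelihood(detected: bool, object_in_view: bool,
                           params: DetectorParams) -> float:
    """P(z | s, a) for a binary 'detected' observation."""
    if object_in_view:
        return params.true_positive_rate if detected else 1 - params.true_positive_rate
    return params.false_positive_rate if detected else 1 - params.false_positive_rate

def update_belief(belief: dict, detected: bool, in_view: set,
                  params: DetectorParams) -> dict:
    """Exact Bayes filter over discrete candidate object locations."""
    posterior = {loc: p * observation_likelihood(detected, loc in in_view, params)
                 for loc, p in belief.items()}
    total = sum(posterior.values())
    return {loc: p / total for loc, p in posterior.items()}

# Usage: uniform prior over four cells; camera currently sees "table".
params = language_conditioned_detector("the white cup on the table")
belief = {loc: 0.25 for loc in ["table", "shelf", "counter", "floor"]}
belief = update_belief(belief, detected=True, in_view={"table"}, params=params)
print(belief)  # probability mass shifts toward "table"
```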
Related papers
- ICGNet: A Unified Approach for Instance-Centric Grasping [42.92991092305974]
We introduce an end-to-end architecture for object-centric grasping.
We show the effectiveness of the proposed method by extensively evaluating it against state-of-the-art methods on synthetic datasets.
arXiv Detail & Related papers (2024-01-18T12:41:41Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
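As a rough illustration of this interactive loop (not the paper's code), the sketch below has a stand-in LLM propose actions from the interaction history and folds each observation back into the prompt; `query_llm` and `execute` are hypothetical stubs.
```python
# Illustrative LLM-in-the-loop planning cycle for a partially observable
# task: the LLM proposes the next action from the history so far, the robot
# executes it, and the observation is appended for the next round.
def query_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., an API request)."""
    return "look_in(drawer_1)"  # placeholder response

def execute(action: str) -> str:
    """Stand-in for robot execution; returns a textual observation."""
    return "drawer_1 is empty"

def interactive_plan(task: str, max_steps: int = 3) -> list:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = query_llm("\n".join(history) + "\nNext action:")
        if action == "done":
            break
        observation = execute(action)
        history.append(f"Action: {action}\nObservation: {observation}")
    return history

print(interactive_plan("find the mug"))
```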
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- Graphical Object-Centric Actor-Critic [55.2480439325792]
We propose a novel object-centric reinforcement learning algorithm combining actor-critic and model-based approaches.
We use a transformer encoder to extract object representations and graph neural networks to approximate the dynamics of an environment.
Our algorithm performs better in a visually complex 3D robotic environment and a 2D environment with compositional structure than the state-of-the-art model-free actor-critic algorithm.
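A minimal PyTorch sketch of the two pieces this summary names, under the assumption of fixed-size object slots: a transformer encoder producing per-object embeddings, and one round of message passing as a stand-in for the graph-network dynamics model. Dimensions and layer counts are illustrative.
```python
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    """Transformer encoder over a set of object feature vectors."""
    def __init__(self, dim=64, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, object_features):        # (batch, n_objects, dim)
        return self.encoder(object_features)   # contextualized object slots

class GraphDynamics(nn.Module):
    """One message-passing round over a fully connected object graph,
    predicting next-step object states."""
    def __init__(self, dim=64):
        super().__init__()
        self.message = nn.Linear(2 * dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, slots):                  # (batch, n, dim)
        b, n, d = slots.shape
        src = slots.unsqueeze(2).expand(b, n, n, d)  # sender i, broadcast over j
        dst = slots.unsqueeze(1).expand(b, n, n, d)  # receiver j
        messages = self.message(torch.cat([src, dst], dim=-1)).sum(dim=1)
        return self.update(torch.cat([slots, messages], dim=-1))

slots = ObjectEncoder()(torch.randn(2, 5, 64))  # 2 scenes, 5 objects each
next_slots = GraphDynamics()(slots)
print(next_slots.shape)  # torch.Size([2, 5, 64])
```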
arXiv Detail & Related papers (2023-10-26T06:05:12Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels.
Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions.
We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)
- IFOR: Iterative Flow Minimization for Robotic Object Rearrangement [92.97142696891727]
IFOR, Iterative Flow Minimization for Robotic Object Rearrangement, is an end-to-end method for rearranging unknown objects.
We show that our method applies to cluttered scenes and to the real world, while training only on synthetic data.
arXiv Detail & Related papers (2022-02-01T20:03:56Z)
- Towards Optimal Correlational Object Search [25.355936023640506]
The Correlational Object Search POMDP can be solved to produce search strategies that exploit correlational information.
We conduct experiments using AI2-THOR, a realistic simulator of household environments, as well as YOLOv5, a widely-used object detector.
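The core idea can be sketched as a Bayes update in which detecting a correlated object reweights the belief over the target's location; the correlation function below is invented for the example and is not the paper's learned model.
```python
# Detecting a correlated object (e.g., a monitor) updates the belief over
# the target's location (e.g., a keyboard) through a pairwise spatial
# correlation term. The grid and correlation table are made up.
def correlational_update(target_belief, detected_at, correlation):
    """P(target at x | correlated object at y) ∝ P(target at x) * corr(x, y)."""
    posterior = {x: p * correlation(x, detected_at)
                 for x, p in target_belief.items()}
    z = sum(posterior.values())
    return {x: p / z for x, p in posterior.items()}

def near(x, y):
    """Illustrative correlation: high weight within Manhattan distance 1."""
    return 3.0 if abs(x[0] - y[0]) + abs(x[1] - y[1]) <= 1 else 0.5

belief = {(i, j): 1 / 9 for i in range(3) for j in range(3)}  # uniform prior
belief = correlational_update(belief, detected_at=(1, 1), correlation=near)
print(belief[(1, 1)], belief[(0, 0)])  # cells near the detection gain mass
```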
arXiv Detail & Related papers (2021-10-19T14:03:43Z)
- Pix2seq: A Language Modeling Framework for Object Detection [12.788663431798588]
Pix2Seq is a simple and generic framework for object detection.
We train a neural net to perceive the image and generate the desired sequence.
Our approach is based mainly on the intuition that if a neural net knows about where and what the objects are, we just need to teach it how to read them out.
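The formulation is easy to sketch: quantize box coordinates into discrete bins and append a class token, so each detection becomes a short token sequence for a language model to emit. The bin count and vocabulary layout below are illustrative rather than the paper's exact configuration.
```python
# Pix2seq-style tokenization: a bounding box plus class label becomes
# 5 discrete tokens, so detection reduces to sequence generation.
NUM_BINS = 1000  # coordinate quantization bins (illustrative)

def box_to_tokens(box, label_id, image_size):
    """box = (ymin, xmin, ymax, xmax) in pixels -> 5 discrete tokens."""
    h, w = image_size
    norms = (box[0] / h, box[1] / w, box[2] / h, box[3] / w)
    coord_tokens = [min(int(c * NUM_BINS), NUM_BINS - 1) for c in norms]
    return coord_tokens + [NUM_BINS + label_id]  # class ids follow coord bins

def tokens_to_box(tokens, image_size):
    """Invert the quantization back to pixel coordinates and a label id."""
    h, w = image_size
    ymin, xmin, ymax, xmax = (t / NUM_BINS for t in tokens[:4])
    return (ymin * h, xmin * w, ymax * h, xmax * w), tokens[4] - NUM_BINS

seq = box_to_tokens((120, 40, 360, 200), label_id=7, image_size=(480, 640))
print(seq)                        # [250, 62, 750, 312, 1007]
print(tokens_to_box(seq, (480, 640)))
```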
arXiv Detail & Related papers (2021-09-22T17:26:36Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
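The integration pattern can be sketched as follows, with every learned module replaced by a stub: grounding scores act as observations, and the policy chooses between asking a clarifying question and grasping. All function names here are hypothetical, not the system's API.
```python
# Stubbed modules feeding a POMDP-style decision: ask when uncertain,
# grasp when the grounding belief is confident.
def ground(expr, objects):
    """Stub visual grounding: word-overlap score per candidate object.
    The real system uses a trained grounding network."""
    words = set(expr.split()) - {"the"}
    return {o: len(words & set(o.split())) / len(words) for o in objects}

def ask_question(scores):
    """Stub question generation about the most likely candidate."""
    return "Do you mean the " + max(scores, key=scores.get) + "?"

def decide(scores, confidence_threshold=0.5):
    best = max(scores, key=scores.get)
    if scores[best] >= confidence_threshold:
        return ("grasp", best)
    return ("ask", ask_question(scores))

scores = ground("the red cup", ["red cup", "blue cup", "red bowl"])
print(decide(scores))  # ('grasp', 'red cup') once belief is confident
```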
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- Real-Time Object Detection and Recognition on Low-Compute Humanoid Robots using Deep Learning [0.12599533416395764]
We describe a novel architecture that enables multiple low-compute NAO robots to perform real-time detection, recognition, and localization of objects in their camera views.
The proposed algorithm for object detection and localization is an empirical modification of YOLOv3, based on indoor experiments in multiple scenarios.
The architecture also comprises an effective end-to-end pipeline to feed the real-time frames from the camera feed to the neural net and use its results for guiding the robot.
arXiv Detail & Related papers (2020-01-20T05:24:58Z)
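A sketch of this kind of frame-to-action pipeline, with the camera, detector, and robot interfaces all stubbed (none of this is the NAO or YOLOv3 API): detections from each frame steer the robot toward the object.
```python
import time

def get_frame():
    return "frame"                          # stand-in for a camera image

def detect(frame):
    """Stub detector: returns (label, normalized x-center) detections."""
    return [("ball", 0.72)]

def steer(x_center, deadband=0.1):
    """Turn toward the detection; x_center in [0, 1], image center at 0.5."""
    offset = x_center - 0.5
    if abs(offset) < deadband:
        return "walk_forward"
    return "turn_right" if offset > 0 else "turn_left"

for _ in range(3):                          # a few control ticks
    detections = detect(get_frame())
    if detections:
        label, x = detections[0]
        print(label, steer(x))              # e.g. "ball turn_right"
    time.sleep(0.05)                        # ~20 Hz control loop
```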