RREx-BoT: Remote Referring Expressions with a Bag of Tricks
- URL: http://arxiv.org/abs/2301.12614v1
- Date: Mon, 30 Jan 2023 02:19:19 GMT
- Title: RREx-BoT: Remote Referring Expressions with a Bag of Tricks
- Authors: Gunnar A. Sigurdsson, Jesse Thomason, Gaurav S. Sukhatme, Robinson
Piramuthu
- Abstract summary: We show how a vision-language scoring model can be used to locate objects in unobserved environments.
We demonstrate our model on a real-world TurtleBot platform, highlighting the simplicity and usefulness of the approach.
Our analysis outlines a "bag of tricks" essential for accomplishing this task, from utilizing 3d coordinates and context, to generalizing vision-language models to large 3d search spaces.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Household robots operate in the same space for years. Such robots
incrementally build dynamic maps that can be used for tasks requiring remote
object localization. However, benchmarks in robot learning often test
generalization through inference on tasks in unobserved environments. In an
observed environment, locating an object is reduced to choosing from among all
object proposals in the environment, which may number in the 100,000s. Armed
with this intuition, using only a generic vision-language scoring model with
minor modifications for 3d encoding and operating in an embodied environment,
we demonstrate an absolute performance gain of 9.84% on remote object grounding
above state-of-the-art models for REVERIE and of 5.04% on FAO. When allowed to
pre-explore an environment, we also exceed the previous state-of-the-art
pre-exploration method on REVERIE. Additionally, we demonstrate our model on a
real-world TurtleBot platform, highlighting the simplicity and usefulness of
the approach. Our analysis outlines a "bag of tricks" essential for
accomplishing this task, from utilizing 3d coordinates and context, to
generalizing vision-language models to large 3d search spaces.
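The core idea in the abstract, reducing remote object grounding in an observed environment to scoring every stored object proposal against the referring expression, can be sketched as follows. This is a minimal illustration, not the authors' actual model: the embeddings, the `pos_weight` parameter, and the simple additive 3d prior are all hypothetical stand-ins for the paper's vision-language scorer and 3d encoding.

```python
import numpy as np

def score_proposals(text_emb, proposal_embs, positions, pos_weight=0.1):
    """Rank candidate object proposals against a referring expression.

    text_emb:      (d,) embedding of the referring expression
    proposal_embs: (n, d) visual embeddings, one per object proposal
    positions:     (n, 3) 3d map coordinates of each proposal (toy stand-in
                   for the paper's 3d encoding)
    """
    # Cosine similarity between the expression and every stored proposal.
    t = text_emb / np.linalg.norm(text_emb)
    v = proposal_embs / np.linalg.norm(proposal_embs, axis=1, keepdims=True)
    sim = v @ t
    # Illustrative 3d term: here, mildly prefer proposals near the map origin.
    prior = -pos_weight * np.linalg.norm(positions, axis=1)
    scores = sim + prior
    # The grounding decision is simply the highest-scoring proposal,
    # even when n is in the 100,000s.
    return int(np.argmax(scores)), scores
```

The point of the sketch is the reduction itself: once the environment has been mapped, no search or navigation is needed at inference time, only a single argmax over precomputed proposals.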
Related papers
- Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps [16.083092305930844]
Open-Vocabulary Mobile Manipulation (OVMM) is a crucial capability for autonomous robots.
We propose a novel framework that leverages zero-shot detection and grounded recognition capabilities.
We built a 10-DoF mobile manipulation robotic platform, JSR-1, and demonstrated it in real-world robot experiments.
arXiv Detail & Related papers (2024-06-26T07:06:42Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- Development and evaluation of automated localisation and reconstruction of all fruits on tomato plants in a greenhouse based on multi-view perception and 3D multi-object tracking [0.0]
This paper presents a novel approach for building generic representations in occluded agro-food environments.
It is based on a detection algorithm that generates partial point clouds for each detected object, followed by a 3D multi-object tracking algorithm.
The accuracy of the representation was evaluated in a real-world environment, where successful representation and localisation of tomatoes in tomato plants were achieved.
arXiv Detail & Related papers (2022-11-04T21:51:53Z)
- Lifelong Ensemble Learning based on Multiple Representations for Few-Shot Object Recognition [6.282068591820947]
We present a lifelong ensemble learning approach based on multiple representations to address the few-shot object recognition problem.
To facilitate lifelong learning, each approach is equipped with a memory unit for storing and retrieving object information instantly.
We have performed extensive sets of experiments to assess the performance of the proposed approach in offline and open-ended scenarios.
arXiv Detail & Related papers (2022-05-04T10:29:10Z)
- Object Manipulation via Visual Target Localization [64.05939029132394]
Training agents to manipulate objects poses many challenges.
We propose an approach that explores the environment in search for target objects, computes their 3D coordinates once they are located, and then continues to estimate their 3D locations even when the objects are not visible.
Our evaluations show a massive 3x improvement in success rate over a model that has access to the same sensory suite.
arXiv Detail & Related papers (2022-03-15T17:59:01Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
- Maintaining a Reliable World Model using Action-aware Perceptual Anchoring [4.971403153199917]
There is a need for robots to maintain a model of their surroundings even when objects go out of view and are no longer visible.
This requires anchoring perceptual information onto symbols that represent the objects in the environment.
We present a model for action-aware perceptual anchoring that enables robots to track objects in a persistent manner.
arXiv Detail & Related papers (2021-07-07T06:35:14Z)
- Rapid Exploration for Open-World Navigation with Latent Goal Models [78.45339342966196]
We describe a robotic learning system for autonomous exploration and navigation in diverse, open-world environments.
At the core of our method is a learned latent variable model of distances and actions, along with a non-parametric topological memory of images.
We use an information bottleneck to regularize the learned policy, giving us (i) a compact visual representation of goals, (ii) improved generalization capabilities, and (iii) a mechanism for sampling feasible goals for exploration.
arXiv Detail & Related papers (2021-04-12T23:14:41Z)
- Supervised Training of Dense Object Nets using Optimal Descriptors for Industrial Robotic Applications [57.87136703404356]
Dense Object Nets (DONs) by Florence, Manuelli and Tedrake introduced dense object descriptors as a novel visual object representation for the robotics community.
In this paper we show that given a 3D model of an object, we can generate its descriptor space image, which allows for supervised training of DONs.
We compare the training methods on generating 6D grasps for industrial objects and show that our novel supervised training approach improves the pick-and-place performance in industry-relevant tasks.
arXiv Detail & Related papers (2021-02-16T11:40:12Z)
- Learning to Move with Affordance Maps [57.198806691838364]
The ability to autonomously explore and navigate a physical space is a fundamental requirement for virtually any mobile autonomous agent.
Traditional SLAM-based approaches for exploration and navigation largely focus on leveraging scene geometry.
We show that learned affordance maps can be used to augment traditional approaches for both exploration and navigation, providing significant improvements in performance.
arXiv Detail & Related papers (2020-01-08T04:05:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.