RREx-BoT: Remote Referring Expressions with a Bag of Tricks
- URL: http://arxiv.org/abs/2301.12614v1
- Date: Mon, 30 Jan 2023 02:19:19 GMT
- Title: RREx-BoT: Remote Referring Expressions with a Bag of Tricks
- Authors: Gunnar A. Sigurdsson, Jesse Thomason, Gaurav S. Sukhatme, Robinson
Piramuthu
- Abstract summary: We show how a vision-language scoring model can be used to locate objects in unobserved environments.
We demonstrate our model on a real-world TurtleBot platform, highlighting the simplicity and usefulness of the approach.
Our analysis outlines a "bag of tricks" essential for accomplishing this task, from utilizing 3d coordinates and context, to generalizing vision-language models to large 3d search spaces.
- Score: 19.036557405184656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Household robots operate in the same space for years. Such robots
incrementally build dynamic maps that can be used for tasks requiring remote
object localization. However, benchmarks in robot learning often test
generalization through inference on tasks in unobserved environments. In an
observed environment, locating an object is reduced to choosing from among all
object proposals in the environment, which may number in the 100,000s. Armed
with this intuition, using only a generic vision-language scoring model with
minor modifications for 3d encoding and operating in an embodied environment,
we demonstrate an absolute performance gain of 9.84% on remote object grounding
above state of the art models for REVERIE and of 5.04% on FAO. When allowed to
pre-explore an environment, we also exceed the previous state of the art
pre-exploration method on REVERIE. Additionally, we demonstrate our model on a
real-world TurtleBot platform, highlighting the simplicity and usefulness of
the approach. Our analysis outlines a "bag of tricks" essential for
accomplishing this task, from utilizing 3d coordinates and context, to
generalizing vision-language models to large 3d search spaces.
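To make the core idea concrete, here is a minimal sketch of exhaustively scoring cached object proposals against a referring expression with a simple 3d positional encoding. It assumes precomputed proposal features and a text embedding, and uses a cosine similarity as a stand-in for the paper's transformer-based vision-language scorer; the function names and the encoding scheme are illustrative, not the published implementation.
```python
# Minimal sketch (not the published implementation): score every cached object
# proposal in a mapped environment against a referring expression. A cosine
# similarity over concatenated visual features and a 3d positional encoding
# stands in for the transformer-based vision-language scorer used in the paper.
import numpy as np

def encode_xyz(xyz, dims=24, scale=10.0):
    """Illustrative sinusoidal encoding of a 3d coordinate."""
    freqs = scale ** (np.arange(dims // 6) / (dims // 6))
    angles = np.outer(xyz, freqs)                        # (3, dims // 6)
    return np.concatenate([np.sin(angles).ravel(), np.cos(angles).ravel()])

def score_proposals(text_emb, visual_feats, coords):
    """Similarity of each proposal (visual feature + 3d encoding) to the expression."""
    pos = np.stack([encode_xyz(c) for c in coords])      # (N, 24)
    fused = np.concatenate([visual_feats, pos], axis=1)  # (N, D + 24)
    fused /= np.linalg.norm(fused, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb)
    return fused @ text                                  # one matrix product for all proposals

# Toy usage with random features; exhaustive scoring stays cheap even for
# proposal sets in the 100,000s because it reduces to a single matrix product.
rng = np.random.default_rng(0)
n, d = 10_000, 512
visual = rng.normal(size=(n, d))
coords = rng.uniform(-20.0, 20.0, size=(n, 3))
expression = rng.normal(size=d + 24)   # placeholder for a text embedding of matching width
best = int(np.argmax(score_proposals(expression, visual, coords)))
print("best proposal index:", best)
```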
Related papers
- PickScan: Object discovery and reconstruction from handheld interactions [99.99566882133179]
We develop an interaction-guided and class-agnostic method to reconstruct 3D representations of scenes.
Our main contribution is a novel approach to detecting user-object interactions and extracting the masks of manipulated objects.
Compared to Co-Fusion, the only comparable interaction-based and class-agnostic baseline, this corresponds to a reduction in chamfer distance of 73%.
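For reference, the chamfer distance that the 73% reduction is reported in can be computed as below; this is one common symmetric formulation between two point clouds, not PickScan's evaluation code.
```python
# Reference implementation of a symmetric chamfer distance between two point
# clouds (one common formulation; not PickScan's evaluation code).
import numpy as np

def chamfer_distance(a, b):
    """Mean nearest-neighbor distance from a to b plus from b to a."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (|a|, |b|) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

reconstruction = np.random.rand(500, 3)   # reconstructed surface samples
ground_truth = np.random.rand(600, 3)     # ground-truth surface samples
print("chamfer distance:", chamfer_distance(reconstruction, ground_truth))
```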
arXiv Detail & Related papers (2024-11-17T23:09:08Z)
- M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes [66.44171200767839]
We propose M3Bench, a new benchmark of whole-body motion generation for mobile manipulation tasks.
M3Bench requires an embodied agent to understand its configuration, environmental constraints and task objectives.
M3Bench features 30k object rearrangement tasks across 119 diverse scenes, providing expert demonstrations generated by our newly developed M3BenchMaker.
arXiv Detail & Related papers (2024-10-09T08:38:21Z)
- ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments [13.988804095409133]
We propose the ReALFRED benchmark that employs real-world scenes, objects, and room layouts to train agents to complete household tasks.
Specifically, we extend the ALFRED benchmark with updates for larger environmental spaces with smaller visual domain gaps.
With ReALFRED, we analyze previously crafted methods for the ALFRED benchmark and observe that they consistently yield lower performance in all metrics.
arXiv Detail & Related papers (2024-07-26T07:00:27Z)
- Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps [16.083092305930844]
Open-Vocabulary Mobile Manipulation (OVMM) is a crucial capability for autonomous robots.
We propose a novel framework that leverages the zero-shot detection and grounded recognition capabilities of pre-trained vision-language models.
We build a 10-DoF mobile manipulation robotic platform, JSR-1, and demonstrate the framework in real-world robot experiments.
arXiv Detail & Related papers (2024-06-26T07:06:42Z)
- Lifelong Ensemble Learning based on Multiple Representations for Few-Shot Object Recognition [6.282068591820947]
We present a lifelong ensemble learning approach based on multiple representations to address the few-shot object recognition problem.
To facilitate lifelong learning, each approach is equipped with a memory unit for storing and retrieving object information instantly.
We have performed extensive sets of experiments to assess the performance of the proposed approach in offline, and open-ended scenarios.
arXiv Detail & Related papers (2022-05-04T10:29:10Z)
- Object Manipulation via Visual Target Localization [64.05939029132394]
Training agents to manipulate objects poses many challenges.
We propose an approach that explores the environment in search of target objects, computes their 3D coordinates once they are located, and then continues to estimate their 3D locations even when the objects are not visible.
Our evaluations show a massive 3x improvement in success rate over a model that has access to the same sensory suite.
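As a rough illustration of the "keep estimating when not visible" step, a stored world-frame target position can be re-expressed in the agent's current frame from odometry alone. This geometric sketch is a stand-in for the learned estimator described in the paper; the pose representation is simplified to a planar yaw-only pose.
```python
# Sketch of persistent target localization: once an object's world-frame
# position is known, it can be re-expressed in the agent's egocentric frame
# from odometry alone, even when the object is out of view. A geometric
# stand-in, not the learned estimator used in the paper.
import numpy as np

def world_to_agent(p_world, agent_xy, agent_yaw):
    """Express a world-frame 3d point in the agent's frame (z-up, yaw-only pose)."""
    c, s = np.cos(agent_yaw), np.sin(agent_yaw)
    rot = np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])  # world -> agent rotation
    offset = p_world - np.array([agent_xy[0], agent_xy[1], 0.0])
    return rot @ offset

target_world = np.array([4.0, 2.0, 0.7])        # stored when the target was last seen
pose = {"xy": (1.0, 0.5), "yaw": np.pi / 2}     # current odometry estimate
print("target in agent frame:", world_to_agent(target_world, pose["xy"], pose["yaw"]))
```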
arXiv Detail & Related papers (2022-03-15T17:59:01Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
- Maintaining a Reliable World Model using Action-aware Perceptual Anchoring [4.971403153199917]
There is a need for robots to maintain a model of their surroundings even when objects go out of view.
This requires anchoring perceptual information onto symbols that represent the objects in the environment.
We present a model for action-aware perceptual anchoring that enables robots to track objects in a persistent manner.
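A toy sketch of the anchoring idea: each symbol keeps an anchor that is updated from perception while the object is visible and from the robot's own actions while it is not. The data structure and update rules below are illustrative, not the paper's system.
```python
# Illustrative perceptual anchoring: an anchor ties a symbol to a believed
# position, updated from perception when seen and from actions (pick/place)
# when unseen. Not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Anchor:
    symbol: str            # e.g. "mug-1"
    position: tuple        # last believed (x, y, z)
    last_seen: float       # timestamp of last perceptual update
    held: bool = False     # whether the robot is currently holding the object

def perceive(anchor: Anchor, position: tuple, t: float) -> None:
    anchor.position, anchor.last_seen = position, t

def act(anchor: Anchor, action: str, gripper_position: tuple) -> None:
    # Action-aware update: a held object moves with the gripper even when unseen.
    if action == "pick":
        anchor.held = True
    if action == "place":
        anchor.held = False
    if anchor.held:
        anchor.position = gripper_position

mug = Anchor("mug-1", (0.4, 0.1, 0.8), last_seen=0.0)
act(mug, "pick", gripper_position=(0.4, 0.1, 0.9))
act(mug, "move", gripper_position=(0.0, 0.5, 0.9))   # out of view, tracked via the action
print(mug)
```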
arXiv Detail & Related papers (2021-07-07T06:35:14Z)
- Rapid Exploration for Open-World Navigation with Latent Goal Models [78.45339342966196]
We describe a robotic learning system for autonomous exploration and navigation in diverse, open-world environments.
At the core of our method is a learned latent variable model of distances and actions, along with a non-parametric topological memory of images.
We use an information bottleneck to regularize the learned policy, giving us (i) a compact visual representation of goals, (ii) improved generalization capabilities, and (iii) a mechanism for sampling feasible goals for exploration.
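The information-bottleneck term can be illustrated with a schematic objective: a goal feature is encoded into a compact Gaussian latent, and a KL penalty to a fixed prior limits how much goal information the policy can use. This is a hand-written stand-in, not RECON's training code; the task loss and encoder weights below are placeholders.
```python
# Schematic information-bottleneck objective: encode a goal feature into a
# compact Gaussian latent and penalize KL to a standard-normal prior.
# Hand-written stand-in, not RECON's training code.
import numpy as np

rng = np.random.default_rng(0)
W_mu = rng.normal(size=(16, 128)) * 0.01       # placeholder encoder weights
W_logvar = rng.normal(size=(16, 128)) * 0.01

def encode_goal(goal_feat):
    """Map a goal image feature to a diagonal Gaussian over a 16-d latent."""
    return W_mu @ goal_feat, W_logvar @ goal_feat

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), the bottleneck term."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

goal_feat = rng.normal(size=128)
mu, logvar = encode_goal(goal_feat)
z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)  # sampled latent goal
task_loss = np.mean(z ** 2)                                 # stand-in for distance/action losses
beta = 0.01                                                 # bottleneck strength
print("total objective (stand-in):", task_loss + beta * kl_to_standard_normal(mu, logvar))
```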
arXiv Detail & Related papers (2021-04-12T23:14:41Z)
- Supervised Training of Dense Object Nets using Optimal Descriptors for Industrial Robotic Applications [57.87136703404356]
Dense Object Nets (DONs) by Florence, Manuelli and Tedrake introduced dense object descriptors as a novel visual object representation for the robotics community.
In this paper we show that given a 3D model of an object, we can generate its descriptor space image, which allows for supervised training of DONs.
We compare the training methods on generating 6D grasps for industrial objects and show that our novel supervised training approach improves the pick-and-place performance in industry-relevant tasks.
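A minimal sketch of the supervised setup: given a rendered descriptor-space image as a per-pixel target, the loss regresses the predicted descriptors toward it over the object mask. How the target descriptors are generated from the 3D model is the paper's contribution and is not reproduced here.
```python
# Illustrative supervised objective for dense descriptors: regress the network's
# per-pixel descriptors toward a "descriptor space image" rendered from the
# object's 3d model. Target generation from the 3d model is not shown.
import numpy as np

def descriptor_loss(predicted, target, object_mask):
    """Mean squared descriptor distance over object pixels."""
    per_pixel = ((predicted - target) ** 2).sum(axis=-1)   # (H, W) squared distances
    return per_pixel[object_mask].mean()

H, W, D = 64, 64, 3
rng = np.random.default_rng(0)
predicted = rng.normal(size=(H, W, D))   # stand-in for a network's output
target = rng.normal(size=(H, W, D))      # descriptor space image rendered from the 3d model
mask = np.zeros((H, W), dtype=bool)
mask[16:48, 16:48] = True                # pixels belonging to the object
print("supervised descriptor loss:", descriptor_loss(predicted, target, mask))
```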
arXiv Detail & Related papers (2021-02-16T11:40:12Z)