RREx-BoT: Remote Referring Expressions with a Bag of Tricks
- URL: http://arxiv.org/abs/2301.12614v1
- Date: Mon, 30 Jan 2023 02:19:19 GMT
- Title: RREx-BoT: Remote Referring Expressions with a Bag of Tricks
- Authors: Gunnar A. Sigurdsson, Jesse Thomason, Gaurav S. Sukhatme, Robinson
Piramuthu
- Abstract summary: We show how a vision-language scoring model can be used to locate objects in unobserved environments.
We demonstrate our model on a real-world TurtleBot platform, highlighting the simplicity and usefulness of the approach.
Our analysis outlines a "bag of tricks" essential for accomplishing this task, from utilizing 3d coordinates and context, to generalizing vision-language models to large 3d search spaces.
- Score: 19.036557405184656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Household robots operate in the same space for years. Such robots
incrementally build dynamic maps that can be used for tasks requiring remote
object localization. However, benchmarks in robot learning often test
generalization through inference on tasks in unobserved environments. In an
observed environment, locating an object is reduced to choosing from among all
object proposals in the environment, which may number in the 100,000s. Armed
with this intuition, using only a generic vision-language scoring model with
minor modifications for 3d encoding and operating in an embodied environment,
we demonstrate an absolute performance gain of 9.84% on remote object grounding
above state of the art models for REVERIE and of 5.04% on FAO. When allowed to
pre-explore an environment, we also exceed the previous state of the art
pre-exploration method on REVERIE. Additionally, we demonstrate our model on a
real-world TurtleBot platform, highlighting the simplicity and usefulness of
the approach. Our analysis outlines a "bag of tricks" essential for
accomplishing this task, from utilizing 3d coordinates and context, to
generalizing vision-language models to large 3d search spaces.
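To make the core idea concrete, here is a minimal sketch of exhaustively scoring cached object proposals against a referring expression with a simple 3d positional encoding. It assumes precomputed proposal features and a text embedding, and uses a cosine similarity as a stand-in for the paper's transformer-based vision-language scorer; the function names and the encoding scheme are illustrative, not the published implementation.
```python
# Minimal sketch (not the published implementation): score every cached object
# proposal in a mapped environment against a referring expression. A cosine
# similarity over concatenated visual features and a 3d positional encoding
# stands in for the transformer-based vision-language scorer used in the paper.
import numpy as np

def encode_xyz(xyz, dims=24, scale=10.0):
    """Illustrative sinusoidal encoding of a 3d coordinate."""
    freqs = scale ** (np.arange(dims // 6) / (dims // 6))
    angles = np.outer(xyz, freqs)                        # (3, dims // 6)
    return np.concatenate([np.sin(angles).ravel(), np.cos(angles).ravel()])

def score_proposals(text_emb, visual_feats, coords):
    """Similarity of each proposal (visual feature + 3d encoding) to the expression."""
    pos = np.stack([encode_xyz(c) for c in coords])      # (N, 24)
    fused = np.concatenate([visual_feats, pos], axis=1)  # (N, D + 24)
    fused /= np.linalg.norm(fused, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb)
    return fused @ text                                  # one matrix product for all proposals

# Toy usage with random features; exhaustive scoring stays cheap even for
# proposal sets in the 100,000s because it reduces to a single matrix product.
rng = np.random.default_rng(0)
n, d = 10_000, 512
visual = rng.normal(size=(n, d))
coords = rng.uniform(-20.0, 20.0, size=(n, 3))
expression = rng.normal(size=d + 24)   # placeholder for a text embedding of matching width
best = int(np.argmax(score_proposals(expression, visual, coords)))
print("best proposal index:", best)
```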
Related papers
- PickScan: Object discovery and reconstruction from handheld interactions [99.99566882133179]
We develop an interaction-guided and class-agnostic method to reconstruct 3D representations of scenes.
Our main contribution is a novel approach to detecting user-object interactions and extracting the masks of manipulated objects.
Compared to Co-Fusion, the only comparable interaction-based and class-agnostic baseline, this corresponds to a reduction in chamfer distance of 73%.
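For reference, the chamfer distance that the 73% reduction is reported in can be computed as below; this is one common symmetric formulation between two point clouds, not PickScan's evaluation code.
```python
# Reference implementation of a symmetric chamfer distance between two point
# clouds (one common formulation; not PickScan's evaluation code).
import numpy as np

def chamfer_distance(a, b):
    """Mean nearest-neighbor distance from a to b plus from b to a."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (|a|, |b|) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

reconstruction = np.random.rand(500, 3)   # reconstructed surface samples
ground_truth = np.random.rand(600, 3)     # ground-truth surface samples
print("chamfer distance:", chamfer_distance(reconstruction, ground_truth))
```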
arXiv Detail & Related papers (2024-11-17T23:09:08Z)
- M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes [66.44171200767839]
We propose M3Bench, a new benchmark of whole-body motion generation for mobile manipulation tasks.
M3Bench requires an embodied agent to understand its configuration, environmental constraints and task objectives.
M3Bench features 30k object rearrangement tasks across 119 diverse scenes, providing expert demonstrations generated by our newly developed M3BenchMaker.
arXiv Detail & Related papers (2024-10-09T08:38:21Z)
- ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments [13.988804095409133]
We propose the ReALFRED benchmark that employs real-world scenes, objects, and room layouts to train agents to complete household tasks.
Specifically, we extend the ALFRED benchmark with updates for larger environmental spaces with smaller visual domain gaps.
With ReALFRED, we analyze previously crafted methods for the ALFRED benchmark and observe that they consistently yield lower performance in all metrics.
arXiv Detail & Related papers (2024-07-26T07:00:27Z)
- Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps [16.083092305930844]
Open-Vocabulary Mobile Manipulation (OVMM) is a crucial capability for autonomous robots.
We propose a novel framework that leverages the zero-shot detection and grounded recognition capabilities of pre-trained vision-language models.
We build a 10-DoF mobile manipulation robotic platform, JSR-1, and demonstrate the framework in real-world robot experiments.
arXiv Detail & Related papers (2024-06-26T07:06:42Z)
- Lifelong Ensemble Learning based on Multiple Representations for Few-Shot Object Recognition [6.282068591820947]
We present a lifelong ensemble learning approach based on multiple representations to address the few-shot object recognition problem.
To facilitate lifelong learning, each approach is equipped with a memory unit for storing and retrieving object information instantly.
We have performed extensive sets of experiments to assess the performance of the proposed approach in offline, and open-ended scenarios.
arXiv Detail & Related papers (2022-05-04T10:29:10Z)
- Object Manipulation via Visual Target Localization [64.05939029132394]
Training agents to manipulate objects poses many challenges.
We propose an approach that explores the environment in search of target objects, computes their 3D coordinates once they are located, and then continues to estimate their 3D locations even when the objects are not visible.
Our evaluations show a massive 3x improvement in success rate over a model that has access to the same sensory suite.
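As a rough illustration of the "keep estimating when not visible" step, a stored world-frame target position can be re-expressed in the agent's current frame from odometry alone. This geometric sketch is a stand-in for the learned estimator described in the paper; the pose representation is simplified to a planar yaw-only pose.
```python
# Sketch of persistent target localization: once an object's world-frame
# position is known, it can be re-expressed in the agent's egocentric frame
# from odometry alone, even when the object is out of view. A geometric
# stand-in, not the learned estimator used in the paper.
import numpy as np

def world_to_agent(p_world, agent_xy, agent_yaw):
    """Express a world-frame 3d point in the agent's frame (z-up, yaw-only pose)."""
    c, s = np.cos(agent_yaw), np.sin(agent_yaw)
    rot = np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])  # world -> agent rotation
    offset = p_world - np.array([agent_xy[0], agent_xy[1], 0.0])
    return rot @ offset

target_world = np.array([4.0, 2.0, 0.7])        # stored when the target was last seen
pose = {"xy": (1.0, 0.5), "yaw": np.pi / 2}     # current odometry estimate
print("target in agent frame:", world_to_agent(target_world, pose["xy"], pose["yaw"]))
```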
arXiv Detail & Related papers (2022-03-15T17:59:01Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
- Maintaining a Reliable World Model using Action-aware Perceptual Anchoring [4.971403153199917]
There is a need for robots to maintain a model of their surroundings even when objects go out of view.
This requires anchoring perceptual information onto symbols that represent the objects in the environment.
We present a model for action-aware perceptual anchoring that enables robots to track objects in a persistent manner.
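A toy sketch of the anchoring idea: each symbol keeps an anchor that is updated from perception while the object is visible and from the robot's own actions while it is not. The data structure and update rules below are illustrative, not the paper's system.
```python
# Illustrative perceptual anchoring: an anchor ties a symbol to a believed
# position, updated from perception when seen and from actions (pick/place)
# when unseen. Not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Anchor:
    symbol: str            # e.g. "mug-1"
    position: tuple        # last believed (x, y, z)
    last_seen: float       # timestamp of last perceptual update
    held: bool = False     # whether the robot is currently holding the object

def perceive(anchor: Anchor, position: tuple, t: float) -> None:
    anchor.position, anchor.last_seen = position, t

def act(anchor: Anchor, action: str, gripper_position: tuple) -> None:
    # Action-aware update: a held object moves with the gripper even when unseen.
    if action == "pick":
        anchor.held = True
    if action == "place":
        anchor.held = False
    if anchor.held:
        anchor.position = gripper_position

mug = Anchor("mug-1", (0.4, 0.1, 0.8), last_seen=0.0)
act(mug, "pick", gripper_position=(0.4, 0.1, 0.9))
act(mug, "move", gripper_position=(0.0, 0.5, 0.9))   # out of view, tracked via the action
print(mug)
```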
arXiv Detail & Related papers (2021-07-07T06:35:14Z)
- Rapid Exploration for Open-World Navigation with Latent Goal Models [78.45339342966196]
We describe a robotic learning system for autonomous exploration and navigation in diverse, open-world environments.
At the core of our method is a learned latent variable model of distances and actions, along with a non-parametric topological memory of images.
We use an information bottleneck to regularize the learned policy, giving us (i) a compact visual representation of goals, (ii) improved generalization capabilities, and (iii) a mechanism for sampling feasible goals for exploration.
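The information-bottleneck term can be illustrated with a schematic objective: a goal feature is encoded into a compact Gaussian latent, and a KL penalty to a fixed prior limits how much goal information the policy can use. This is a hand-written stand-in, not RECON's training code; the task loss and encoder weights below are placeholders.
```python
# Schematic information-bottleneck objective: encode a goal feature into a
# compact Gaussian latent and penalize KL to a standard-normal prior.
# Hand-written stand-in, not RECON's training code.
import numpy as np

rng = np.random.default_rng(0)
W_mu = rng.normal(size=(16, 128)) * 0.01       # placeholder encoder weights
W_logvar = rng.normal(size=(16, 128)) * 0.01

def encode_goal(goal_feat):
    """Map a goal image feature to a diagonal Gaussian over a 16-d latent."""
    return W_mu @ goal_feat, W_logvar @ goal_feat

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), the bottleneck term."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

goal_feat = rng.normal(size=128)
mu, logvar = encode_goal(goal_feat)
z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)  # sampled latent goal
task_loss = np.mean(z ** 2)                                 # stand-in for distance/action losses
beta = 0.01                                                 # bottleneck strength
print("total objective (stand-in):", task_loss + beta * kl_to_standard_normal(mu, logvar))
```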
arXiv Detail & Related papers (2021-04-12T23:14:41Z)
- Supervised Training of Dense Object Nets using Optimal Descriptors for Industrial Robotic Applications [57.87136703404356]
Dense Object Nets (DONs) by Florence, Manuelli and Tedrake introduced dense object descriptors as a novel visual object representation for the robotics community.
In this paper we show that given a 3D model of an object, we can generate its descriptor space image, which allows for supervised training of DONs.
We compare the training methods on generating 6D grasps for industrial objects and show that our novel supervised training approach improves the pick-and-place performance in industry-relevant tasks.
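A minimal sketch of the supervised setup: given a rendered descriptor-space image as a per-pixel target, the loss regresses the predicted descriptors toward it over the object mask. How the target descriptors are generated from the 3D model is the paper's contribution and is not reproduced here.
```python
# Illustrative supervised objective for dense descriptors: regress the network's
# per-pixel descriptors toward a "descriptor space image" rendered from the
# object's 3d model. Target generation from the 3d model is not shown.
import numpy as np

def descriptor_loss(predicted, target, object_mask):
    """Mean squared descriptor distance over object pixels."""
    per_pixel = ((predicted - target) ** 2).sum(axis=-1)   # (H, W) squared distances
    return per_pixel[object_mask].mean()

H, W, D = 64, 64, 3
rng = np.random.default_rng(0)
predicted = rng.normal(size=(H, W, D))   # stand-in for a network's output
target = rng.normal(size=(H, W, D))      # descriptor space image rendered from the 3d model
mask = np.zeros((H, W), dtype=bool)
mask[16:48, 16:48] = True                # pixels belonging to the object
print("supervised descriptor loss:", descriptor_loss(predicted, target, mask))
```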
arXiv Detail & Related papers (2021-02-16T11:40:12Z)