Learning-To-Rank Approach for Identifying Everyday Objects Using a
Physical-World Search Engine
- URL: http://arxiv.org/abs/2312.15844v1
- Date: Tue, 26 Dec 2023 01:40:31 GMT
- Title: Learning-To-Rank Approach for Identifying Everyday Objects Using a
Physical-World Search Engine
- Authors: Kanta Kaneda, Shunya Nagashima, Ryosuke Korekata, Motonari Kambara and
Komei Sugiura
- Abstract summary: We focus on the task of retrieving target objects from open-vocabulary user instructions in a human-in-the-loop setting.
We propose MultiRankIt, which is a novel approach for the learning-to-rank physical objects task.
- Score: 0.8749675983608172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Domestic service robots offer a solution to the increasing demand for daily
care and support. A human-in-the-loop approach that combines automation and
operator intervention is considered to be a realistic approach to their use in
society. Therefore, we focus on the task of retrieving target objects from
open-vocabulary user instructions in a human-in-the-loop setting, which we
define as the learning-to-rank physical objects (LTRPO) task. For example,
given the instruction "Please go to the dining room which has a round table.
Pick up the bottle on it," the model is required to output a ranked list of
target objects that the operator/user can select. In this paper, we propose
MultiRankIt, which is a novel approach for the LTRPO task. MultiRankIt
introduces the Crossmodal Noun Phrase Encoder to model the relationship between
phrases that contain referring expressions and the target bounding box, and the
Crossmodal Region Feature Encoder to model the relationship between the target
object and multiple images of its surrounding contextual environment.
Additionally, we built a new dataset for the LTRPO task that consists of
instructions with complex referring expressions accompanied by real indoor
environmental images that feature various target objects. We validated our
model on the dataset and it outperformed the baseline method in terms of the
mean reciprocal rank and recall@k. Furthermore, we conducted physical
experiments in a setting where a domestic service robot retrieved everyday
objects in a standardized domestic environment, based on users' instructions in
a human-in-the-loop setting. The experimental results demonstrate that the
success rate for object retrieval reached 80%. Our code is available at
https://github.com/keio-smilab23/MultiRankIt.
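The abstract evaluates ranked lists of candidate objects with mean reciprocal rank and recall@k. The following is a minimal sketch of how those two metrics are computed for the LTRPO setting; the object names and rankings are toy data, not drawn from the paper's dataset or the MultiRankIt implementation.

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """MRR over queries: 1 / (1-based rank of the target), 0 if absent."""
    total = 0.0
    for ranking, target in zip(ranked_lists, targets):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(ranked_lists)


def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose target appears in the top-k results."""
    hits = sum(target in ranking[:k]
               for ranking, target in zip(ranked_lists, targets))
    return hits / len(ranked_lists)


# Toy example: three instructions, each producing a ranked list of
# candidate object IDs; the gold target is "bottle" in every case.
ranked = [["bottle", "cup", "plate"],
          ["cup", "bottle", "plate"],
          ["plate", "cup", "bottle"]]
gold = ["bottle", "bottle", "bottle"]

print(round(mean_reciprocal_rank(ranked, gold), 3))  # (1 + 1/2 + 1/3) / 3 = 0.611
print(round(recall_at_k(ranked, gold, 2), 3))        # 2 of 3 targets in top-2
```

In the human-in-the-loop setting described above, recall@k is the practically relevant quantity: it bounds how often the operator finds the correct object within the first k candidates shown.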
Related papers
- Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct way of capturing essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability on localizing the active objects by: learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration [57.15811390835294]
This paper describes how we can train a single unified diffusion policy to handle both goal-directed navigation and goal-agnostic exploration.
We show that this unified policy results in better overall performance when navigating to visually indicated goals in novel environments.
Our experiments, conducted on a real-world mobile robot platform, show effective navigation in unseen environments in comparison with five alternative methods.
arXiv Detail & Related papers (2023-10-11T21:07:14Z)
- Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks [3.248019437833647]
This paper describes a domestic service robot (DSR) that fetches everyday objects and carries them to specified destinations according to free-form natural language instructions.
Most of the existing multimodal language understanding methods are impractical in terms of computational complexity.
We propose Switching Head-Tail Funnel UNITER, which solves the task by predicting the target object and the destination individually using a single model.
arXiv Detail & Related papers (2023-07-14T05:27:56Z)
- Lifelong Ensemble Learning based on Multiple Representations for Few-Shot Object Recognition [6.282068591820947]
We present a lifelong ensemble learning approach based on multiple representations to address the few-shot object recognition problem.
To facilitate lifelong learning, each approach is equipped with a memory unit for storing and retrieving object information instantly.
We have performed extensive sets of experiments to assess the performance of the proposed approach in offline and open-ended scenarios.
arXiv Detail & Related papers (2022-05-04T10:29:10Z)
- Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots [0.0]
We propose Target-dependent UNITER, which learns the relationship between the target object and other objects directly by focusing on the relevant regions within an image.
Our method is an extension of the UNITER-based Transformer that can be pretrained on general-purpose datasets.
Our model is validated on two standard datasets, and the results show that Target-dependent UNITER outperforms the baseline method in terms of classification accuracy.
arXiv Detail & Related papers (2021-07-02T03:11:02Z)
- Exploring Relational Context for Multi-Task Dense Prediction [76.86090370115]
We consider a multi-task environment for dense prediction tasks, represented by a common backbone and independent task-specific heads.
We explore various attention-based contexts, such as global and local, in the multi-task setting.
We propose an Adaptive Task-Relational Context module, which samples the pool of all available contexts for each task pair.
arXiv Detail & Related papers (2021-04-28T16:45:56Z)
- ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments [85.81157224163876]
We combine Vision-and-Language Navigation, assembling of collected objects, and object referring expression comprehension, to create a novel joint navigation-and-assembly task, named ArraMon.
During this task, the agent is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment.
We present results for several baseline models (integrated and biased) and metrics (nDTW, CTC, rPOD, and PTC), and the large model-human performance gap demonstrates that our task is challenging and presents a wide scope for future work.
arXiv Detail & Related papers (2020-11-15T23:30:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.