Transferring ConvNet Features from Passive to Active Robot
Self-Localization: The Use of Ego-Centric and World-Centric Views
- URL: http://arxiv.org/abs/2204.10497v1
- Date: Fri, 22 Apr 2022 04:42:33 GMT
- Title: Transferring ConvNet Features from Passive to Active Robot
Self-Localization: The Use of Ego-Centric and World-Centric Views
- Authors: Kanya Kurauchi, Kanji Tanaka, Ryogo Yamamoto, and Mitsuki Yoshida
- Abstract summary: A standard VPR subsystem is assumed to be available, and its domain-invariant state recognition ability is transferred to train a domain-invariant NBV planner.
We divide the visual cues that are available from the CNN model into two types: the output layer cue (OLC) and the intermediate layer cue (ILC).
In our framework, the ILC and OLC are mapped to a state vector and subsequently used to train a multiview NBV planner via deep reinforcement learning.
- Score: 2.362412515574206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The training of a next-best-view (NBV) planner for visual place recognition
(VPR) is a fundamentally important task in autonomous robot navigation, for
which a typical approach is the use of visual experiences that are collected in
the target domain as training data. However, the collection of a wide variety
of visual experiences in everyday navigation is costly and prohibitive for
real-time robotic applications. We address this issue by employing a novel
domain-invariant NBV planner. A standard VPR subsystem based on a
convolutional neural network (CNN) is assumed to be available, and we propose
to transfer its domain-invariant state recognition ability to train the
domain-invariant NBV planner. Specifically, we divide the visual cues
that are available from the CNN model into two types: the output layer cue
(OLC) and intermediate layer cue (ILC). The OLC is available at the output
layer of the CNN model and aims to estimate the state of the robot (e.g., the
robot viewpoint) with respect to the world-centric view coordinate system. The
ILC is available within the middle layers of the CNN model as a high-level
description of the visual content (e.g., a saliency image) with respect to the
ego-centric view. In our framework, the ILC and OLC are mapped to a state
vector and subsequently used to train a multiview NBV planner via deep
reinforcement learning. Experiments using the public NCLT dataset validate the
effectiveness of the proposed method.
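To make the pipeline in the abstract more concrete (OLC and ILC mapped to a state vector that drives an NBV planner), here is a minimal PyTorch sketch. It is not the authors' implementation: the backbone, layer sizes, the saliency-style pooling for the ILC, the place-belief softmax for the OLC, and the DQN-style action-value head are all illustrative assumptions.

```python
# Hypothetical sketch: fuse an output-layer cue (OLC) and an intermediate-layer
# cue (ILC) from a VPR-style CNN into a state vector, then score next-best-view
# actions with a small Q-network. All names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VPRBackbone(nn.Module):
    """Stand-in for a pretrained VPR CNN (kept frozen while training the planner)."""

    def __init__(self, num_places: int = 64):
        super().__init__()
        self.features = nn.Sequential(                      # intermediate layers -> ILC
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(32, num_places)               # output layer -> OLC

    def forward(self, img: torch.Tensor):
        fmap = self.features(img)                           # B x 32 x H x W
        ilc = fmap.mean(dim=1)                              # saliency-like map (ego-centric cue)
        pooled = fmap.mean(dim=(2, 3))                      # global average pooling
        olc = self.head(pooled).softmax(dim=-1)             # place/viewpoint belief (world-centric cue)
        return olc, ilc


class NBVPlanner(nn.Module):
    """DQN-style head mapping the fused state vector to action values."""

    def __init__(self, state_dim: int, num_actions: int = 8):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.q(state)


def build_state(olc: torch.Tensor, ilc: torch.Tensor) -> torch.Tensor:
    """Concatenate the OLC with a downsampled ILC into one state vector."""
    ilc_small = F.adaptive_avg_pool2d(ilc.unsqueeze(1), (8, 8))
    return torch.cat([olc, ilc_small.flatten(1)], dim=1)


if __name__ == "__main__":
    backbone = VPRBackbone()
    img = torch.rand(1, 3, 128, 128)                        # one ego-centric view
    olc, ilc = backbone(img)
    state = build_state(olc, ilc)                           # B x (64 + 64)
    planner = NBVPlanner(state_dim=state.shape[1])
    q_values = planner(state)                               # pick argmax as the next view
    print(q_values.argmax(dim=1))
```

The sketch only shows the forward pass and state fusion; in the paper the planner is trained with deep reinforcement learning over multiview episodes, with rewards derived from the VPR subsystem rather than the toy setup assumed here.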
Related papers
- UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation [71.97405667493477]
We introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN.
It enables agents to better explore future environments by unitedly rendering high-fidelity 360° visual images and semantic features.
UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
arXiv Detail & Related papers (2024-11-25T02:44:59Z) - DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z) - Learning Object-Centric Representation via Reverse Hierarchy Guidance [73.05170419085796]
Object-Centric Learning (OCL) seeks to enable Neural Networks to identify individual objects in visual scenes.
RHGNet introduces a top-down pathway that works in different ways in the training and inference processes.
Our model achieves SOTA performance on several commonly used datasets.
arXiv Detail & Related papers (2024-05-17T07:48:27Z) - Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the most classic approaches in machine learning and data analysis.
We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data.
FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
arXiv Detail & Related papers (2024-03-26T06:04:50Z) - VANP: Learning Where to See for Navigation with Self-Supervised Vision-Action Pre-Training [8.479135285935113]
Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation.
Most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects.
We propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP).
arXiv Detail & Related papers (2024-03-12T22:33:08Z) - Interactive Semantic Map Representation for Skill-based Visual Object
Navigation [43.71312386938849]
This paper introduces a new representation of a scene semantic map formed during the embodied agent interaction with the indoor environment.
We have implemented this representation into a full-fledged navigation approach called SkillTron.
The proposed approach makes it possible to form both intermediate goals for robot exploration and the final goal for object navigation.
arXiv Detail & Related papers (2023-11-07T16:30:12Z) - ViNT: A Foundation Model for Visual Navigation [52.2571739391896]
Visual Navigation Transformer (ViNT) is a foundation model for vision-based robotic navigation.
ViNT is trained with a general goal-reaching objective that can be used with any navigation dataset.
It exhibits positive transfer, outperforming specialist models trained on singular datasets.
arXiv Detail & Related papers (2023-06-26T16:57:03Z) - Polyline Based Generative Navigable Space Segmentation for Autonomous
Visual Navigation [57.3062528453841]
We propose a representation-learning-based framework to enable robots to learn the navigable space segmentation in an unsupervised manner.
We show that the proposed PSV-Nets can learn the visual navigable space with high accuracy, even without any single label.
arXiv Detail & Related papers (2021-10-29T19:50:48Z) - Scalable Perception-Action-Communication Loops with Convolutional and
Graph Neural Networks [208.15591625749272]
We present a perception-action-communication loop design using Vision-based Graph Aggregation and Inference (VGAI).
Our framework is implemented by a cascade of a convolutional and a graph neural network (CNN / GNN), addressing agent-level visual perception and feature learning.
We demonstrate that VGAI yields performance comparable to or better than other decentralized controllers.
arXiv Detail & Related papers (2021-06-24T23:57:21Z)