Zero Experience Required: Plug & Play Modular Transfer Learning for
Semantic Visual Navigation
- URL: http://arxiv.org/abs/2202.02440v1
- Date: Sat, 5 Feb 2022 00:07:21 GMT
- Title: Zero Experience Required: Plug & Play Modular Transfer Learning for
Semantic Visual Navigation
- Authors: Ziad Al-Halah, Santhosh K. Ramakrishnan, Kristen Grauman
- Abstract summary: We present a unified approach to visual navigation using a novel modular transfer learning model.
Our model can effectively leverage its experience from one source task and apply it to multiple target tasks.
Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
- Score: 97.17517060585875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reinforcement learning for visual navigation, it is common to develop a
model for each new task, and train that model from scratch with task-specific
interactions in 3D environments. However, this process is expensive; massive
amounts of interactions are needed for the model to generalize well. Moreover,
this process is repeated whenever there is a change in the task type or the
goal modality. We present a unified approach to visual navigation using a novel
modular transfer learning model. Our model can effectively leverage its
experience from one source task and apply it to multiple target tasks (e.g.,
ObjectNav, RoomNav, ViewNav) with various goal modalities (e.g., image, sketch,
audio, label). Furthermore, our model enables zero-shot experience learning,
whereby it can solve the target tasks without receiving any task-specific
interactive training. Our experiments on multiple photorealistic datasets and
challenging tasks show that our approach learns faster, generalizes better, and
outperforms SoTA models by a significant margin.
Related papers
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal datasets and Multi-Task Object Detection (M2Det) for remote sensing.
It is designed to accurately detect horizontal or oriented objects from any sensor modality.
This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z) - Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks [24.690910258151693]
Existing models for embodied navigation fall short of serving as practical generalists in the real world.
We present Uni-NaVid, the first video-based vision-language-action model designed to unify diverse embodied navigation tasks.
Uni-NaVid achieves this by the input and output data configurations for all commonly used embodied navigation tasks.
arXiv Detail & Related papers (2024-12-09T05:55:55Z) - LIMT: Language-Informed Multi-Task Visual World Models [6.128332310539627]
Multi-task reinforcement learning can be very challenging due to the increased sample complexity and the potentially conflicting task objectives.
We propose a method for learning multi-task visual world models, leveraging pre-trained language models to extract semantically meaningful task representations.
Our results highlight the benefits of using language-driven task representations for world models and a clear advantage of model-based multi-task learning over the more common model-free paradigm.
arXiv Detail & Related papers (2024-07-18T12:40:58Z) - MOWA: Multiple-in-One Image Warping Model [65.73060159073644]
We propose a Multiple-in-One image warping model (named MOWA) in this work.
We mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level.
To our knowledge, this is the first work that solves multiple practical warping tasks in one single model.
arXiv Detail & Related papers (2024-04-16T16:50:35Z) - MiniGPT-v2: large language model as a unified interface for
vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can be treated as a unified interface for better handling various vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z) - An Efficient General-Purpose Modular Vision Model via Multi-Task
Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z) - Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.