Zero Experience Required: Plug & Play Modular Transfer Learning for
Semantic Visual Navigation
- URL: http://arxiv.org/abs/2202.02440v1
- Date: Sat, 5 Feb 2022 00:07:21 GMT
- Title: Zero Experience Required: Plug & Play Modular Transfer Learning for
Semantic Visual Navigation
- Authors: Ziad Al-Halah, Santhosh K. Ramakrishnan, Kristen Grauman
- Abstract summary: We present a unified approach to visual navigation using a novel modular transfer learning model.
Our model can effectively leverage its experience from one source task and apply it to multiple target tasks.
Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
- Score: 97.17517060585875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reinforcement learning for visual navigation, it is common to develop a
model for each new task, and train that model from scratch with task-specific
interactions in 3D environments. However, this process is expensive; massive
amounts of interactions are needed for the model to generalize well. Moreover,
this process is repeated whenever there is a change in the task type or the
goal modality. We present a unified approach to visual navigation using a novel
modular transfer learning model. Our model can effectively leverage its
experience from one source task and apply it to multiple target tasks (e.g.,
ObjectNav, RoomNav, ViewNav) with various goal modalities (e.g., image, sketch,
audio, label). Furthermore, our model enables zero-shot experience learning,
whereby it can solve the target tasks without receiving any task-specific
interactive training. Our experiments on multiple photorealistic datasets and
challenging tasks show that our approach learns faster, generalizes better, and
outperforms SoTA models by a significant margin.
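The abstract describes a modular design in which different goal modalities plug into a navigation policy trained on a single source task. As a rough illustration only (not the authors' released code), the sketch below shows one way such a plug-and-play split could look: per-modality goal encoders (the ImageGoalEncoder and LabelGoalEncoder classes and the EMBED_DIM size are hypothetical) map goals into a shared embedding space, and a single NavigationPolicy consumes that embedding, so a new goal modality can be swapped in without retraining the policy.

```python
# Illustrative sketch of a plug-and-play modular navigation model.
# Not the paper's implementation; all names and sizes are assumptions.
import torch
import torch.nn as nn

EMBED_DIM = 128  # assumed size of the shared goal-embedding space


class ImageGoalEncoder(nn.Module):
    """Source-task encoder: embeds an RGB goal image (e.g., for image-goal navigation)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(EMBED_DIM),
        )

    def forward(self, goal_image):           # (B, 3, H, W)
        return self.net(goal_image)          # (B, EMBED_DIM)


class LabelGoalEncoder(nn.Module):
    """Plug-in encoder for a new modality (e.g., an object-category label for ObjectNav)."""
    def __init__(self, num_labels=21):
        super().__init__()
        self.embed = nn.Embedding(num_labels, EMBED_DIM)

    def forward(self, goal_label):            # (B,) integer labels
        return self.embed(goal_label)          # (B, EMBED_DIM)


class NavigationPolicy(nn.Module):
    """Shared policy head: consumes an observation feature plus a goal embedding.
    Trained once on the source task, then reused (frozen) for target tasks."""
    def __init__(self, obs_dim=512, num_actions=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(obs_dim + EMBED_DIM, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, obs_feat, goal_embed):
        return self.head(torch.cat([obs_feat, goal_embed], dim=-1))


if __name__ == "__main__":
    policy = NavigationPolicy()
    # Source task: image-goal navigation.
    img_logits = policy(torch.randn(1, 512), ImageGoalEncoder()(torch.randn(1, 3, 64, 64)))
    # Target task: swap in a label encoder without touching the policy.
    lbl_logits = policy(torch.randn(1, 512), LabelGoalEncoder()(torch.tensor([3])))
    print(img_logits.shape, lbl_logits.shape)  # torch.Size([1, 4]) for both
```

Under this reading, the "zero-shot experience learning" claimed in the abstract would correspond to training only the new goal encoder to land in the shared embedding space, without collecting any task-specific interactions for the policy itself.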
Related papers
- LIMT: Language-Informed Multi-Task Visual World Models [6.128332310539627]
Multi-task reinforcement learning can be very challenging due to the increased sample complexity and the potentially conflicting task objectives.
We propose a method for learning multi-task visual world models, leveraging pre-trained language models to extract semantically meaningful task representations.
Our results highlight the benefits of using language-driven task representations for world models and a clear advantage of model-based multi-task learning over the more common model-free paradigm.
arXiv Detail & Related papers (2024-07-18T12:40:58Z)
- MOWA: Multiple-in-One Image Warping Model [65.73060159073644]
We propose a Multiple-in-One image warping model (named MOWA) in this work.
We mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level.
To our knowledge, this is the first work that solves multiple practical warping tasks in one single model.
arXiv Detail & Related papers (2024-04-16T16:50:35Z)
- Towards Learning a Generalist Model for Embodied Navigation [24.816490551945435]
We propose the first generalist model for embodied navigation, NaviLLM.
It adapts LLMs to embodied navigation by introducing schema-based instruction.
We conduct extensive experiments to evaluate the performance and generalizability of our model.
arXiv Detail & Related papers (2023-12-04T16:32:51Z)
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can serve as a unified interface for handling various vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z)
- Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better to previously unseen environments.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)