Multi-View Dreaming: Multi-View World Model with Contrastive Learning
- URL: http://arxiv.org/abs/2203.11024v1
- Date: Tue, 15 Mar 2022 02:33:31 GMT
- Title: Multi-View Dreaming: Multi-View World Model with Contrastive Learning
- Authors: Akira Kinose, Masashi Okada, Ryo Okumura, Tadahiro Taniguchi
- Abstract summary: Multi-View Dreaming is a novel reinforcement learning agent for integrated recognition and control from multi-view observations.
In this paper, we use contrastive learning to train a shared latent space between different viewpoints.
We also propose Multi-View DreamingV2, a variant of Multi-View Dreaming that uses a categorical distribution to model the latent state instead of the Gaussian distribution.
- Score: 11.259786293913606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose Multi-View Dreaming, a novel reinforcement learning
agent for integrated recognition and control from multi-view observations by
extending Dreaming. Most current reinforcement learning methods assume a
single-view observation space, which imposes limitations on the observed
data, such as a lack of spatial information and occlusions. This makes obtaining
ideal observational information from the environment difficult and is a
bottleneck for real-world robotics applications. In this paper, we use
contrastive learning to train a shared latent space between different
viewpoints, and show how the Products of Experts approach can be used to
integrate and control the probability distributions of latent states for
multiple viewpoints. We also propose Multi-View DreamingV2, a variant of
Multi-View Dreaming that uses a categorical distribution to model the latent
state instead of the Gaussian distribution. Experiments show that the proposed
method outperforms simple extensions of existing methods in a realistic robot
control task.
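The abstract names three concrete mechanisms: a contrastive objective that aligns latent representations of the same time step across viewpoints, a Products of Experts (PoE) fusion of the per-view Gaussian latent posteriors, and, in Multi-View DreamingV2, a categorical latent in place of the Gaussian. The sketch below is a minimal illustration of those three pieces under common conventions (InfoNCE-style contrast, precision-weighted PoE, DreamerV2-style straight-through discrete sampling); it is not the authors' implementation, and all names, shapes, and hyperparameters are assumptions.

```python
# Illustrative sketch only; not the Multi-View Dreaming reference code.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(z_view_a, z_view_b, temperature=0.1):
    """InfoNCE-style loss pulling together latents of the same step seen
    from two viewpoints; mismatched batch entries act as negatives."""
    a = F.normalize(z_view_a, dim=-1)
    b = F.normalize(z_view_b, dim=-1)
    logits = a @ b.t() / temperature                         # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=logits.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def product_of_gaussian_experts(means, stds, eps=1e-6):
    """Fuse per-view Gaussian posteriors N(mu_i, sigma_i^2) with a Product of
    Experts: the product is Gaussian with precision equal to the sum of the
    per-view precisions."""
    precisions = [1.0 / (s ** 2 + eps) for s in stds]
    fused_var = 1.0 / sum(precisions)
    fused_mean = fused_var * sum(m * p for m, p in zip(means, precisions))
    return fused_mean, fused_var.sqrt()


def straight_through_categorical(logits):
    """Sample a one-hot categorical latent (DreamerV2-style) with a
    straight-through gradient estimator."""
    dist = torch.distributions.OneHotCategorical(logits=logits)
    sample = dist.sample()
    return sample + dist.probs - dist.probs.detach()


if __name__ == "__main__":
    batch, dim = 8, 32
    mu1, sd1 = torch.randn(batch, dim), torch.rand(batch, dim) + 0.1
    mu2, sd2 = torch.randn(batch, dim), torch.rand(batch, dim) + 0.1
    loss = contrastive_alignment_loss(mu1, mu2)
    fused_mu, fused_sd = product_of_gaussian_experts([mu1, mu2], [sd1, sd2])
    z_discrete = straight_through_categorical(torch.randn(batch, dim))
    print(loss.item(), fused_mu.shape, fused_sd.shape, z_discrete.shape)
```

In the paper's framing, the fused distribution would serve as the integrated latent state over viewpoints for downstream planning and control; the stand-alone functions above only make the precision-weighted fusion and the contrastive alignment explicit.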
Related papers
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z)
- Towards Generalized Multi-stage Clustering: Multi-view Self-distillation [10.368796552760571]
Existing multi-stage clustering methods independently learn the salient features from multiple views and then perform the clustering task.
This paper proposes a novel multi-stage deep MVC framework where multi-view self-distillation (DistilMVC) is introduced to distill dark knowledge of label distribution.
arXiv Detail & Related papers (2023-10-29T03:35:34Z)
- Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z)
- Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception [0.0]
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks.
In this paper, we demonstrate a method of aligning the embedding spaces of different modalities to the vision embedding space.
We show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks.
arXiv Detail & Related papers (2023-08-31T06:53:55Z)
- Multi-View Class Incremental Learning [57.14644913531313]
Multi-view learning (MVL) has achieved great success in integrating information from multiple perspectives of a dataset to improve downstream task performance.
This paper investigates a novel paradigm called multi-view class incremental learning (MVCIL), where a single model incrementally classifies new classes from a continual stream of views.
arXiv Detail & Related papers (2023-06-16T08:13:41Z)
- Visual Affordance Prediction for Guiding Robot Exploration [56.17795036091848]
We develop an approach for learning visual affordances for guiding robot exploration.
We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE.
We show how the trained affordance model can be used for guiding exploration by acting as a goal-sampling distribution, during visual goal-conditioned policy learning in robotic manipulation.
arXiv Detail & Related papers (2023-05-28T17:53:09Z)
- Latent Heterogeneous Graph Network for Incomplete Multi-View Learning [57.49776938934186]
We propose a novel Latent Heterogeneous Graph Network (LHGN) for incomplete multi-view learning.
By learning a unified latent representation, a trade-off between consistency and complementarity among different views is implicitly realized.
To avoid inconsistencies between the training and test phases, a transductive learning technique based on graph learning is applied for classification tasks.
arXiv Detail & Related papers (2022-08-29T15:14:21Z)
- MORI-RAN: Multi-view Robust Representation Learning via Hybrid Contrastive Fusion [4.36488705757229]
Multi-view representation learning is essential for many multi-view tasks, such as clustering and classification.
We propose a hybrid contrastive fusion algorithm to extract robust view-common representation from unlabeled data.
Experimental results demonstrate that the proposed method outperforms 12 competitive multi-view methods on four real-world datasets.
arXiv Detail & Related papers (2022-08-26T09:58:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.