Focus On What Matters: Separated Models For Visual-Based RL Generalization
- URL: http://arxiv.org/abs/2410.10834v1
- Date: Sun, 29 Sep 2024 04:37:56 GMT
- Title: Focus On What Matters: Separated Models For Visual-Based RL Generalization
- Authors: Di Zhang, Bowen Lv, Hai Zhang, Feifan Yang, Junqiao Zhao, Hang Yu, Chang Huang, Hongtu Zhou, Chen Ye, Changjun Jiang
- Abstract summary: Separated Models for Generalization (SMG) is a novel approach that exploits image reconstruction for generalization.
SMG incorporates two additional consistency losses to guide the agent's focus toward task-relevant areas across different scenarios.
Experiments in DMC demonstrate the SOTA performance of SMG in generalization, particularly excelling in video-background settings.
- Score: 16.87505461758058
- License:
- Abstract: A primary challenge for visual-based Reinforcement Learning (RL) is to generalize effectively across unseen environments. Although previous studies have explored different auxiliary tasks to enhance generalization, few adopt image reconstruction due to concerns about exacerbating overfitting to task-irrelevant features during training. Recognizing the strength of image reconstruction in representation learning, we propose SMG (Separated Models for Generalization), a novel approach that exploits image reconstruction for generalization. SMG introduces two model branches to extract task-relevant and task-irrelevant representations separately from visual observations via cooperative reconstruction. Built upon this architecture, we further emphasize the importance of task-relevant features for generalization. Specifically, SMG incorporates two additional consistency losses to guide the agent's focus toward task-relevant areas across different scenarios, thereby avoiding overfitting. Extensive experiments in DMC demonstrate the SOTA performance of SMG in generalization, particularly excelling in video-background settings. Evaluations on robotic manipulation tasks further confirm the robustness of SMG in real-world applications.
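The cooperative reconstruction idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the hard mask, and the squared-error loss are all hypothetical simplifications; the paper's branches and mask would be learned networks.

```python
# Hedged sketch of SMG-style cooperative reconstruction (hypothetical names).
# Two branches each reconstruct part of the observation; a per-pixel mask
# assigns responsibility, so the branches must cooperate to explain the
# full image, separating task-relevant from task-irrelevant content.

def cooperative_reconstruction(fg_recon, bg_recon, mask):
    """Blend task-relevant (fg) and task-irrelevant (bg) reconstructions.

    mask[i] in [0, 1] is the task-relevant branch's responsibility for
    pixel i; the task-irrelevant branch covers the remainder.
    """
    return [m * f + (1.0 - m) * b for f, b, m in zip(fg_recon, bg_recon, mask)]


def reconstruction_loss(obs, fg_recon, bg_recon, mask):
    """Mean squared error between the blended reconstruction and obs."""
    blended = cooperative_reconstruction(fg_recon, bg_recon, mask)
    return sum((o - r) ** 2 for o, r in zip(obs, blended)) / len(obs)


# Toy 4-"pixel" observation: the first two pixels are task-relevant.
obs = [1.0, 0.8, 0.2, 0.1]
fg = [1.0, 0.8, 0.0, 0.0]    # task-relevant branch output
bg = [0.0, 0.0, 0.2, 0.1]    # task-irrelevant branch output
mask = [1.0, 1.0, 0.0, 0.0]  # hard mask, for illustration only

print(reconstruction_loss(obs, fg, bg, mask))  # 0.0: branches jointly explain obs
```

In the paper's setup the mask and both branches are trained jointly, and the two consistency losses additionally encourage the task-relevant branch to attend to the same regions across perturbed versions of the scene.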
Related papers
- Varformer: Adapting VAR's Generative Prior for Image Restoration [6.0648320320309885]
VAR, a novel image generative paradigm, surpasses diffusion models in generation quality by applying a next-scale prediction approach.
We formulate the multi-scale latent representations within VAR as the restoration prior, thus advancing our delicately designed VarFormer framework.
arXiv Detail & Related papers (2024-12-30T16:32:55Z) - Semantics Disentanglement and Composition for Versatile Codec toward both Human-eye Perception and Machine Vision Task [47.7670923159071]
This study introduces an innovative semantics DISentanglement and COmposition VERsatile (DISCOVER) to simultaneously enhance human-eye perception and machine vision tasks.
The approach derives a set of labels per task through multimodal large models; grounding models are then applied for precise localization, enabling a comprehensive understanding and disentanglement of image components at the encoder side.
At the decoding stage, a comprehensive reconstruction of the image is achieved by leveraging these encoded components alongside priors from generative models, thereby optimizing performance for both human visual perception and machine-based analytical tasks.
arXiv Detail & Related papers (2024-12-24T04:32:36Z) - AEMIM: Adversarial Examples Meet Masked Image Modeling [12.072673694665934]
We propose to incorporate adversarial examples into masked image modeling, as the new reconstruction targets.
In particular, we introduce a novel auxiliary pretext task that reconstructs the adversarial examples corresponding to the original images.
We also devise an innovative adversarial attack to craft more suitable adversarial examples for MIM pre-training.
arXiv Detail & Related papers (2024-07-16T09:39:13Z) - RSBuilding: Towards General Remote Sensing Image Building Extraction and Change Detection with Foundation Model [22.56227565913003]
We propose a comprehensive remote sensing image building model, termed RSBuilding, developed from the perspective of the foundation model.
RSBuilding is designed to enhance cross-scene generalization and task understanding.
Our model was trained on a dataset comprising up to 245,000 images and validated on multiple building extraction and change detection datasets.
arXiv Detail & Related papers (2024-03-12T11:51:59Z) - Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration [58.11518043688793]
MPerceiver is a novel approach to enhance adaptiveness, generalizability and fidelity for all-in-one image restoration.
MPerceiver is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art task-specific methods across most tasks.
arXiv Detail & Related papers (2023-12-05T17:47:11Z) - Unifying Image Processing as Visual Prompting Question Answering [62.84955983910612]
Image processing is a fundamental task in computer vision, which aims at enhancing image quality and extracting essential features for subsequent vision applications.
Traditionally, task-specific models are developed for individual tasks and designing such models requires distinct expertise.
We propose a universal model for general image processing that covers image restoration, image enhancement, and image feature extraction tasks.
arXiv Detail & Related papers (2023-10-16T15:32:57Z) - Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose the Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z) - Learning Task-relevant Representations for Generalization via Characteristic Functions of Reward Sequence Distributions [63.773813221460614]
Generalization across different environments with the same tasks is critical for successful applications of visual reinforcement learning.
We propose a novel approach, namely Characteristic Reward Sequence Prediction (CRESP), to extract the task-relevant information.
Experiments demonstrate that CRESP significantly improves the performance of generalization on unseen environments.
arXiv Detail & Related papers (2022-05-20T14:52:03Z) - HiFaceGAN: Face Renovation via Collaborative Suppression and Replenishment [63.333407973913374]
"Face Renovation" (FR) is a semantic-guided generation problem.
"HiFaceGAN" is a multi-stage framework containing several nested CSR units.
Experiments on both synthetic and real face images have verified the superior performance of HiFaceGAN.
arXiv Detail & Related papers (2020-05-11T11:33:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.