PDiT: Interleaving Perception and Decision-making Transformers for Deep
Reinforcement Learning
- URL: http://arxiv.org/abs/2312.15863v1
- Date: Tue, 26 Dec 2023 03:07:10 GMT
- Title: PDiT: Interleaving Perception and Decision-making Transformers for Deep
Reinforcement Learning
- Authors: Hangyu Mao, Rui Zhao, Ziyue Li, Zhiwei Xu, Hao Chen, Yiqun Chen, Bin
Zhang, Zhen Xiao, Junge Zhang, and Jiangjin Yin
- Abstract summary: The Perception and Decision-making Interleaving Transformer (PDiT) network is proposed.
Experiments show that PDiT not only achieves superior performance to strong baselines but also extracts explainable feature representations.
- Score: 27.128220336919195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Designing better deep networks and better reinforcement learning (RL)
algorithms are both important for deep RL. This work studies the former.
Specifically, the Perception and Decision-making Interleaving Transformer
(PDiT) network is proposed, which cascades two Transformers in a very natural
way: the perceiving one focuses on the environmental perception by
processing the observation at the patch level, whereas the deciding one pays
attention to the decision-making by conditioning on the history of the
desired returns, the perceiver's outputs, and the actions. Such a network
design is generally applicable to many deep RL settings, e.g., both
online and offline RL algorithms under environments with image
observations, proprioception observations, or hybrid image-language
observations. Extensive experiments show that PDiT not only achieves
superior performance to strong baselines in different settings but also
extracts explainable feature representations. Our code is available at
https://github.com/maohangyu/PDiT.
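The abstract's two-stage design can be sketched in a few lines: a perceiving Transformer attends over observation patches, and a deciding Transformer attends over a (desired return, perceived feature, action) token triple. This is a minimal NumPy sketch of that idea only; the layer sizes, single-head attention, pooling, and token layout are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a (tokens, dim) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d = 16  # embedding dimension (assumed)
w = [rng.normal(size=(d, d)) for _ in range(3)]  # shared toy Q/K/V weights

# --- Perceiving Transformer: processes the observation at patch level ---
obs = rng.normal(size=(4, 4, 3))      # toy 4x4 RGB observation
patches = obs.reshape(4, 12)          # four 1x4 row patches, flattened
w_patch = rng.normal(size=(12, d))
patch_tokens = patches @ w_patch      # (4, d) patch embeddings
perceived = attention(patch_tokens, *w).mean(axis=0)  # pooled obs feature

# --- Deciding Transformer: conditions on return, perceiver output, action
ret_tok = rng.normal(size=d)          # desired-return embedding
act_tok = rng.normal(size=d)          # previous-action embedding
seq = np.stack([ret_tok, perceived, act_tok])  # interleaved token triple
decided = attention(seq, *w)          # (3, d) contextualised tokens
action_logits = decided[1] @ rng.normal(size=(d, 6))  # score 6 toy actions
print(action_logits.shape)  # (6,)
```

In the real network both stages are stacked and interleaved over a history of timesteps; the sketch only shows how the perceiver's output becomes one token in the decider's input sequence.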
Related papers
- Adaptive Step-size Perception Unfolding Network with Non-local Hybrid Attention for Hyperspectral Image Reconstruction [0.39134031118910273]
We propose an adaptive step-size perception unfolding network (ASPUN), a deep unfolding network based on FISTA algorithm.
In addition, we design a Non-local Hybrid Attention Transformer(NHAT) module for fully leveraging the receptive field advantage of transformer.
Experimental results show that our ASPUN is superior to the existing SOTA algorithms and achieves the best performance.
arXiv Detail & Related papers (2024-07-04T16:09:52Z) - CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z) - RePo: Resilient Model-Based Reinforcement Learning by Regularizing
Posterior Predictability [25.943330238941602]
We propose a visual model-based RL method that learns a latent representation resilient to spurious variations.
Our training objective encourages the representation to be maximally predictive of dynamics and reward.
Our effort is a step towards making model-based RL a practical and useful tool for dynamic, diverse domains.
arXiv Detail & Related papers (2023-08-31T18:43:04Z) - Transformer in Transformer as Backbone for Deep Reinforcement Learning [43.354375917223656]
We propose to design pure Transformer-based networks for deep RL.
The Transformer in Transformer (TIT) backbone is proposed, which cascades two Transformers in a very natural way.
Experiments show that TIT can achieve satisfactory performance in different settings consistently.
arXiv Detail & Related papers (2022-12-30T03:50:38Z) - On Transforming Reinforcement Learning by Transformer: The Development
Trajectory [97.79247023389445]
Transformer, originally devised for natural language processing, has also achieved significant success in computer vision.
We group existing developments in two categories: architecture enhancement and trajectory optimization.
We examine the main applications of TRL in robotic manipulation, text-based games, navigation and autonomous driving.
arXiv Detail & Related papers (2022-12-29T03:15:59Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient
Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - Video Super-Resolution Transformer [85.11270760456826]
Video super-resolution (VSR), with the aim to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem.
Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling.
In this paper, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information.
arXiv Detail & Related papers (2021-06-12T20:00:32Z) - LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
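The LocalViT summary above describes one concrete mechanism: a depth-wise convolution inserted between the two linear layers of the Transformer's feed-forward network, applied after reshaping tokens back onto the image grid. The sketch below illustrates that mechanism in NumPy; the grid size, channel counts, and zero padding are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def depthwise_conv3x3(x, kernels):
    """Per-channel 3x3 convolution on an (H, W, C) grid, zero padded."""
    h, w, c = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            # each channel sees only its own 3x3 spatial neighbourhood
            out[i, j] = np.einsum("ijc,ijc->c",
                                  padded[i:i+3, j:j+3], kernels)
    return out

h = w_grid = 4          # token grid (assumed 4x4 image patches)
c, hidden = 8, 16       # embedding and FFN-expansion dims (assumed)
tokens = rng.normal(size=(h * w_grid, c))   # flattened token sequence
w1 = rng.normal(size=(c, hidden))
w2 = rng.normal(size=(hidden, c))
k = rng.normal(size=(3, 3, hidden))

# FFN: expand -> reshape tokens onto the grid -> depth-wise conv
# (the locality injection) -> flatten back -> project down
hid = np.maximum(tokens @ w1, 0).reshape(h, w_grid, hidden)
hid = np.maximum(depthwise_conv3x3(hid, k), 0)
out = hid.reshape(h * w_grid, hidden) @ w2
print(out.shape)  # (16, 8)
```

The key point the blurb makes is visible here: self-attention layers are untouched, and locality enters only through the convolution inside the feed-forward path.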
arXiv Detail & Related papers (2021-04-12T17:59:22Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.