On the Effect of Pre-training for Transformer in Different Modality on
Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2211.09817v1
- Date: Thu, 17 Nov 2022 13:34:08 GMT
- Title: On the Effect of Pre-training for Transformer in Different Modality on
Offline Reinforcement Learning
- Authors: Shiro Takagi
- Abstract summary: We investigate how pre-training on data of different modalities, such as language and vision, affects fine-tuning of Transformer-based models to Mujoco offline reinforcement learning tasks.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We empirically investigate how pre-training on data of different modalities,
such as language and vision, affects fine-tuning of Transformer-based models to
Mujoco offline reinforcement learning tasks. Analysis of the internal
representation reveals that the pre-trained Transformers acquire largely
different representations before and after pre-training, but acquire less
information about the data during fine-tuning than the randomly initialized one. A closer
look at the parameter changes of the pre-trained Transformers reveals that
their parameters change relatively little and that the poor performance of the
model pre-trained with image data could partially come from large gradients and
gradient clipping. To study what information the Transformer pre-trained with
language data utilizes, we fine-tune this model with no context provided,
finding that the model learns efficiently even without context information.
Follow-up analysis supports the hypothesis that pre-training on language data
leads the Transformer to acquire context-like information and to utilize it to
solve the downstream task.
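To make the gradient-related observation concrete, below is a minimal sketch, not the paper's code, of how one might log pre-clipping gradient norms and the overall parameter change while fine-tuning a small Transformer on synthetic trajectories standing in for Mujoco offline RL data; the model sizes, the clipping threshold of 0.25, and the synthetic data are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code): log pre-clipping gradient norms and the
# total parameter change while fine-tuning a toy Transformer on synthetic
# trajectories that stand in for Mujoco offline RL data. Model sizes, the
# clipping threshold, and the data are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

state_dim, act_dim, d_model, context_len = 17, 6, 64, 20  # toy, illustrative sizes
embed = nn.Linear(state_dim, d_model)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)  # stand-in for a pre-trained Transformer backbone
head = nn.Linear(d_model, act_dim)

params = list(embed.parameters()) + list(backbone.parameters()) + list(head.parameters())
initial = [p.detach().clone() for p in params]  # snapshot to measure parameter change
opt = torch.optim.AdamW(params, lr=1e-4)

for step in range(100):
    states = torch.randn(8, context_len, state_dim)        # synthetic trajectory batch
    target_actions = torch.randn(8, context_len, act_dim)  # synthetic action targets
    loss = nn.functional.mse_loss(head(backbone(embed(states))), target_actions)

    opt.zero_grad()
    loss.backward()
    # clip_grad_norm_ returns the total gradient norm *before* clipping; large
    # values here, combined with clipping, are the effect discussed in the abstract.
    grad_norm = torch.nn.utils.clip_grad_norm_(params, max_norm=0.25)
    if step % 20 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}  pre-clip grad norm {float(grad_norm):.2f}")
    opt.step()

# How far the parameters moved during fine-tuning (the abstract notes this stays
# small for the pre-trained models).
change = torch.sqrt(sum((p.detach() - p0).pow(2).sum() for p, p0 in zip(params, initial)))
print(f"L2 norm of total parameter change: {float(change):.3f}")
```

Shrinking context_len to 1 in the same loop would roughly emulate the no-context fine-tuning probe described above.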
Related papers
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms, such as low-rank computation, perform well for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Nonparametric Variational Regularisation of Pretrained Transformers [15.313475675235843]
We propose Nonparametric Variational Information Bottleneck (NVIB) as a regulariser for training cross-attention in Transformers.
We show that changing the initialisation introduces a novel, information-theoretic post-training regularisation in the attention mechanism.
arXiv Detail & Related papers (2023-12-01T15:40:30Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z) - Don't Sweep your Learning Rate under the Rug: A Closer Look at
Cross-modal Transfer of Pretrained Transformers [1.9662978733004601]
Self-supervised pre-training of large-scale transformer models on text corpora followed by fine-tuning has achieved state-of-the-art results on a number of natural language processing tasks.
In our work, we find that this result is, in fact, an artifact of not tuning the learning rates.
arXiv Detail & Related papers (2021-07-26T20:20:48Z) - Grounding inductive biases in natural images: invariance stems from
variations in data [20.432568247732206]
We study the factors of variation in a real dataset, ImageNet.
We show standard augmentation relies on a precise combination of translation and scale.
We find that the main factors of variation in ImageNet mostly relate to appearance.
arXiv Detail & Related papers (2021-06-09T14:58:57Z) - Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.