Training Compute-Optimal Vision Transformers for Brain Encoding
- URL: http://arxiv.org/abs/2410.19810v1
- Date: Thu, 17 Oct 2024 12:54:50 GMT
- Title: Training Compute-Optimal Vision Transformers for Brain Encoding
- Authors: Sana Ahmadi, Francois Paugam, Tristan Glatard, Pierre Lune Bellec
- Abstract summary: The optimal training of a vision transformer for brain encoding depends on three factors: model size, data size, and computational resources.
This study investigates the effects of data scaling, model scaling, and high-performance computing on brain encoding results.
- Score: 0.46873264197900916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The optimal training of a vision transformer for brain encoding depends on three factors: model size, data size, and computational resources. This study investigates these three pillars, focusing on the effects of data scaling, model scaling, and high-performance computing on brain encoding results. Using VideoGPT to extract efficient spatiotemporal features from videos and training a Ridge model to predict brain activity based on these features, we conducted benchmark experiments with varying data sizes (10k, 100k, 1M, 6M) and different model configurations of GPT-2, including hidden layer dimensions, number of layers, and number of attention heads. We also evaluated the effects of training models with 32-bit vs 16-bit floating point representations. Our results demonstrate that increasing the hidden layer dimensions significantly improves brain encoding performance, as evidenced by higher Pearson correlation coefficients across all subjects. In contrast, the number of attention heads does not have a significant effect on the encoding results. Additionally, increasing the number of layers shows some improvement in brain encoding correlations, but the trend is not as consistent as that observed with hidden layer dimensions. The data scaling results show that larger training datasets lead to improved brain encoding performance, with the highest Pearson correlation coefficients observed for the largest dataset size (6M). These findings highlight that the effects of data scaling are more significant compared to model scaling in enhancing brain encoding performance. Furthermore, we explored the impact of floating-point precision by comparing 32-bit and 16-bit representations. Training with 16-bit precision yielded the same brain encoding accuracy as 32-bit, while reducing training time by 1.17 times, demonstrating its efficiency for high-performance computing tasks.
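The encoding pipeline described in the abstract (spatiotemporal features from video, a Ridge model mapping features to brain activity, and Pearson correlation as the accuracy metric) can be sketched roughly as below. This is a minimal illustration, not the authors' code: the VideoGPT feature extraction is replaced by random placeholder arrays, and all shapes and the ridge penalty are assumptions.

```python
# Minimal sketch of a brain-encoding benchmark: features -> Ridge -> per-voxel Pearson r.
# Feature extraction is stubbed out; the paper uses VideoGPT embeddings of video clips.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical sizes: n_samples clips, d-dimensional features, n_voxels fMRI targets.
n_samples, d, n_voxels = 2000, 512, 300
X = rng.standard_normal((n_samples, d)).astype(np.float32)        # placeholder for VideoGPT features
Y = rng.standard_normal((n_samples, n_voxels)).astype(np.float32)  # placeholder for fMRI responses

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Ridge regression handles all voxels jointly; alpha would normally be tuned per subject.
encoder = Ridge(alpha=1.0)
encoder.fit(X_train, Y_train)
Y_pred = encoder.predict(X_test)

def pearson_per_voxel(y_true, y_pred):
    """Pearson correlation between predicted and measured responses, one value per voxel."""
    y_true = y_true - y_true.mean(axis=0)
    y_pred = y_pred - y_pred.mean(axis=0)
    num = (y_true * y_pred).sum(axis=0)
    den = np.sqrt((y_true ** 2).sum(axis=0) * (y_pred ** 2).sum(axis=0))
    return num / den

r = pearson_per_voxel(Y_test, Y_pred)
print(f"mean Pearson r across voxels: {r.mean():.3f}")
```

In the paper's benchmark, the same evaluation would be repeated while varying the feature extractor's configuration (hidden dimension, layers, attention heads), the training-set size, and the floating-point precision used during training.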
Related papers
- OmniBal: Towards Fast Instruct-tuning for Vision-Language Models via Omniverse Computation Balance [65.48009829137824]
Large-scale 3D parallel training on vision-language instruct-tuning models leads to an imbalanced computation load across different devices.
We rebalanced the computational loads from data, model, and memory perspectives to address this issue.
Our method's efficacy and generalizability were further demonstrated across various models and datasets.
arXiv Detail & Related papers (2024-07-30T12:02:58Z)
- On the Scalability of Diffusion-based Text-to-Image Generation [97.64837704129005]
We study the scaling properties of diffusion-based text-to-image (T2I) models.
For model scaling, we find that the location and amount of cross attention distinguish the performance of existing UNet designs.
On the data scaling side, we show the quality and diversity of the training set matters more than simply dataset size.
arXiv Detail & Related papers (2024-04-03T17:34:28Z)
- Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets.
We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z)
- The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation.
We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of a challenging problem in healthcare.
Within this framework, we train 15 predictive models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z)
- Scaling laws for language encoding models in fMRI [47.498241053872924]
We tested whether larger open-source models are better at predicting brain responses recorded using fMRI.
Similar logarithmic behavior was observed when scaling the size of the fMRI training set.
These results suggest that increasing scale in both models and data will yield incredibly effective models of language processing in the brain.
arXiv Detail & Related papers (2023-05-19T17:53:03Z)
- Analyzing the Performance of Deep Encoder-Decoder Networks as Surrogates for a Diffusion Equation [0.0]
We study the use of encoder-decoder convolutional neural networks (CNNs) as surrogates for steady-state diffusion solvers.
Our results indicate that increasing the size of the training set has a substantial effect on reducing performance fluctuations and overall error.
arXiv Detail & Related papers (2023-02-07T22:53:19Z)
- Deep learning for ECoG brain-computer interface: end-to-end vs. hand-crafted features [4.7773230870500605]
Brain signals are temporal data with a low signal-to-noise ratio, uncertain labels, and nonstationary behavior over time.
These factors may influence the training process and slow down the models' performance improvement.
This paper compares models that use the raw ECoG signal with models that use time-frequency features for BCI motor imagery decoding.
arXiv Detail & Related papers (2022-10-05T20:18:30Z)
- Impact of dataset size and long-term ECoG-based BCI usage on deep learning decoders performance [4.7773230870500605]
In brain-computer interfaces (BCI) research, recording data is time-consuming and expensive.
Can we achieve higher decoding performance with more data to train decoders?
High decoding performance was obtained with relatively small datasets recorded later in the experiment.
arXiv Detail & Related papers (2022-09-08T13:01:05Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters and maintaining high speed in both training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)