Sequence Length Scaling in Vision Transformers for Scientific Images on Frontier
- URL: http://arxiv.org/abs/2405.15780v1
- Date: Wed, 17 Apr 2024 19:57:07 GMT
- Title: Sequence Length Scaling in Vision Transformers for Scientific Images on Frontier
- Authors: Aristeidis Tsaris, Chengming Zhang, Xiao Wang, Junqi Yin, Siyan Liu, Moetasim Ashfaq, Ming Fan, Jong Youl Choi, Mohamed Wahib, Dan Lu, Prasanna Balaprakash, Feiyi Wang,
- Abstract summary: Vision Transformers (ViTs) are pivotal for foundational models in scientific imagery, including Earth science applications.
We develop distributed sequence parallelism for ViTs, enabling them to handle up to 1M tokens.
Our approach, leveraging DeepSpeed-Ulysses and Long-Sequence-Segmentation with model sharding, is the first to apply sequence parallelism in ViT training.
- Score: 10.844784589626231
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) are pivotal for foundational models in scientific imagery, including Earth science applications, due to their capability to process large sequence lengths. While transformers for text have inspired scaling sequence lengths in ViTs, adapting these techniques to ViTs introduces unique challenges. We develop distributed sequence parallelism for ViTs, enabling them to handle up to 1M tokens. Our approach, leveraging DeepSpeed-Ulysses and Long-Sequence-Segmentation with model sharding, is the first to apply sequence parallelism in ViT training, achieving a 94% batch scaling efficiency on 2,048 AMD-MI250X GPUs. Evaluating sequence parallelism in ViTs, particularly in models up to 10B parameters, highlighted substantial bottlenecks. We countered these with hybrid sequence, pipeline, and tensor parallelism and flash-attention strategies to scale beyond single-GPU memory limits. Our method significantly enhances climate modeling accuracy by 20% in temperature predictions, marking the first training of a transformer model on a full-attention matrix over a 188K sequence length.
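For intuition, the following is a minimal single-process sketch of the Ulysses-style redistribution the abstract refers to: each worker holds a slice of the patch-token sequence for all heads, an all-to-all exchange regroups the data so each worker holds the full sequence for a slice of heads, attention runs locally, and the inverse exchange restores sequence sharding. The all-to-all is simulated here with reshapes, and all names and sizes are illustrative, not the authors' code.

```python
# Single-process sketch of Ulysses-style sequence parallelism for ViT attention.
# The all-to-all exchange is simulated with reshapes/transposes; in real training
# it would be torch.distributed collectives across GPUs (sizes are illustrative).
import torch

world_size = 4          # number of sequence-parallel workers
seq_len    = 1024       # total number of image patch tokens
num_heads  = 8          # attention heads, must be divisible by world_size
head_dim   = 64

# Each "rank" holds seq_len // world_size tokens for all heads:
# shape (world, seq_shard, heads, head_dim)
q = torch.randn(world_size, seq_len // world_size, num_heads, head_dim)
k = torch.randn_like(q)
v = torch.randn_like(q)

def all_to_all_seq_to_head(x):
    """Simulated all-to-all: trade sequence sharding for head sharding.

    In:  (world, seq_shard, heads, head_dim)      -- some tokens, all heads per rank
    Out: (world, seq_len, heads/world, head_dim)  -- all tokens, some heads per rank
    """
    w, s, h, d = x.shape
    x = x.reshape(w, s, w, h // w, d)       # split heads into w groups
    x = x.permute(2, 0, 1, 3, 4)            # destination rank first
    return x.reshape(w, w * s, h // w, d)   # concatenate token shards -> full sequence

def all_to_all_head_to_seq(x, world):
    """Inverse exchange: back to sequence sharding with all heads."""
    w, s_full, h_shard, d = x.shape
    x = x.reshape(w, world, s_full // world, h_shard, d)
    x = x.permute(1, 2, 0, 3, 4)
    return x.reshape(world, s_full // world, world * h_shard, d)

# Full-sequence attention, computed locally per rank on its head slice.
qh, kh, vh = (all_to_all_seq_to_head(t) for t in (q, k, v))
scores = torch.einsum("wqhd,wkhd->whqk", qh, kh) / head_dim ** 0.5
out = torch.einsum("whqk,wkhd->wqhd", torch.softmax(scores, dim=-1), vh)

out = all_to_all_head_to_seq(out, world_size)   # restore (world, seq_shard, heads, dim)
print(out.shape)  # torch.Size([4, 256, 8, 64])
```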
Related papers
- Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
Integrated with the Hugging Face library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
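A minimal sketch of the mini-sequence idea, assuming only that token-wise blocks such as the MLP can be applied chunk-by-chunk along the sequence; the actual MsT handling of the backward pass and the LM head is more involved, and the names below are illustrative.

```python
# Sketch of mini-sequence-style chunking (illustrative, not the MsT code):
# apply the MLP block chunk-by-chunk along the sequence so the 4x-wide
# intermediate activation is only materialized for one chunk at a time.
import torch
import torch.nn as nn

d_model, seq_len, chunk = 512, 8192, 1024
mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                    nn.Linear(4 * d_model, d_model))

x = torch.randn(seq_len, d_model)

def chunked_mlp(x, mlp, chunk):
    # Token-wise ops are independent across positions, so splitting the
    # sequence dimension changes peak memory but not the result.
    outs = [mlp(part) for part in x.split(chunk, dim=0)]
    return torch.cat(outs, dim=0)

y = chunked_mlp(x, mlp, chunk)
assert torch.allclose(y, mlp(x), atol=1e-5)
```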
arXiv Detail & Related papers (2024-07-22T01:52:30Z) - Ultra-Long Sequence Distributed Transformer [10.263668150008316]
Transformer models trained on long sequences often achieve higher accuracy than those trained on short sequences.
Existing methods for long sequence training offer limited speedup and memory reduction.
This paper presents a novel and efficient distributed training method, the Long Short-Sequence Transformer.
arXiv Detail & Related papers (2023-11-04T11:38:53Z) - PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
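As a hedged illustration of what "Taylorizing" a nonlinearity can look like, the sketch below uses the second-order Taylor expansion of GELU around zero; PriViT's actual polynomial choices, and its selection of which nonlinearities to replace, are learned rather than fixed as here.

```python
# Illustrative "Taylorization" of a nonlinearity (not PriViT's exact scheme):
# replace GELU with a low-degree polynomial so only additions and
# multiplications remain, which secure multi-party protocols handle cheaply.
import torch

def gelu_taylor(x):
    # Second-order Taylor expansion of GELU(x) = x * Phi(x) around 0:
    # GELU(x) ~ 0.5*x + 0.3989*x^2 (accurate near zero, diverges for large |x|).
    return 0.5 * x + 0.3989 * x * x

x = torch.linspace(-2, 2, 9)
print(torch.nn.functional.gelu(x))
print(gelu_taylor(x))
```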
arXiv Detail & Related papers (2023-10-06T21:45:05Z) - Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z) - TerViT: An Efficient Ternary Vision Transformer [21.348788407233265]
Vision transformers (ViTs) have demonstrated great potential in various visual tasks, but suffer from expensive computational and memory cost problems when deployed on resource-constrained devices.
We introduce a ternary vision transformer (TerViT) to ternarize the weights in ViTs, a task challenged by the large loss-surface gap between real-valued and ternary parameters.
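For intuition, here is a generic threshold-based ternarization in the style of Ternary Weight Networks, as a baseline sketch only; TerViT's training recipe for closing the loss-surface gap is more involved, and the sizes below are illustrative.

```python
# Generic threshold-based ternarization of a weight tensor (a TWN-style
# baseline for illustration; not TerViT's full training procedure).
import torch

def ternarize(w, delta_scale=0.7):
    delta = delta_scale * w.abs().mean()        # threshold around zero
    mask = (w.abs() > delta).float()
    sign = torch.sign(w) * mask                 # values in {-1, 0, +1}
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1)  # per-tensor scale
    return alpha * sign

w = torch.randn(768, 768)              # e.g. a ViT attention projection weight
w_t = ternarize(w)
print(sorted(w_t.unique().tolist()))   # exactly three values: -alpha, 0.0, +alpha
```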
arXiv Detail & Related papers (2022-01-20T08:29:19Z) - ViR:the Vision Reservoir [10.881974985012839]
Vision Reservoir computing (ViR) is proposed here for image classification, as a parallel to Vision Transformer (ViT).
By splitting each image into a sequence of tokens with fixed length, the ViR constructs a pure reservoir with a nearly fully connected topology to replace the Transformer module in ViT.
The number of parameters of the ViR is about 15%, or even as low as 5%, of the ViT's, and the memory footprint is about 20% to 40% of the ViT's.
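A minimal echo-state-style sketch of the reservoir idea follows, with assumed sizes, a fixed dense recurrent matrix, and a trained readout; the actual ViR topology and readout are described in the paper, so this is illustration only.

```python
# Minimal echo-state-style sketch of the ViR idea (illustrative only):
# patch tokens are driven through a fixed, densely connected reservoir and
# only a light readout is trained, in place of a Transformer encoder.
import torch

patches, d_patch, d_res, n_cls = 196, 768, 512, 10   # e.g. 14x14 patches of a 224x224 image
torch.manual_seed(0)

W_in  = torch.randn(d_patch, d_res) * 0.05            # fixed input weights
W_res = torch.randn(d_res, d_res)                     # fixed dense recurrent weights
W_res *= 0.9 / torch.linalg.eigvals(W_res).abs().max()  # keep spectral radius < 1
readout = torch.nn.Linear(d_res, n_cls)               # the only trained part

x = torch.randn(patches, d_patch)                     # one image as a token sequence
h = torch.zeros(d_res)
for token in x:                                       # scan tokens through the reservoir
    h = torch.tanh(token @ W_in + h @ W_res)
logits = readout(h)
print(logits.shape)  # torch.Size([10])
```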
arXiv Detail & Related papers (2021-12-27T07:07:50Z) - Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
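Below is a sketch of only the long-range half of the mechanism, where keys and values are compressed with a dynamic low-rank projection computed from the keys themselves; the sliding-window short-term attention and the fusion of both branches are omitted, and all sizes are assumptions for illustration.

```python
# Sketch of the long-range branch of a Transformer-LS-style layer: keys/values
# are compressed along the sequence with a dynamic low-rank projection, so each
# query attends to r landmarks instead of n tokens (short-term branch omitted).
import torch
import torch.nn.functional as F

n, d, r = 4096, 64, 128          # sequence length, head dim, projected length
q = torch.randn(n, d)
k = torch.randn(n, d)
v = torch.randn(n, d)

proj = torch.nn.Linear(d, r, bias=False)      # projection scores computed from K
P = F.softmax(proj(k), dim=0)                 # (n, r), normalized over the tokens
k_land = P.t() @ k                            # (r, d) compressed keys
v_land = P.t() @ v                            # (r, d) compressed values

attn = F.softmax(q @ k_land.t() / d ** 0.5, dim=-1)   # (n, r) instead of (n, n)
out = attn @ v_land                                    # (n, d)
print(out.shape)  # torch.Size([4096, 64])
```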
arXiv Detail & Related papers (2021-07-05T18:00:14Z) - DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled to be deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
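A minimal sketch of the head-mixing step behind Re-attention is given below; the paper's normalization details are omitted, and Theta and the sizes are illustrative assumptions.

```python
# Sketch of Re-attention (DeepViT): attention maps from different heads are
# mixed with a learnable head-to-head matrix Theta before being applied to V,
# which increases attention-map diversity in deep ViTs. The paper's
# normalization step is omitted here for brevity.
import torch
import torch.nn.functional as F

heads, n, d = 8, 197, 64                    # e.g. ViT-B/16 on 224x224 inputs
q = torch.randn(heads, n, d)
k = torch.randn(heads, n, d)
v = torch.randn(heads, n, d)

theta = torch.nn.Parameter(torch.eye(heads) + 0.01 * torch.randn(heads, heads))

attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)   # (heads, n, n)
re_attn = torch.einsum("gh,hij->gij", theta, attn)             # mix maps across heads
out = re_attn @ v                                              # (heads, n, d)
print(out.shape)  # torch.Size([8, 197, 64])
```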
arXiv Detail & Related papers (2021-03-22T14:32:07Z) - Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition [66.47000813920617]
We propose a spike-triggered non-autoregressive transformer model for end-to-end speech recognition.
The proposed model can accurately predict the length of the target sequence and achieve a competitive performance.
The model even achieves a real-time factor of 0.0056, faster than all mainstream speech recognition models.
arXiv Detail & Related papers (2020-05-16T08:27:20Z)