Efficient Training of Large Vision Models via Advanced Automated Progressive Learning
- URL: http://arxiv.org/abs/2410.00350v1
- Date: Fri, 6 Sep 2024 16:24:24 GMT
- Title: Efficient Training of Large Vision Models via Advanced Automated Progressive Learning
- Authors: Changlin Li, Jiawei Zhang, Sihao Lin, Zongxin Yang, Junwei Liang, Xiaodan Liang, Xiaojun Chang,
- Abstract summary: We present an advanced automated progressive learning (AutoProg) framework for efficient training of Large Vision Models (LVMs)
We introduce AutoProg-Zero, by enhancing the AutoProg framework with a novel zero-shot unfreezing schedule search.
Experiments show that AutoProg accelerates ViT pre-training by up to 1.85x on ImageNet and accelerates fine-tuning of diffusion models by up to 2.86x, with comparable or even higher performance.
- Score: 96.71646528053651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancements in Large Vision Models (LVMs), such as Vision Transformers (ViTs) and diffusion models, have led to an increasing demand for computational resources, resulting in substantial financial and environmental costs. This growing challenge highlights the necessity of developing efficient training methods for LVMs. Progressive learning, a training strategy in which model capacity gradually increases during training, has shown potential in addressing these challenges. In this paper, we present an advanced automated progressive learning (AutoProg) framework for efficient training of LVMs. We begin by focusing on the pre-training of LVMs, using ViTs as a case study, and propose AutoProg-One, an AutoProg scheme featuring momentum growth (MoGrow) and a one-shot growth schedule search. Beyond pre-training, we extend our approach to tackle transfer learning and fine-tuning of LVMs. We expand the scope of AutoProg to cover a wider range of LVMs, including diffusion models. First, we introduce AutoProg-Zero, by enhancing the AutoProg framework with a novel zero-shot unfreezing schedule search, eliminating the need for one-shot supernet training. Second, we introduce a novel Unique Stage Identifier (SID) scheme to bridge the gap during network growth. These innovations, integrated with the core principles of AutoProg, offer a comprehensive solution for efficient training across various LVM scenarios. Extensive experiments show that AutoProg accelerates ViT pre-training by up to 1.85x on ImageNet and accelerates fine-tuning of diffusion models by up to 2.86x, with comparable or even higher performance. This work provides a robust and scalable approach to efficient training of LVMs, with potential applications in a wide range of vision tasks. Code: https://github.com/changlin31/AutoProg-Zero
Related papers
- TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [32.406783380729024]
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes.
Current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data.
We introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models.
arXiv Detail & Related papers (2024-09-19T07:10:18Z) - Affordance-Guided Reinforcement Learning via Visual Prompting [51.361977466993345]
Keypoint-based Affordance Guidance for Improvements (KAGI) is a method leveraging rewards shaped by vision-language models (VLMs) for autonomous RL.
On real-world manipulation tasks specified by natural language descriptions, KAGI improves the sample efficiency of autonomous RL and enables successful task completion in 20K online fine-tuning steps.
arXiv Detail & Related papers (2024-07-14T21:41:29Z) - DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning [61.10299147201369]
This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device control agents.
We build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator.
We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement.
arXiv Detail & Related papers (2024-06-14T17:49:55Z) - Forging Vision Foundation Models for Autonomous Driving: Challenges,
Methodologies, and Opportunities [59.02391344178202]
Vision foundation models (VFMs) serve as potent building blocks for a wide range of AI applications.
The scarcity of comprehensive training data, the need for multi-sensor integration, and the diverse task-specific architectures pose significant obstacles to the development of VFMs.
This paper delves into the critical challenge of forging VFMs tailored specifically for autonomous driving, while also outlining future directions.
arXiv Detail & Related papers (2024-01-16T01:57:24Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - Local Masking Meets Progressive Freezing: Crafting Efficient Vision
Transformers for Self-Supervised Learning [0.0]
We present an innovative approach to self-supervised learning for Vision Transformers (ViTs)
This method focuses on enhancing the efficiency and speed of initial layer training in ViTs.
Our approach employs a novel multi-scale reconstruction process that fosters efficient learning in initial layers.
arXiv Detail & Related papers (2023-12-02T11:10:09Z) - EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge
Distillation and Modal-adaptive Pruning [19.354515754130592]
We introduce a distilling then pruning framework to compress large vision-language models into smaller, faster, and more accurate ones.
We apply our framework to train EfficientVLM, a fast and accurate vision-language model consisting of 6 vision layers, 3 text layers, and 3 cross-modal fusion layers.
EfficientVLM retains 98.4% performance of the teacher model and accelerates its inference speed by 2.2x.
arXiv Detail & Related papers (2022-10-14T13:26:41Z) - Automated Progressive Learning for Efficient Training of Vision
Transformers [125.22744987949227]
Vision Transformers (ViTs) have come with a voracious appetite for computing power, high-lighting the urgent need to develop efficient training methods for ViTs.
Progressive learning, a training scheme where the model capacity grows progressively during training, has started showing its ability in efficient training.
In this paper, we take a practical step towards efficient training of ViTs by customizing and automating progressive learning.
arXiv Detail & Related papers (2022-03-28T05:37:08Z) - Steadily Learn to Drive with Virtual Memory [11.67256846037979]
This paper proposes an algorithm called Learn to drive with Virtual Memory (LVM) to overcome these problems.
LVM compresses the high-dimensional information into compact latent states and learns a latent dynamic model to summarize the agent's experience.
The effectiveness of LVM is demonstrated by an image-input autonomous driving task.
arXiv Detail & Related papers (2021-02-16T10:46:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.