FastFace: Fast-converging Scheduler for Large-scale Face Recognition Training with One GPU
- URL: http://arxiv.org/abs/2404.11118v2
- Date: Wed, 11 Dec 2024 15:14:55 GMT
- Title: FastFace: Fast-converging Scheduler for Large-scale Face Recognition Training with One GPU
- Authors: Xueyuan Gong, Zhiquan Liu, Yain-Whar Si, Xiaochen Yuan, Ke Wang, Xiaoxiang Liu, Cong Lin, Xinyuan Zhang,
- Abstract summary: We present FastFace, a fast-converging scheduler with negligible time complexity. In practice, FastFace is able to accelerate Face Recognition model training to a quarter of its original time without sacrificing more than 1% accuracy.
- Score: 10.656812733659514
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computing power has evolved into a foundational and indispensable resource in the area of deep learning, particularly in tasks such as Face Recognition (FR) model training on large-scale datasets, where multiple GPUs are often a necessity. Recognizing this challenge, some FR methods have started exploring ways to compress the fully-connected layer in FR models. Unlike other approaches, our observations reveal that without prompt scheduling of the learning rate (LR) during FR model training, the loss curve tends to exhibit numerous stationary subsequences. To address this issue, we introduce a novel LR scheduler leveraging Exponential Moving Average (EMA) and Haar Convolutional Kernel (HCK) to eliminate stationary subsequences, resulting in a significant reduction in convergence time. However, the proposed scheduler incurs a considerable computational overhead due to its time complexity. To overcome this limitation, we propose FastFace, a fast-converging scheduler with negligible time complexity, i.e., O(1) per iteration, during training. In practice, FastFace is able to accelerate FR model training to a quarter of its original time without sacrificing more than 1% accuracy, making large-scale FR training feasible even with a single GPU in terms of both time and space complexity. Extensive experiments validate the efficiency and effectiveness of FastFace. The code is publicly available at: https://github.com/amoonfana/FastFace
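The mechanism described above (smooth the loss with an EMA, probe the smoothed curve with a Haar-like step kernel to detect stationary subsequences, and decay the LR when one is found) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name, window size, thresholds, and decay factor are hypothetical, and maintaining running half-window sums is one way to keep the per-iteration cost at O(1).

```python
# Minimal sketch of an EMA + Haar-kernel LR scheduler for a PyTorch-style
# optimizer.  The class name, window size, thresholds, and decay factor are
# illustrative assumptions, not the authors' implementation.
from collections import deque


class EMAHaarScheduler:
    """Decays the LR when an EMA-smoothed loss curve looks stationary.

    A Haar step kernel (+1 over the older half of a window, -1 over the
    newer half) estimates the local trend of the smoothed loss.  Running
    sums of the two half-windows keep the per-iteration cost at O(1).
    """

    def __init__(self, optimizer, ema_decay=0.99, window=200,
                 flat_threshold=1e-3, lr_decay=0.5, min_lr=1e-5):
        self.optimizer = optimizer
        self.ema_decay = ema_decay
        self.window = window                  # assumed even
        self.flat_threshold = flat_threshold
        self.lr_decay = lr_decay
        self.min_lr = min_lr
        self.ema = None
        self.old_half, self.new_half = deque(), deque()
        self.old_sum, self.new_sum = 0.0, 0.0

    def step(self, loss_value):
        # 1) Exponential moving average of the raw loss.
        loss_value = float(loss_value)
        self.ema = loss_value if self.ema is None else (
            self.ema_decay * self.ema + (1.0 - self.ema_decay) * loss_value)

        # 2) Slide the EMA value through two half-windows (O(1) updates).
        self.new_half.append(self.ema)
        self.new_sum += self.ema
        if len(self.new_half) > self.window // 2:
            moved = self.new_half.popleft()
            self.new_sum -= moved
            self.old_half.append(moved)
            self.old_sum += moved
        if len(self.old_half) > self.window // 2:
            self.old_sum -= self.old_half.popleft()
        if len(self.old_half) < self.window // 2:
            return                            # not enough history yet

        # 3) Haar response = mean(older half) - mean(newer half); a response
        #    near zero means the smoothed loss is stationary -> decay the LR.
        haar = (self.old_sum - self.new_sum) / (self.window // 2)
        if abs(haar) < self.flat_threshold * max(abs(self.ema), 1e-12):
            for group in self.optimizer.param_groups:
                group["lr"] = max(group["lr"] * self.lr_decay, self.min_lr)
            self.old_half.clear()
            self.new_half.clear()
            self.old_sum, self.new_sum = 0.0, 0.0   # one decay per plateau


# Usage inside a training loop (model, loader, criterion, optimizer assumed):
#   scheduler = EMAHaarScheduler(optimizer)
#   for images, labels in loader:
#       loss = criterion(model(images), labels)
#       optimizer.zero_grad(); loss.backward(); optimizer.step()
#       scheduler.step(loss.item())
```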
Related papers
- Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters [24.845122459974466]
This paper proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling algorithm.
By modeling each job as a graph corresponding to heterogeneous Deep Neural Network (DNN) models, A-SRPT strategically assigns jobs to the available GPUs.
A-SRPT maps the complex scheduling problem into a single-machine instance, which is addressed optimally by a preemptive "shortest-remaining-processing-time-first" strategy.
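The single-machine strategy referenced above is classical; a minimal preemptive SRPT simulator looks like the sketch below. The job representation (arrival time, processing time) is an assumption for illustration; A-SRPT's graph modeling of DNN jobs and its processing-time prediction are not reproduced here.

```python
# Toy preemptive SRPT on a single machine: at every moment, run the released
# job with the least remaining processing time.  Illustrates only the
# single-machine strategy, not the full A-SRPT cluster scheduler.
import heapq


def srpt_schedule(jobs):
    """jobs: list of (arrival_time, processing_time); returns completion times."""
    events = sorted((a, i, p) for i, (a, p) in enumerate(jobs))
    ready = []                      # heap of (remaining_time, job_id)
    completion = [None] * len(jobs)
    t, k = 0.0, 0

    while k < len(events) or ready:
        if not ready:               # idle until the next arrival
            t = max(t, events[k][0])
        while k < len(events) and events[k][0] <= t:
            arrival, job_id, proc = events[k]
            heapq.heappush(ready, (proc, job_id))
            k += 1
        remaining, job_id = heapq.heappop(ready)
        # Run the shortest job until it finishes or the next job arrives.
        next_arrival = events[k][0] if k < len(events) else float("inf")
        run = min(remaining, next_arrival - t)
        t += run
        if run < remaining:
            heapq.heappush(ready, (remaining - run, job_id))   # preempted
        else:
            completion[job_id] = t
    return completion


# Example: three jobs given as (arrival, processing time).
print(srpt_schedule([(0, 5), (1, 2), (2, 1)]))   # -> [8.0, 3.0, 4.0]
```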
arXiv Detail & Related papers (2025-01-09T20:19:01Z) - DeMo: Decoupled Momentum Optimization [6.169574689318864]
Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects.
We introduce Decoupled Momentum (DeMo), a fused optimizer and data parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude.
Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW.
arXiv Detail & Related papers (2024-11-29T17:31:47Z) - Photon: Federated LLM Pre-Training [17.368070785118654]
We introduce Photon, the first complete system for federated end-to-end LLM training.
We show that Photon can train models of up to 7B parameters in a federated fashion while reaching an even better perplexity than centralized pre-training.
arXiv Detail & Related papers (2024-11-05T08:48:25Z) - Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
We propose an algorithm that enables fast and high-quality generation under arbitrary constraints.
During inference, we can interchange between gradient updates computed on the noisy image and updates computed on the final, clean image.
Our approach produces results that rival or surpass the state-of-the-art training-free inference approaches.
arXiv Detail & Related papers (2024-10-24T14:52:38Z) - Truncated Consistency Models [57.50243901368328]
Training consistency models requires learning to map all intermediate points along probability flow (PF) ODE trajectories to their corresponding endpoints.
We empirically find that this training paradigm limits the one-step generation performance of consistency models.
We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution.
arXiv Detail & Related papers (2024-10-18T22:38:08Z) - Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Large Language Model (LLM) pretraining traditionally relies on autoregressive language modeling on randomly sampled data blocks from web-scale datasets.
Taking inspiration from human learning techniques such as spaced repetition, we hypothesize that random data sampling for LLMs leads to high training cost and low-quality models that tend to forget data.
In order to effectively commit web-scale information to long-term memory, we propose the LFR (Learn, Focus, and Review) pedagogy.
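One hedged way to picture the learn-focus-review idea is a loss-aware block sampler that prefers unseen blocks and then revisits high-loss blocks more often; the temperature, EMA decay, and sampler structure below are illustrative assumptions rather than the paper's exact schedule.

```python
# Minimal loss-aware "learn / focus / review" block sampler.  Unseen blocks
# are sampled first; thereafter blocks are drawn proportionally to
# exp(loss / T), so high-loss blocks are revisited more often.  Temperature
# and EMA decay are illustrative hyperparameters.
import math
import random


class FocusReviewSampler:
    def __init__(self, num_blocks, temperature=1.0, ema=0.9):
        self.losses = [float("inf")] * num_blocks   # inf marks unseen blocks
        self.temperature = temperature
        self.ema = ema

    def sample(self):
        unseen = [i for i, l in enumerate(self.losses) if math.isinf(l)]
        if unseen:                                  # "learn" new data first
            return random.choice(unseen)
        m = max(self.losses)                        # stabilize the softmax
        weights = [math.exp((l - m) / self.temperature) for l in self.losses]
        return random.choices(range(len(self.losses)), weights=weights, k=1)[0]

    def update(self, block_id, loss):
        old = self.losses[block_id]
        self.losses[block_id] = loss if math.isinf(old) else (
            self.ema * old + (1.0 - self.ema) * loss)


# Usage: block = sampler.sample(); loss = train_on(block); sampler.update(block, loss)
```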
arXiv Detail & Related papers (2024-09-10T00:59:18Z) - Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping [14.435637320909663]
The Mixture-of-Experts (MoE) technique plays a crucial role in expanding the size of DNN model parameters.
Existing methods attempt to mitigate the resulting all-to-all communication overhead by overlapping it with expert computation.
In our study, we extend the scope of this challenge by considering overlap at the broader training graph level.
We implement these techniques in Lancet, a system using compiler-based optimization to automatically enhance MoE model training.
arXiv Detail & Related papers (2024-04-30T10:17:21Z) - Always-Sparse Training by Growing Connections with Guided Stochastic Exploration [46.4179239171213]
We propose an efficient always-sparse training algorithm with excellent scaling to larger and sparser models.
We evaluate our method on CIFAR-10/100 and ImageNet using VGG and ViT models, and compare it against a range of sparsification methods.
arXiv Detail & Related papers (2024-01-12T21:32:04Z) - A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first explore the computationally redundant parts of the network.
We then prune the redundant blocks of the model while maintaining network performance.
Thirdly, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z) - Efficient Asynchronous Federated Learning with Sparsification and Quantization [55.6801207905772]
Federated Learning (FL) is attracting increasing attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally exploits a parameter server and a large number of edge devices during the whole process of the model training.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z) - Towards Federated Learning Under Resource Constraints via Layer-wise Training and Depth Dropout [33.308067180286045]
Federated learning can be difficult to scale to large models when clients have limited resources.
We introduce Federated Layer-wise Learning to simultaneously reduce per-client memory, computation, and communication costs.
We also introduce Federated Depth Dropout, a complementary technique that randomly drops frozen layers during training, to further reduce resource usage.
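A rough sketch of the depth-dropout idea, assuming shape-preserving (e.g., residual) blocks and PyTorch: frozen blocks are randomly skipped during client training. The drop probability and the choice of which blocks are frozen are illustrative, not the paper's exact procedure.

```python
# Sketch of depth dropout over frozen blocks (PyTorch).  Assumes the blocks
# preserve tensor shape (e.g., residual blocks), so a skipped block acts as
# an identity.  The frozen prefix and drop probability are illustrative.
import random

import torch.nn as nn


class DepthDropout(nn.Module):
    def __init__(self, blocks, frozen_upto, drop_prob=0.5):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.frozen_upto = frozen_upto
        self.drop_prob = drop_prob
        # Layer-wise training: only blocks after `frozen_upto` stay trainable.
        for block in self.blocks[:frozen_upto]:
            for p in block.parameters():
                p.requires_grad_(False)

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            if self.training and i < self.frozen_upto and random.random() < self.drop_prob:
                continue                    # randomly skip a frozen block
            x = block(x)
        return x
```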
arXiv Detail & Related papers (2023-09-11T03:17:45Z) - SPEED: Streaming Partition and Parallel Acceleration for Temporal Interaction Graph Embedding [22.68416593780539]
We introduce a novel training approach named Streaming Edge Partitioning and Parallel Acceleration for Temporal Interaction Graph Embedding (SPEED).
Our method can achieve a good balance in computing resources, computing time, and downstream task performance.
Empirical validation across 7 real-world datasets demonstrates the potential to expedite training speeds by a factor of up to 19.29x.
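For intuition, a generic greedy streaming edge partitioner assigns each incoming interaction edge to the partition with the most endpoint locality, breaking ties by load; this is a standard heuristic sketched under assumptions, not SPEED's exact assignment rule.

```python
# Generic greedy streaming edge partitioner: each incoming interaction edge
# goes to the partition that already contains more of its endpoints, with
# load as the tie-breaker.  Illustrative only; not SPEED's exact rule.
def stream_partition(edges, num_parts):
    """edges: iterable of (u, v) pairs; returns a partition id per edge."""
    part_nodes = [set() for _ in range(num_parts)]   # nodes seen per partition
    load = [0] * num_parts                           # edges per partition
    assignment = []
    for u, v in edges:
        best = max(range(num_parts),
                   key=lambda p: ((u in part_nodes[p]) + (v in part_nodes[p]),
                                  -load[p]))
        part_nodes[best].update((u, v))
        load[best] += 1
        assignment.append(best)
    return assignment


print(stream_partition([(0, 1), (1, 2), (2, 3), (3, 0), (4, 5)], num_parts=2))
# -> [0, 0, 0, 0, 1]
```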
arXiv Detail & Related papers (2023-08-27T15:11:44Z) - Learning Decorrelated Representations Efficiently Using Fast Fourier Transform [3.932322649674071]
We propose a relaxed decorrelating regularizer that can be computed in O(n d log d) time by Fast Fourier Transform.
The proposed regularizer exhibits accuracy comparable to that of existing regularizers in downstream tasks, while requiring less memory and training faster for large d.
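The O(n d log d) flavour of such a regularizer can be illustrated with a circular-shift relaxation: instead of penalizing every off-diagonal covariance separately, penalize the mean circular cross-covariance at each nonzero shift, which one FFT per sample computes all at once. The sketch below captures this spirit but is not claimed to be the paper's exact formulation.

```python
# Illustrative relaxed decorrelation penalty computed with the FFT in
# O(n d log d): it drives the mean circular cross-covariance at every
# nonzero shift towards zero, aggregating the off-diagonal covariances.
import torch


def relaxed_decorrelation_loss(z, eps=1e-6):
    """z: (n, d) feature batch -> scalar penalty."""
    n, d = z.shape
    z = z - z.mean(dim=0, keepdim=True)            # center each dimension
    z = z / (z.std(dim=0, keepdim=True) + eps)     # unit variance per dimension
    spec = torch.fft.rfft(z, dim=1)
    # Wiener-Khinchin: each row's circular autocorrelation from its power spectrum.
    autocorr = torch.fft.irfft(spec * spec.conj(), n=d, dim=1)
    # r[s] = sum over j of Cov(dim j, dim (j+s) mod d); shift 0 is total variance.
    r = autocorr.mean(dim=0)
    return (r[1:] ** 2).mean()


# Usage: total_loss = task_loss + lam * relaxed_decorrelation_loss(features)
```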
arXiv Detail & Related papers (2023-01-04T12:38:08Z) - From Actions to Events: A Transfer Learning Approach Using Improved Deep Belief Networks [1.0554048699217669]
This paper proposes a novel approach to map the knowledge from action recognition to event recognition using an energy-based model.
Such a model can process all frames simultaneously, carrying spatial and temporal information through the learning process.
arXiv Detail & Related papers (2022-11-30T14:47:10Z) - Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
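The adjustment hinted at here is usually the linear (or square-root) LR scaling heuristic; the snippet below states it explicitly, with the caveat that the paper's exact rule may differ.

```python
# Linear-scaling heuristic for large mini-batches: multiply the LR by the
# same factor as the batch size (square-root scaling is a common alternative).
# The paper's exact adjustment may differ.
def scaled_lr(base_lr, base_batch, new_batch, rule="linear"):
    k = new_batch / base_batch
    return base_lr * (k if rule == "linear" else k ** 0.5)


print(scaled_lr(3e-4, 256, 4096))          # linear: 3e-4 * 16 = 4.8e-03
print(scaled_lr(3e-4, 256, 4096, "sqrt"))  # sqrt:   3e-4 * 4  = 1.2e-03
```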
arXiv Detail & Related papers (2022-11-20T21:48:25Z) - EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones [80.662250618795]
This paper presents a new curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers).
As an off-the-shelf method, it reduces the wall-time training cost of a wide variety of popular models by >1.5x on ImageNet-1K/22K without sacrificing accuracy.
arXiv Detail & Related papers (2022-11-17T17:38:55Z) - RSC: Accelerating Graph Neural Networks Training via Randomized Sparse Computations [56.59168541623729]
Training graph neural networks (GNNs) is time-consuming because sparse graph-based operations are difficult to accelerate in hardware.
We explore trading off the computational precision to reduce the time complexity via sampling-based approximation.
We propose Randomized Sparse Computation, which for the first time demonstrates the potential of training GNNs with approximated operations.
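As a hedged illustration of sampling-based approximation, the sparse aggregation A @ X in GNN message passing can be estimated by keeping a random subset of each node's neighbors and rescaling; the estimator below is generic, not RSC's specific scheme.

```python
# Sampling-based approximation of the sparse aggregation A @ X in GNN message
# passing: keep a random subset of each node's neighbors and rescale for an
# unbiased sum estimate.  Generic estimator for illustration, not RSC's scheme.
import numpy as np


def sampled_aggregate(adj_lists, X, sample_size=5, rng=None):
    """adj_lists: list of neighbor-index lists; X: (num_nodes, dim) features."""
    rng = rng or np.random.default_rng(0)
    out = np.zeros_like(X, dtype=float)
    for v, neighbors in enumerate(adj_lists):
        if not neighbors:
            continue
        if len(neighbors) > sample_size:
            picked = rng.choice(neighbors, size=sample_size, replace=False)
            scale = len(neighbors) / sample_size   # rescale the sampled sum
        else:
            picked, scale = neighbors, 1.0
        out[v] = scale * X[np.asarray(picked)].sum(axis=0)
    return out


# Usage: H_next = relu(sampled_aggregate(adj_lists, H) @ W)
```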
arXiv Detail & Related papers (2022-10-19T17:25:33Z) - Nesterov Accelerated ADMM for Fast Diffeomorphic Image Registration [63.15453821022452]
Recent developments in approaches based on deep learning have achieved sub-second runtimes for diffeomorphic image registration (DiffIR).
We propose a simple iterative scheme that functionally composes intermediate non-stationary velocity fields.
We then propose a convex optimisation model that uses a regularisation term of arbitrary order to impose smoothness on these velocity fields.
arXiv Detail & Related papers (2021-09-26T19:56:45Z) - Multi-Agent Semi-Siamese Training for Long-tail and Shallow Face Learning [54.13876727413492]
In many real-world scenarios of face recognition, the depth of the training dataset is shallow, which means only two face images are available for each ID.
With a non-uniform increase of samples, this issue becomes a more general case, i.e., long-tail face learning.
Based on Semi-Siamese Training (SST), we introduce an advanced solution named Multi-Agent Semi-Siamese Training (MASST).
MASST includes a probe network and multiple gallery agents; the former encodes the probe features, and the latter constitutes a stack of gallery networks.
arXiv Detail & Related papers (2021-05-10T04:57:32Z) - Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of information from both intra- and inter-images.
It is even comparable to the contrastive learning methods when only half of the training batches are used.
arXiv Detail & Related papers (2021-04-01T08:09:26Z) - GPU-Accelerated Primal Learning for Extremely Fast Large-Scale Classification [10.66048003460524]
One of the most efficient methods to solve L2-regularized primal problems, such as logistic regression and linear support vector machine (SVM) classification, is the widely used trust region Newton algorithm, TRON.
We show that using judicious GPU-optimization principles, TRON training time for different losses and feature representations may be drastically reduced.
arXiv Detail & Related papers (2020-08-08T03:40:27Z) - Semi-Siamese Training for Shallow Face Learning [78.7386209619276]
We introduce a novel training method named Semi-Siamese Training (SST).
A pair of Semi-Siamese networks constitute the forward propagation structure, and the training loss is computed with an updating gallery queue.
Our method is developed without extra-dependency, thus can be flexibly integrated with the existing loss functions and network architectures.
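The updating gallery queue can be pictured as a fixed-size FIFO of gallery features against which probe features are classified; the queue size, temperature, and FIFO replacement below are illustrative assumptions, not the exact SST procedure.

```python
# Sketch of an updating gallery queue (PyTorch): newly encoded gallery
# features overwrite the oldest queue entries, and probe features are
# classified against the whole queue.  Queue size, temperature, and FIFO
# replacement are illustrative choices.
import torch
import torch.nn.functional as F


class GalleryQueue:
    def __init__(self, dim, size=16384):
        self.feats = F.normalize(torch.randn(size, dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, gallery_feats):
        g = F.normalize(gallery_feats, dim=1)
        idx = (self.ptr + torch.arange(g.shape[0])) % self.feats.shape[0]
        self.feats[idx] = g
        self.ptr = int((self.ptr + g.shape[0]) % self.feats.shape[0])

    def loss(self, probe_feats, positive_idx, temperature=0.07):
        # positive_idx: long tensor giving each probe's matching queue slot.
        logits = F.normalize(probe_feats, dim=1) @ self.feats.t() / temperature
        return F.cross_entropy(logits, positive_idx)
```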
arXiv Detail & Related papers (2020-07-16T15:20:04Z)