ARCH: Efficient Adversarial Regularized Training with Caching
- URL: http://arxiv.org/abs/2109.07048v1
- Date: Wed, 15 Sep 2021 02:05:37 GMT
- Title: ARCH: Efficient Adversarial Regularized Training with Caching
- Authors: Simiao Zuo, Chen Liang, Haoming Jiang, Pengcheng He, Xiaodong Liu,
Jianfeng Gao, Weizhu Chen, Tuo Zhao
- Abstract summary: Adversarial regularization can improve model generalization in many natural language processing tasks.
We propose a new adversarial regularization method ARCH, where perturbations are generated and cached once every several epochs.
We evaluate our proposed method on a set of neural machine translation and natural language understanding tasks.
- Score: 91.74682538906691
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adversarial regularization can improve model generalization in many natural
language processing tasks. However, conventional approaches are computationally
expensive since they need to generate a perturbation for each sample in each
epoch. We propose a new adversarial regularization method ARCH (adversarial
regularization with caching), where perturbations are generated and cached once
every several epochs. As caching all the perturbations imposes memory usage
concerns, we adopt a K-nearest neighbors-based strategy to tackle this issue.
The strategy requires caching only a small number of perturbations, without
introducing additional training time. We evaluate our proposed method on a set
of neural machine translation and natural language understanding tasks. We
observe that ARCH significantly eases the computational burden (saving up to
70% of computational time compared with conventional approaches). More
surprisingly, by reducing the variance of stochastic gradients, ARCH produces
notably better (on most tasks) or comparable model generalization. Our
code is publicly available.
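The abstract describes the mechanism only at a high level. Below is a minimal PyTorch sketch of the caching idea, assuming a toy classifier, a single-step L2-bounded perturbation, a KL-divergence regularizer, and a brute-force nearest-neighbor lookup into the cache; none of these choices, nor the hyperparameters, come from the paper or its released code. The point of the sketch is the control flow: perturbations are regenerated only every few epochs and only for a small cached subset, while all other examples reuse their nearest cached neighbor's perturbation.

```python
# Minimal sketch of adversarial regularization with cached perturbations.
# All modeling choices below (toy data/model, single-step L2 perturbation,
# KL regularizer, k=1 nearest-neighbor cache lookup) are illustrative only.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(512, 32)                          # toy inputs
y = (X.sum(dim=1) > 0).long()                     # toy labels
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

CACHE_SIZE, REFRESH_EVERY, EPS, ALPHA = 64, 3, 0.1, 1.0
cache_keys = cache_deltas = None

def generate_perturbations(x, target):
    """One gradient step on the loss w.r.t. the input, scaled to an L2 ball."""
    delta = torch.zeros_like(x, requires_grad=True)
    loss = F.cross_entropy(model(x + delta), target)
    grad, = torch.autograd.grad(loss, delta)
    return (EPS * grad / (grad.norm(dim=1, keepdim=True) + 1e-12)).detach()

for epoch in range(12):
    if epoch % REFRESH_EVERY == 0:
        # Regenerate perturbations only occasionally, and only for a small
        # subset, so that the cache stays cheap to store.
        idx = torch.randperm(X.size(0))[:CACHE_SIZE]
        cache_keys = X[idx]
        cache_deltas = generate_perturbations(X[idx], y[idx])

    # Every example reuses the perturbation of its nearest cached neighbor.
    nearest = torch.cdist(X, cache_keys).argmin(dim=1)
    delta = cache_deltas[nearest]

    clean_logits = model(X)
    adv_logits = model(X + delta)
    adv_reg = F.kl_div(F.log_softmax(adv_logits, dim=1),
                       F.softmax(clean_logits.detach(), dim=1),
                       reduction="batchmean")
    loss = F.cross_entropy(clean_logits, y) + ALPHA * adv_reg
    opt.zero_grad()
    loss.backward()
    opt.step()
```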
Related papers
- A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation [121.0693322732454]
Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity.
Recent research has focused on developing efficient fine-tuning methods to enhance CLIP's performance in downstream tasks.
We revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP; an illustrative sketch follows this entry.
arXiv Detail & Related papers (2024-02-06T15:45:27Z)
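A hedged sketch of the Gaussian Discriminant Analysis baseline mentioned in the entry above: fit class means and a shared covariance on pre-extracted features and classify with the resulting linear discriminant. The random features below are placeholders for CLIP embeddings; the paper's actual pipeline (how features are obtained, any zero-shot fusion) is not reproduced.

```python
# Gaussian Discriminant Analysis with a shared covariance over pre-extracted
# features; the random features below are placeholders for CLIP embeddings.
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 5, 64
feats = rng.normal(size=(500, dim))               # stand-in for image features
labels = rng.integers(0, num_classes, size=500)

means = np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])
centered = feats - means[labels]
cov = centered.T @ centered / len(feats) + 1e-3 * np.eye(dim)   # shared, regularized
precision = np.linalg.inv(cov)
priors = np.bincount(labels, minlength=num_classes) / len(labels)

# With a shared covariance, GDA reduces to a linear classifier:
#   score_c(x) = x^T P mu_c - 0.5 mu_c^T P mu_c + log prior_c
W = means @ precision                              # (num_classes, dim)
b = -0.5 * np.einsum("cd,cd->c", W, means) + np.log(priors)
pred = np.argmax(feats @ W.T + b, axis=1)
print("training accuracy:", (pred == labels).mean())
```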
- Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
The second strategy leverages a novel Re-Ranking technique with a lower time upper-bound complexity, reducing the memory complexity from O(n^2) to O(kn) with k << n; a rough sketch of this memory reduction follows the entry.
arXiv Detail & Related papers (2023-07-26T16:19:19Z)
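The entry above claims a memory reduction from O(n^2) to O(kn); the sketch below illustrates only that memory idea, keeping the k nearest neighbors per sample (computed in chunks) rather than the full pairwise matrix. The paper's actual re-ranking algorithm is not reproduced, and the features are synthetic.

```python
# Keep only the k nearest neighbors per sample, computed in chunks, so that
# memory stays O(kn) instead of the O(n^2) full pairwise similarity matrix.
import numpy as np

rng = np.random.default_rng(0)
n, dim, k, chunk = 2000, 128, 20, 256
feats = rng.normal(size=(n, dim)).astype(np.float32)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

topk_idx = np.empty((n, k), dtype=np.int64)
topk_sim = np.empty((n, k), dtype=np.float32)
for start in range(0, n, chunk):
    sims = feats[start:start + chunk] @ feats.T        # only a chunk-by-n block in memory
    part = np.argpartition(-sims, k, axis=1)[:, :k]    # unordered top-k per row
    vals = np.take_along_axis(sims, part, axis=1)
    order = np.argsort(-vals, axis=1)
    topk_idx[start:start + chunk] = np.take_along_axis(part, order, axis=1)
    topk_sim[start:start + chunk] = np.take_along_axis(vals, order, axis=1)

print(topk_idx.shape, topk_sim.shape)                  # (n, k) each
```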
- Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load [11.069252535469644]
Distributed optimization procedures like stochastic gradient descent (SGD) can be leveraged to mitigate the effect of unresponsive or slow workers, called stragglers.
This can be done by only waiting for a subset of the workers to finish their computation at each iteration of the algorithm.
We construct a novel scheme that adapts both the number of workers and the computation load throughout the run-time of the algorithm.
arXiv Detail & Related papers (2023-04-17T20:12:18Z)
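A hedged simulation of the basic straggler-mitigation idea described above: at each iteration, only the fastest subset of workers is waited for and their gradients are averaged. The fixed subset size, the synthetic linear-regression task, and the exponential compute-time model are illustrative assumptions; the paper's adaptive schedule for worker count and per-worker load is not reproduced.

```python
# Simulated distributed SGD that waits only for the fastest workers each step.
import numpy as np

rng = np.random.default_rng(0)
dim, n_workers, wait_for, lr, batch = 10, 8, 5, 0.1, 32
w_true = rng.normal(size=dim)
X = rng.normal(size=(4000, dim))
y = X @ w_true + 0.1 * rng.normal(size=4000)
shards = np.array_split(np.arange(len(X)), n_workers)           # one data shard per worker

w = np.zeros(dim)
for step in range(200):
    finish_times = rng.exponential(scale=1.0, size=n_workers)   # simulated compute times
    fastest = np.argsort(finish_times)[:wait_for]                # ignore the stragglers
    grads = []
    for wk in fastest:
        idx = rng.choice(shards[wk], size=batch, replace=False)
        grads.append(2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch)
    w -= lr * np.mean(grads, axis=0)

print("parameter error:", np.linalg.norm(w - w_true))
```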
- Compound Batch Normalization for Long-tailed Image Classification [77.42829178064807]
We propose a compound batch normalization method based on a Gaussian mixture.
It can model the feature space more comprehensively and reduce the dominance of head classes.
The proposed method outperforms existing methods on long-tailed image classification.
arXiv Detail & Related papers (2022-12-02T07:31:39Z)
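A much-simplified sketch of normalizing with a Gaussian mixture instead of a single batch mean and variance, as in the compound batch normalization entry above: each sample is softly assigned to K diagonal-Gaussian components and normalized with its responsibility-weighted statistics. Component fitting, affine parameters, running statistics, and the paper's actual training procedure are all omitted; this only makes the "compound" normalization idea concrete.

```python
# Normalize with K diagonal-Gaussian components instead of one mean/variance.
import numpy as np

def compound_bn(x, means, variances, weights, eps=1e-5):
    """x: (N, D); means, variances: (K, D); weights: (K,)."""
    # Log-density of each sample under each component (up to a constant).
    log_p = -0.5 * (((x[:, None, :] - means[None]) ** 2) / (variances[None] + eps)
                    + np.log(variances[None] + eps)).sum(axis=2) + np.log(weights)[None]
    log_p -= log_p.max(axis=1, keepdims=True)
    resp = np.exp(log_p)
    resp /= resp.sum(axis=1, keepdims=True)        # (N, K) soft assignments
    mu = resp @ means                              # per-sample weighted mean
    var = resp @ variances                         # per-sample weighted variance
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# An imbalanced toy batch; the two components are fit from the known split
# purely for demonstration (a real method would estimate them during training).
x = np.concatenate([rng.normal(0, 1, (64, 8)), rng.normal(4, 2, (8, 8))])
means = np.stack([x[:64].mean(0), x[64:].mean(0)])
variances = np.stack([x[:64].var(0), x[64:].var(0)])
print(compound_bn(x, means, variances, np.array([0.5, 0.5])).shape)
```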
- RSC: Accelerating Graph Neural Networks Training via Randomized Sparse Computations [56.59168541623729]
Training graph neural networks (GNNs) is time consuming because sparse graph-based operations are hard to accelerate with hardware.
We explore trading off computational precision to reduce the time complexity via sampling-based approximation.
We propose Randomized Sparse Computation (RSC), which for the first time demonstrates the potential of training GNNs with approximated operations; a rough sampling-based sketch follows this entry.
arXiv Detail & Related papers (2022-10-19T17:25:33Z)
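A hedged sketch of one standard sampling-based approximation to the sparse product A @ H (adjacency times node features): sample columns of A and the matching rows of H, then rescale. This is not the paper's RSC scheme, only an illustration of trading precision for computation; the printed error shrinks as `keep` approaches n.

```python
# Approximate the sparse product A @ H by sampling columns of A (and the
# matching rows of H) with replacement and rescaling the result.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, d, keep = 5000, 64, 1000
A = sp.random(n, n, density=0.002, format="csc", random_state=0)  # sparse "adjacency"
H = rng.normal(size=(n, d))                                       # node features

cols = rng.integers(0, n, size=keep)              # uniform sampling with replacement
approx = (A[:, cols] @ H[cols]) * (n / keep)      # unbiased estimate of A @ H
exact = A @ H
print("relative error:",
      np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```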
- Memory Efficient Continual Learning for Neural Text Classification [10.70710638820641]
We devise a method to perform text classification with pre-trained models on a sequence of classification tasks.
We empirically demonstrate that our method requires significantly fewer model parameters than other state-of-the-art methods.
Our method suffers little forgetting and retains predictive performance on par with state-of-the-art but less memory-efficient methods.
arXiv Detail & Related papers (2022-03-09T10:57:59Z)
- Robust Learning-Augmented Caching: An Experimental Study [8.962235853317996]
The key optimization problem arising in caching cannot be solved optimally without knowing the future.
The new field of learning-augmented algorithms proposes solutions that leverage classical online caching algorithms.
We show that a straightforward method incurs only a low overhead over a well-performing predictor; a sketch of such a prediction-following policy follows this entry.
arXiv Detail & Related papers (2021-06-28T13:15:07Z)
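A hedged sketch of the kind of straightforward prediction-following policy the entry refers to: on a miss with a full cache, evict the item whose predicted next request is furthest in the future (Belady's rule applied to predictions). The robustness mechanisms the paper studies, and any specific learned predictor, are not reproduced; the noisy oracle below is a stand-in.

```python
# Prediction-following eviction: on a miss with a full cache, evict the item
# whose predicted next request is furthest in the future.
import random

def run_cache(requests, predict_next, capacity):
    """requests: list of keys; predict_next(t, key) -> predicted next request time."""
    cache, misses = set(), 0
    for t, key in enumerate(requests):
        if key in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            victim = max(cache, key=lambda k: predict_next(t, k))
            cache.remove(victim)
        cache.add(key)
    return misses

random.seed(0)
trace = [random.choice("abcdefgh") for _ in range(200)]

def noisy_oracle(t, key):
    """True next-request time plus noise, standing in for a learned predictor."""
    future = [i for i in range(t + 1, len(trace)) if trace[i] == key]
    truth = future[0] if future else len(trace)
    return truth + random.randint(0, 5)

print("misses:", run_cache(trace, noisy_oracle, capacity=3))
```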
- Evolving Reinforcement Learning Algorithms [186.62294652057062]
We propose a method for meta-learning reinforcement learning algorithms.
The learned algorithms are domain-agnostic and can generalize to new environments not seen during training.
We highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld type tasks, and Atari games.
arXiv Detail & Related papers (2021-01-08T18:55:07Z)
- Variance reduction for Random Coordinate Descent-Langevin Monte Carlo [7.464874233755718]
Langevin Monte Carlo (LMC), which provides fast convergence, requires the computation of gradient approximations.
In practice one uses finite-differencing approximations as surrogates, and the method is expensive in high dimensions.
We introduce a new variance reduction approach, termed Random Coordinates Averaging Descent (RCAD), and incorporate it with both overdamped and underdamped LMC; a sketch of the underlying random-coordinate gradient estimate follows this entry.
arXiv Detail & Related papers (2020-06-10T21:08:38Z)
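A hedged sketch of the ingredient the entry builds on: overdamped Langevin Monte Carlo in which the full gradient is replaced by an unbiased random-coordinate estimate formed from a central finite difference. The RCAD variance-reduction step itself and the underdamped variant are not reproduced; the Gaussian target and step size are illustrative choices.

```python
# Overdamped Langevin Monte Carlo with a random-coordinate, finite-difference
# gradient estimate (no variance reduction), sampling a standard Gaussian.
import numpy as np

rng = np.random.default_rng(0)
dim, eta, steps, h = 10, 1e-3, 20000, 1e-4

def f(x):
    """Negative log-density of the target (standard Gaussian, up to a constant)."""
    return 0.5 * np.dot(x, x)

x = rng.normal(size=dim)
samples = []
for _ in range(steps):
    i = rng.integers(dim)                                  # pick one coordinate
    e = np.zeros(dim)
    e[i] = 1.0
    partial = (f(x + h * e) - f(x - h * e)) / (2 * h)      # finite-difference partial
    grad_est = dim * partial * e                           # unbiased full-gradient estimate
    x = x - eta * grad_est + np.sqrt(2 * eta) * rng.normal(size=dim)
    samples.append(x.copy())

tail = np.array(samples[steps // 2:])
print("marginal variance (target is 1):", tail.var(axis=0).mean())
```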
- ScaIL: Classifier Weights Scaling for Class Incremental Learning [12.657788362927834]
In a deep learning approach, the constant computational budget requires the use of a fixed architecture for all incremental states.
The bounded memory generates a data imbalance in favor of new classes, and a prediction bias toward them appears.
We propose a simple but efficient scaling of past-class classifier weights to make them more comparable to those of new classes; a simplified sketch follows this entry.
arXiv Detail & Related papers (2020-01-16T12:10:45Z)
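A deliberately simplified sketch of rescaling past-class classifier weights so their magnitudes are comparable to newly trained class weights. Here the scale is a mean-norm ratio, whereas ScaIL's actual scaling is computed from classifiers of initial model states, so this is illustration only.

```python
# Rescale past-class classifier weights to be comparable with new-class weights;
# the mean-norm ratio below is a simplification of the published scaling rule.
import numpy as np

rng = np.random.default_rng(0)
dim, n_past, n_new = 64, 20, 10
past_w = 0.5 * rng.normal(size=(n_past, dim))     # classifiers kept from earlier states
new_w = 1.5 * rng.normal(size=(n_new, dim))       # classifiers trained in the current state

scale = np.linalg.norm(new_w, axis=1).mean() / np.linalg.norm(past_w, axis=1).mean()
W = np.vstack([past_w * scale, new_w])            # combined, scale-comparable classifier

feat = rng.normal(size=dim)                       # placeholder feature vector
print("predicted class:", int(np.argmax(W @ feat)))
```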
This list is automatically generated from the titles and abstracts of the papers on this site.