Faster Convergence for Transformer Fine-tuning with Line Search Methods
- URL: http://arxiv.org/abs/2403.18506v1
- Date: Wed, 27 Mar 2024 12:35:23 GMT
- Title: Faster Convergence for Transformer Fine-tuning with Line Search Methods
- Authors: Philip Kenneweg, Leonardo Galli, Tristan Kenneweg, Barbara Hammer
- Abstract summary: In this work we succeed in extending line search methods to the novel and highly popular Transformer architecture and dataset domains.
Our work is publicly available as a Python package, which provides a hyperparameter-free PyTorch optimizer that is compatible with arbitrary network architectures.
- Score: 6.138522679357102
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works have shown that line search methods greatly increase the performance of traditional stochastic gradient descent methods on a variety of datasets and architectures [1], [2]. In this work we succeed in extending line search methods to the novel and highly popular Transformer architecture and dataset domains in natural language processing. More specifically, we combine the Armijo line search with the Adam optimizer and extend it by subdividing the network's architecture into sensible units and performing the line search separately on these local units. Our optimization method outperforms the traditional Adam optimizer and achieves significant performance improvements for small datasets or small training budgets, while performing equal or better in the other tested cases. Our work is publicly available as a Python package, which provides a hyperparameter-free PyTorch optimizer that is compatible with arbitrary network architectures.
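For illustration, the sketch below shows one way a per-unit Armijo line search can be combined with an Adam-style search direction in PyTorch, in the spirit of the abstract above. This is a minimal sketch under assumptions, not the authors' released package: the unit split, the helper names (`adam_direction`, `armijo_step`), and the constants (the sufficient-decrease factor `c`, the shrink factor, the backtracking budget) are hypothetical choices, and Adam's bias correction is omitted. The acceptance test is the Armijo condition L(θ + t·d) ≤ L(θ) + c·t·∇L(θ)ᵀd.

```python
import torch


def adam_direction(params, state, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style search direction for one unit (bias correction omitted for brevity)."""
    directions = []
    for p in params:
        g = p.grad.detach()
        m, v = state.get(p, (torch.zeros_like(g), torch.zeros_like(g)))
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        state[p] = (m, v)
        directions.append(-m / (v.sqrt() + eps))  # negative scaled momentum = descent direction
    return directions


@torch.no_grad()
def armijo_step(closure, params, directions, t0=1.0, c=0.1, shrink=0.5, max_backtracks=10):
    """Backtracking Armijo line search along `directions` for one unit of parameters."""
    loss0 = closure()  # loss at the current parameters
    # directional derivative g^T d; negative when d is a descent direction
    dd = sum((p.grad * d).sum() for p, d in zip(params, directions))
    t = t0
    for _ in range(max_backtracks):
        for p, d in zip(params, directions):      # try the candidate step
            p.add_(d, alpha=t)
        if closure() <= loss0 + c * t * dd:       # Armijo sufficient-decrease condition
            return t                              # accept this step size
        for p, d in zip(params, directions):      # reject: undo the step and shrink it
            p.add_(d, alpha=-t)
        t *= shrink
    return 0.0                                    # no acceptable step found


# Usage sketch for one training step (all names hypothetical):
#   units = [list(model.encoder.parameters()), list(model.classifier.parameters())]
#   adam_states = [dict() for _ in units]
#   loss = criterion(model(x), y); model.zero_grad(); loss.backward()
#   closure = lambda: criterion(model(x), y)
#   for unit, state in zip(units, adam_states):
#       armijo_step(closure, unit, adam_direction(unit, state))
```

Running the search separately per unit lets different parts of the network (e.g., embeddings versus attention blocks) accept different step sizes, which is the local-unit idea described in the abstract.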
Related papers
- CaAdam: Improving Adam optimizer using connection aware methods [0.0]
We introduce a new method inspired by Adam that enhances convergence speed and achieves better loss function minima.
Traditional methods, including Adam, apply uniform or globally adjusted learning rates across neural networks without considering their architectural specifics.
Our algorithm, CaAdam, explores this overlooked area by introducing connection-aware optimization through carefully designed proxies of architectural information (see the illustrative sketch after this list).
arXiv Detail & Related papers (2024-10-31T17:59:46Z)
- No learning rates needed: Introducing SALSA -- Stable Armijo Line Search Adaptation [4.45108516823267]
We identify problems of current state-of-the-art line search methods, propose enhancements, and rigorously assess their effectiveness.
We evaluate these methods on datasets that are orders of magnitude larger and on more complex data domains than previously done.
Our work is publicly available in a Python package, which provides a simple PyTorch optimizer.
arXiv Detail & Related papers (2024-07-30T08:47:02Z)
- Improving Line Search Methods for Large Scale Neural Network Training [4.862490782515929]
We identify existing issues in state-of-the-art line search methods, propose enhancements, and rigorously evaluate their effectiveness.
We improve the Armijo line search by integrating the momentum term from ADAM in its search direction, enabling efficient large-scale training.
Our evaluation focuses on Transformers and CNNs in the domains of NLP and image data.
arXiv Detail & Related papers (2024-03-27T12:50:27Z)
- Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z)
- An algorithmic framework for the optimization of deep neural networks architectures and hyperparameters [0.23301643766310373]
We propose an algorithmic framework to automatically generate efficient deep neural networks.
The framework is based on evolving directed acyclic graphs (DAGs).
It allows mixtures of different classical operations: convolutions, recurrences and dense layers, but also more newfangled operations such as self-attention.
arXiv Detail & Related papers (2023-02-27T08:00:33Z)
- Efficient Non-Parametric Optimizer Search for Diverse Tasks [93.64739408827604]
We present the first efficient, scalable, and general framework that can directly search on the tasks of interest.
Inspired by the innate tree structure of the underlying math expressions, we re-arrange the spaces into a super-tree.
We adopt an adaptation of the Monte Carlo method to tree search, equipped with rejection sampling and equivalent-form detection.
arXiv Detail & Related papers (2022-09-27T17:51:31Z) - Shapley-NAS: Discovering Operation Contribution for Neural Architecture
Search [96.20505710087392]
We propose a Shapley value based method to evaluate operation contribution (Shapley-NAS) for neural architecture search.
We show that our method outperforms the state-of-the-art methods by a considerable margin with light search cost.
arXiv Detail & Related papers (2022-06-20T14:41:49Z) - Pruning-as-Search: Efficient Neural Architecture Search via Channel
Pruning and Structural Reparameterization [50.50023451369742]
Pruning-as-Search (PaS) is an end-to-end channel pruning method to search out the desired sub-network automatically and efficiently.
Our proposed architecture outperforms prior arts by around 1.0% top-1 accuracy on the ImageNet-1000 classification task.
arXiv Detail & Related papers (2022-06-02T17:58:54Z)
- DAAS: Differentiable Architecture and Augmentation Policy Search [107.53318939844422]
This work considers the possible coupling between neural architectures and data augmentation and proposes an effective algorithm jointly searching for them.
Our approach achieves 97.91% accuracy on CIFAR-10 and 76.6% Top-1 accuracy on the ImageNet dataset, showing the outstanding performance of our search algorithm.
arXiv Detail & Related papers (2021-09-30T17:15:17Z)
- Rethinking Architecture Selection in Differentiable NAS [74.61723678821049]
Differentiable Neural Architecture Search is one of the most popular NAS methods for its search efficiency and simplicity.
We propose an alternative perturbation-based architecture selection that directly measures each operation's influence on the supernet.
We find that several failure modes of DARTS can be greatly alleviated with the proposed selection method.
arXiv Detail & Related papers (2021-08-10T00:53:39Z)
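For the CaAdam entry above, the sketch below illustrates the general idea of connection-aware, per-layer learning-rate scaling using plain PyTorch parameter groups. The 1/sqrt(fan-in) proxy and the helper name `connection_scaled_groups` are assumptions made for illustration only; they are not CaAdam's actual scaling functions.

```python
import torch.nn as nn


def connection_scaled_groups(model: nn.Module, base_lr: float = 1e-3):
    """Build per-layer parameter groups whose learning rates are scaled by a simple
    architectural proxy (here: 1/sqrt(fan-in)). Illustrative assumption only; CaAdam's
    actual connection-aware scaling functions are defined in the paper."""
    groups, assigned = [], set()
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            if isinstance(module, nn.Linear):
                fan_in = module.in_features
            else:
                fan_in = module.in_channels * module.kernel_size[0] * module.kernel_size[1]
            params = list(module.parameters())
            assigned.update(id(p) for p in params)
            groups.append({"params": params, "lr": base_lr * fan_in ** -0.5})
    # everything not covered above (embeddings, norms, ...) keeps the base learning rate
    rest = [p for p in model.parameters() if id(p) not in assigned]
    if rest:
        groups.append({"params": rest, "lr": base_lr})
    return groups


# usage sketch: hand the scaled groups to a standard Adam optimizer
# optimizer = torch.optim.Adam(connection_scaled_groups(model))
```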
This list is automatically generated from the titles and abstracts of the papers on this site.