Optimizing transformer-based machine translation model for single GPU training: a hyperparameter ablation study
- URL: http://arxiv.org/abs/2308.06017v1
- Date: Fri, 11 Aug 2023 08:47:52 GMT
- Title: Optimizing transformer-based machine translation model for single GPU training: a hyperparameter ablation study
- Authors: Luv Verma, Ketaki N. Kolhatkar
- Abstract summary: In machine translation tasks, the relationship between model complexity and performance is often presumed to be linear.
This study systematically investigates the effects of hyperparameters through ablation on a sequence-to-sequence machine translation pipeline.
Contrary to expectations, our experiments reveal that combinations with the most parameters were not necessarily the most effective.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In machine translation tasks, the relationship between model complexity and
performance is often presumed to be linear, driving an increase in the number
of parameters and consequent demands for computational resources like multiple
GPUs. To explore this assumption, this study systematically investigates the
effects of hyperparameters through ablation on a sequence-to-sequence machine
translation pipeline, utilizing a single NVIDIA A100 GPU. Contrary to
expectations, our experiments reveal that combinations with the most parameters
were not necessarily the most effective. This unexpected insight prompted a
careful reduction in parameter sizes, uncovering "sweet spots" that enable
training sophisticated models on a single GPU without compromising translation
quality. The findings demonstrate an intricate relationship between
hyperparameter selection, model size, and computational resource needs. The
insights from this study contribute to the ongoing efforts to make machine
translation more accessible and cost-effective, emphasizing the importance of
precise hyperparameter tuning over mere scaling.
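The paper's code is not reproduced here; the sketch below only illustrates the kind of ablation grid such a study sweeps, using assumed hyperparameter values, an assumed vocabulary size, and a crude single-A100 memory heuristic rather than the authors' actual settings.

```python
# Minimal sketch (not the authors' code): enumerate a hyperparameter grid for a
# transformer encoder-decoder and estimate parameter counts, so single-GPU
# "sweet spot" candidates can be shortlisted before any training run.
# The grid values and memory heuristic are illustrative assumptions.
from itertools import product

VOCAB = 32000          # assumed joint source/target vocabulary size
A100_BUDGET_GB = 40.0  # single NVIDIA A100 (40 GB variant) as in the paper

def transformer_params(d_model, n_layers, d_ff, vocab=VOCAB):
    """Rough parameter count for an encoder-decoder transformer."""
    attn = 4 * d_model * d_model                  # Q, K, V, output projections
    ffn = 2 * d_model * d_ff                      # two feed-forward matrices
    enc_layer = attn + ffn
    dec_layer = 2 * attn + ffn                    # self- plus cross-attention
    embed = vocab * d_model                       # shared embedding / softmax
    return n_layers * (enc_layer + dec_layer) + embed

grid = {
    "d_model": [256, 512, 1024],
    "n_layers": [2, 4, 6],
    "d_ff": [1024, 2048, 4096],
    "dropout": [0.1, 0.3],
}

candidates = []
for d_model, n_layers, d_ff, dropout in product(*grid.values()):
    n_params = transformer_params(d_model, n_layers, d_ff)
    # Crude heuristic: weights, gradients, and Adam moments (~4x fp32 weights).
    est_gb = n_params * 4 * 4 / 1e9
    if est_gb <= A100_BUDGET_GB:
        candidates.append((n_params, d_model, n_layers, d_ff, dropout))

for n_params, d_model, n_layers, d_ff, dropout in sorted(candidates):
    print(f"{n_params/1e6:7.1f}M params | d_model={d_model:4d} "
          f"layers={n_layers} d_ff={d_ff:4d} dropout={dropout}")
```

In practice each surviving configuration would then be trained and scored (e.g. by BLEU), which is the ablation the paper reports.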
Related papers
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of the parameters.
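The paper's predictor-corrector formulation is not reproduced here; the snippet below only shows the generic exponential-moving-average update that "EMA-based coefficient learning" refers to, with made-up coefficient values.

```python
# Generic exponential moving average (EMA) update, shown only to illustrate the
# smoothing mechanism behind "EMA-based coefficient learning"; the paper's actual
# predictor-corrector coefficients and update rule are not reproduced here.
import numpy as np

def ema_update(ema_coeffs, new_coeffs, beta=0.99):
    """Blend freshly estimated coefficients into a running EMA."""
    return beta * ema_coeffs + (1.0 - beta) * new_coeffs

ema = np.zeros(4)                        # hypothetical higher-order coefficients
for step in range(1, 1001):
    noisy_estimate = np.array([0.5, 0.3, 0.15, 0.05]) + 0.1 * np.random.randn(4)
    ema = ema_update(ema, noisy_estimate)
print(np.round(ema, 3))                  # settles near the underlying values
```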
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
- Efficient Hyperparameter Importance Assessment for CNNs [1.7778609937758323]
This paper aims to quantify the importance weights of some hyperparameters in Convolutional Neural Networks (CNNs) with an algorithm called N-RReliefF.
We conduct an extensive study by training over ten thousand CNN models across ten popular image classification datasets.
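N-RReliefF itself is not implemented below; as a simpler stand-in, the sketch estimates hyperparameter importance with permutation importance over a synthetic table of configurations and accuracies, just to show the general idea of ranking hyperparameters by their influence on performance.

```python
# Stand-in for hyperparameter importance assessment: NOT N-RReliefF, but a simple
# permutation-importance estimate over a synthetic configuration/accuracy table.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(1e-4, 1e-1, n),   # learning rate
    rng.integers(16, 257, n),     # batch size
    rng.uniform(0.0, 0.5, n),     # dropout
])
# Synthetic "accuracy": learning rate matters most, dropout a little, batch size barely.
y = 0.9 - 0.02 * (np.log10(X[:, 0]) + 2.5) ** 2 - 0.05 * X[:, 2] \
    + 0.01 * rng.standard_normal(n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
imp = permutation_importance(model, X, y, n_repeats=20, random_state=0)
for name, score in zip(["learning_rate", "batch_size", "dropout"],
                       imp.importances_mean):
    print(f"{name:13s}: {score:.4f}")
```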
arXiv Detail & Related papers (2024-10-11T15:47:46Z)
- Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation covers tens of thousands of models trained across combinations of optimizers, parameterizations, and learning rates.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
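The paper's parameterizations and prescriptions are not reproduced here; the snippet below only shows how a learning-rate scaling exponent can be read off empirically by regressing log(best learning rate) on log(model width), using made-up sweep results.

```python
# Illustrative only: fit a learning-rate scaling exponent from hypothetical
# sweep results, i.e. the slope b in best_lr ~ width**b. The widths and best
# learning rates below are made up; the paper's parameterizations are not used.
import numpy as np

widths   = np.array([256, 512, 1024, 2048, 4096])
best_lrs = np.array([3e-3, 1.6e-3, 8e-4, 4e-4, 2e-4])   # e.g. from a grid search

b, log_a = np.polyfit(np.log(widths), np.log(best_lrs), deg=1)
print(f"fitted scaling: best_lr ~ {np.exp(log_a):.3g} * width^{b:.2f}")
# A slope near -1 would correspond to the common "lr scales as 1/width" rule.
```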
arXiv Detail & Related papers (2024-07-08T12:32:51Z)
- ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections [59.839926875976225]
We propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections.
In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters.
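ETHER's exact formulation and the ETHER+ relaxation are in the paper; the sketch below only shows the building block the name refers to: a hyperplane (Householder) reflection applied to a frozen weight matrix, with only the reflection direction treated as trainable.

```python
# Building-block sketch: apply a hyperplane (Householder) reflection
# H = I - 2 u u^T (unit vector u) to a frozen weight matrix. In a PEFT setting
# only u would be trained; this is not the full ETHER/ETHER+ method.
import torch

def reflect(weight: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    """Return (I - 2 u u^T) @ weight without materializing the d x d matrix."""
    u = u / u.norm()                       # unit normal of the hyperplane
    return weight - 2.0 * torch.outer(u, u @ weight)

d_out, d_in = 768, 768
frozen_w = torch.randn(d_out, d_in)         # pretrained weight (kept frozen)
u = torch.randn(d_out, requires_grad=True)  # the only trainable parameter here

adapted_w = reflect(frozen_w, u)
print(adapted_w.shape)                      # torch.Size([768, 768])
# Sanity check: a reflection is orthogonal, so it preserves column norms.
print(torch.allclose(adapted_w.norm(dim=0), frozen_w.norm(dim=0), atol=1e-3))
```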
arXiv Detail & Related papers (2024-05-30T17:26:02Z)
- Optimal Kernel Tuning Parameter Prediction using Deep Sequence Models [0.44998333629984877]
We propose a methodology that uses deep sequence-to-sequence models to predict the optimal tuning parameters governing compute kernels.
The proposed algorithm can achieve more than 90% accuracy on various convolutional kernels in MIOpen, the AMD machine learning primitives library.
arXiv Detail & Related papers (2024-04-15T22:25:54Z)
- Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis [51.14136878142034]
Point cloud analysis has achieved outstanding performance by transferring pre-trained point cloud models.
Existing methods for model adaptation usually update all model parameters, which is inefficient because it incurs high computational costs.
In this paper, we aim to study parameter-efficient transfer learning for point cloud analysis with an ideal trade-off between task performance and parameter efficiency.
arXiv Detail & Related papers (2024-03-03T08:25:04Z)
- Sliced gradient-enhanced Kriging for high-dimensional function approximation [2.8228516010000617]
Gradient-enhanced Kriging (GE-Kriging) is a well-established surrogate modelling technique for approximating expensive computational models.
However, it tends to become impractical for high-dimensional problems due to the size of the inherent correlation matrix.
A new method, called sliced GE-Kriging (SGE-Kriging), is developed in this paper for reducing the size of the correlation matrix.
The results show that the SGE-Kriging model features an accuracy and robustness that is comparable to the standard one but comes at much less training costs.
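Neither GE-Kriging nor the sliced variant is implemented below; the sketch shows plain Kriging with a Gaussian correlation model so that the correlation matrix, the object whose growth the paper addresses, is explicit. With gradient information included, that matrix grows from n x n to roughly n(d+1) x n(d+1).

```python
# Plain Kriging sketch with a Gaussian correlation model, shown only to make the
# n x n correlation matrix explicit; gradient-enhanced Kriging enlarges it to
# about n*(d+1) x n*(d+1), the growth the sliced variant tackles. This is not
# the paper's SGE-Kriging implementation.
import numpy as np

def corr(A, B, theta=10.0):
    """Gaussian correlation between two sets of points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-theta * d2)

rng = np.random.default_rng(1)
f = lambda x: np.sin(4 * x[:, 0]) + x[:, 1] ** 2        # toy expensive model

X = rng.uniform(0, 1, size=(30, 2))                     # n = 30 samples, d = 2
y = f(X)

R = corr(X, X) + 1e-10 * np.eye(len(X))                 # n x n correlation matrix
alpha = np.linalg.solve(R, y)                           # R^{-1} y

X_new = rng.uniform(0, 1, size=(5, 2))
y_pred = corr(X_new, X) @ alpha                         # r(x)^T R^{-1} y
print(np.column_stack([y_pred, f(X_new)]))              # prediction vs truth
```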
arXiv Detail & Related papers (2022-04-05T07:27:14Z)
- PowerGraph: Using neural networks and principal components to multivariate statistical power trade-offs [0.0]
A priori statistical power estimation for planned studies with multiple model parameters is inherently a multivariate problem.
Explicit solutions in such cases are either impractical or impossible to derive, leaving researchers with the prevailing method of simulating power.
This paper explores the efficient estimation and graphing of statistical power for a study over varying model parameter combinations.
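For context on the "prevailing method of simulating power" mentioned above, here is a minimal Monte Carlo power estimate for a single design point; the paper's contribution is estimating this efficiently over many parameter combinations, which this brute-force baseline does not attempt.

```python
# Minimal Monte Carlo power estimate for one design point (two-sample t-test,
# effect size 0.5, n = 50 per group, alpha = 0.05). This is the brute-force
# simulation baseline, not the PowerGraph method.
import numpy as np
from scipy.stats import ttest_ind

def simulated_power(effect=0.5, n=50, alpha=0.05, reps=5000, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        a = rng.standard_normal(n)              # control group
        b = rng.standard_normal(n) + effect     # treatment group with true effect
        if ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / reps

print(f"estimated power: {simulated_power():.3f}")   # roughly 0.70 here
```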
arXiv Detail & Related papers (2021-12-29T19:06:29Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
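The paper's expert construction and routing are not reproduced here; the sketch below only illustrates the general idea of conditional computation in a feed-forward layer: partition the hidden units into expert groups (here, naive contiguous slices) and evaluate only the groups a cheap router selects.

```python
# Illustrative conditional-computation sketch, not the MoEfication procedure:
# partition a ReLU feed-forward layer's hidden units into expert groups and
# evaluate only the top-k groups chosen by a cheap centroid router. The paper
# builds experts more carefully (e.g. by clustering co-activated neurons).
import torch

d_model, d_ff, n_experts, top_k = 64, 256, 8, 2
W1 = torch.randn(d_ff, d_model) / d_model ** 0.5      # pretrained FFN weights
W2 = torch.randn(d_model, d_ff) / d_ff ** 0.5
expert_slices = torch.arange(d_ff).chunk(n_experts)   # contiguous expert groups
centroids = torch.stack([W1[idx].mean(0) for idx in expert_slices])  # router

def moe_ffn(x):
    scores = centroids @ x                            # cheap expert scores
    chosen = scores.topk(top_k).indices.tolist()      # activate only top-k experts
    out = torch.zeros(d_model)
    for e in chosen:
        idx = expert_slices[e]
        h = torch.relu(W1[idx] @ x)                   # partial hidden activation
        out += W2[:, idx] @ h                         # partial output projection
    return out

def dense_ffn(x):
    return W2 @ torch.relu(W1 @ x)

x = torch.randn(d_model)
print(moe_ffn(x)[:4], dense_ffn(x)[:4])               # approximate, not identical
```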
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Experimental Investigation and Evaluation of Model-based Hyperparameter Optimization [0.3058685580689604]
This article presents an overview of theoretical and practical results for popular machine learning algorithms.
The R package mlr is used as a uniform interface to the machine learning models.
arXiv Detail & Related papers (2021-07-19T11:37:37Z)
- Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)