Power-Law Decay Loss for Large Language Model Finetuning: A Theory Perspective
- URL: http://arxiv.org/abs/2505.16900v5
- Date: Mon, 09 Jun 2025 04:44:04 GMT
- Title: Power-Law Decay Loss for Large Language Model Finetuning: A Theory Perspective
- Authors: Jintian Shao,
- Abstract summary: This paper introduces a novel loss function, Power-Law Decay Loss (PDL), specifically designed to optimize the finetuning process for text generation. PDL re-weights the contribution of each token in the standard cross-entropy loss based on its frequency in the training corpus, following a power-law decay.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: During the finetuning stage of text generation tasks, standard cross-entropy loss treats all tokens equally. This can lead models to overemphasize high-frequency, low-information tokens, neglecting lower-frequency tokens crucial for specificity and informativeness in generated content. This paper introduces a novel loss function, Power-Law Decay Loss (PDL), specifically designed to optimize the finetuning process for text generation. The core motivation for PDL stems from observations in information theory and linguistics: the informativeness of a token is often inversely proportional to its frequency of occurrence. PDL re-weights the contribution of each token in the standard cross-entropy loss based on its frequency in the training corpus, following a power-law decay. Specifically, the weights for high-frequency tokens are reduced, while low-frequency, information-dense tokens are assigned higher weights. This mechanism guides the model during finetuning to focus more on learning and generating tokens that convey specific and unique information, thereby enhancing the quality, diversity, and informativeness of the generated text. We theoretically elaborate on the motivation and construction of PDL and discuss its potential applications and advantages across various text generation finetuning tasks, such as abstractive summarization, dialogue systems, and style transfer.
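The re-weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact formula, so everything here is an assumption made for illustration: the weight form 1 / (freq + eps) ** alpha, the parameter names `alpha` and `eps`, the helper names `pdl_weights` and `weighted_cross_entropy`, and the choice to normalize the loss by the sum of weights.

```python
import math
from collections import Counter


def pdl_weights(corpus_tokens, alpha=1.0, eps=1e-9):
    """Hypothetical helper: power-law decay weight per token type.

    Assumed form: weight = 1 / (relative_frequency + eps) ** alpha.
    alpha = 0 gives every token weight 1 (plain cross-entropy).
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: 1.0 / ((c / total) + eps) ** alpha for tok, c in counts.items()}


def weighted_cross_entropy(target_tokens, target_probs, weights, default_weight=1.0):
    """Weighted mean of -log p(target token).

    target_probs[i] is the model probability assigned to target_tokens[i].
    Dividing by the sum of weights is an assumption of this sketch.
    """
    numerator = 0.0
    denominator = 0.0
    for tok, p in zip(target_tokens, target_probs):
        w = weights.get(tok, default_weight)
        numerator += w * -math.log(p)
        denominator += w
    return numerator / denominator


# Toy corpus: "the" is frequent, the other tokens are rare.
corpus = "the the the the cat sat on the mat".split()
weights = pdl_weights(corpus, alpha=1.0)

# A frequent token predicted well and a rare token predicted poorly.
targets = ["the", "cat"]
probs = [0.9, 0.1]
plain = weighted_cross_entropy(targets, probs, pdl_weights(corpus, alpha=0.0))
reweighted = weighted_cross_entropy(targets, probs, weights)
print(f"plain CE: {plain:.4f}, re-weighted CE: {reweighted:.4f}")
```

In this toy case the rare token "cat" receives a larger weight than "the", so a poor prediction on it contributes more to the loss than it would under the unweighted objective. Larger values of `alpha` sharpen that effect.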
Related papers
- IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation [70.2753541780788]
We introduce an Information Gain-based Decisiveness-aware Token handling (IGD) strategy that integrates token decisiveness into both tuning and decoding. IGD consistently improves recommendation accuracy, achieving significant gains on widely used ranking metrics compared to strong baselines.
arXiv Detail & Related papers (2025-06-16T08:28:19Z)
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued tokens based language modeling approach for text to speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment [19.48653924804823]
Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers.
However, LLM-based TTS models are not robust as the generated output can contain repeating words, missing words and mis-aligned speech.
We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text and speech alignment when trained for predicting speech tokens for a given text.
arXiv Detail & Related papers (2024-06-25T22:18:52Z)
- MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models [40.992566245706996]
We propose a MiLe Loss function for mitigating the bias of learning difficulties with tokens.
We train generative language models at different scales of 468M, 1.2B, and 6.7B parameters.
Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks.
arXiv Detail & Related papers (2023-10-30T13:33:21Z)
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
They often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z)
- Meta-Learning Online Adaptation of Language Models [88.8947656843812]
Large language models encode impressively broad world knowledge in their parameters.
However, the knowledge in static language models falls out of date, limiting the model's effective "shelf life".
arXiv Detail & Related papers (2023-05-24T11:56:20Z)
- Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning [29.617527535279574]
Video captioning aims to generate natural language sentences that describe the given video accurately.
Existing methods obtain favorable generation by exploring richer visual representations in encode phase or improving the decoding ability.
We introduce a novel Refined Semantic enhancement method towards Frequency Diffusion (RSFD), a captioning model that constantly perceives the linguistic representation of the infrequent tokens.
arXiv Detail & Related papers (2022-11-28T05:45:17Z)
- On Long-Tailed Phenomena in Neural Machine Translation [50.65273145888896]
State-of-the-art Neural Machine Translation (NMT) models struggle with generating low-frequency tokens.
We propose a new loss function, the Anti-Focal loss, to better adapt model training to the structural dependencies of conditional text generation.
We show the efficacy of the proposed technique on a number of Machine Translation (MT) datasets, demonstrating that it leads to significant gains over cross-entropy.
arXiv Detail & Related papers (2020-10-10T07:00:57Z)
- Token-level Adaptive Training for Neural Machine Translation [84.69646428587548]
There exists a token imbalance phenomenon in natural language as different tokens appear with different frequencies.
The vanilla NMT model usually adopts trivial equal-weighted objectives for target tokens with different frequencies.
Low-frequency tokens may carry critical semantic information that will affect the translation quality once they are neglected.
arXiv Detail & Related papers (2020-10-09T05:55:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.