NDP: Next Distribution Prediction as a More Broad Target
- URL: http://arxiv.org/abs/2408.17377v1
- Date: Fri, 30 Aug 2024 16:13:49 GMT
- Title: NDP: Next Distribution Prediction as a More Broad Target
- Authors: Junhao Ruan, Abudukeyumu Abudula, Xinyu Liu, Bei Li, Yinqiao Li, Chenglong Wang, Yuchun Fan, Yuan Ge, Tong Xiao, Jingbo Zhu
- Abstract summary: We introduce Next Distribution Prediction (NDP), which uses $n$-gram distributions to replace the one-hot targets.
NDP achieves up to a +2.97 COMET improvement on translation tasks, a +0.61 average improvement on general tasks, and a +10.75 average improvement in the medical domain.
- Score: 59.30497395313209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) trained on next-token prediction (NTP) paradigm have demonstrated powerful capabilities. However, the existing NTP paradigm contains several limitations, particularly related to planned task complications and error propagation during inference. In our work, we extend the critique of NTP, highlighting its limitation also due to training with a narrow objective: the prediction of a sub-optimal one-hot distribution. To support this critique, we conducted a pre-experiment treating the output distribution from powerful LLMs as efficient world data compression. By evaluating the similarity between the $n$-gram distribution and the one-hot distribution with LLMs, we observed that the $n$-gram distributions align more closely with the output distribution of LLMs. Based on this insight, we introduce Next Distribution Prediction (NDP), which uses $n$-gram distributions to replace the one-hot targets, enhancing learning without extra online training time. We conducted experiments across translation, general task, language transfer, and medical domain adaptation. Compared to NTP, NDP can achieve up to +2.97 COMET improvement in translation tasks, +0.61 average improvement in general tasks, and incredible +10.75 average improvement in the medical domain. This demonstrates the concrete benefits of addressing the target narrowing problem, pointing to a new direction for future work on improving NTP.
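The abstract describes the core mechanism as replacing the one-hot next-token target with an $n$-gram distribution and training against that soft target. Below is a minimal sketch of that idea, assuming corpus-level $n$-gram counts and a one-hot fallback for unseen contexts; all function names and the fallback rule are illustrative assumptions, not the paper's reference implementation.

```python
from collections import Counter, defaultdict

import torch
import torch.nn.functional as F


def build_ngram_counts(corpus_ids, n=2):
    """Count, for each (n-1)-token context, how often each next token follows it."""
    counts = defaultdict(Counter)
    for seq in corpus_ids:
        for i in range(len(seq) - n + 1):
            context = tuple(seq[i : i + n - 1])
            counts[context][seq[i + n - 1]] += 1
    return counts


def soft_target(counts, context, vocab_size, gold_id):
    """Normalize the counts for one context into a distribution over the vocabulary."""
    dist = torch.zeros(vocab_size)
    c = counts.get(tuple(context))
    if c is None:  # unseen context: assumed fallback to the ordinary one-hot target
        dist[gold_id] = 1.0
        return dist
    for tok, cnt in c.items():
        dist[tok] = float(cnt)
    return dist / dist.sum()


def soft_cross_entropy(logits, targets):
    """Cross-entropy between model logits and the soft n-gram targets."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()
```

In a setup like this the soft targets can be precomputed offline from corpus statistics, which is consistent with the abstract's claim that learning is enhanced without extra online training time.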
Related papers
- A Bayesian Approach to Data Point Selection [24.98069363998565]
Data point selection (DPS) is becoming a critical topic in deep learning.
Existing approaches to DPS are predominantly based on a bi-level optimisation (BLO) formulation.
We propose a novel Bayesian approach to DPS.
arXiv Detail & Related papers (2024-11-06T09:04:13Z)
- Correlation and Navigation in the Vocabulary Key Representation Space of Language Models [33.747872934103334]
We study the effect of the key distribution on the NTP distribution.
We show that in the NTP distribution, the few top-ranked tokens are typically accurate.
We extend our method to open-ended and chain-of-thought (for reasoning) generation.
arXiv Detail & Related papers (2024-10-03T08:07:55Z)
- VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [66.80143024475635]
We propose VinePPO, a straightforward approach to compute unbiased Monte Carlo-based estimates.
We show that VinePPO consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets.
arXiv Detail & Related papers (2024-10-02T15:49:30Z)
- Lessons Learned from a Unifying Empirical Study of Parameter-Efficient Transfer Learning (PETL) in Visual Recognition [36.031972728327894]
We study representative PETL methods in the context of Vision Transformers.
Different PETL methods obtain similar accuracy on the low-shot benchmark VTAB-1K.
PETL is also useful in many-shot regimes, achieving accuracy comparable to, and sometimes better than, full fine-tuning (FT).
arXiv Detail & Related papers (2024-09-24T19:57:40Z)
- Distribution Shift Inversion for Out-of-Distribution Prediction [57.22301285120695]
We propose a portable Distribution Shift Inversion algorithm for Out-of-Distribution (OoD) prediction.
We show that our method provides a general performance gain when plugged into a wide range of commonly used OoD algorithms.
arXiv Detail & Related papers (2023-06-14T08:00:49Z)
- PDE+: Enhancing Generalization via PDE with Adaptive Distributional Diffusion [66.95761172711073]
The generalization of neural networks is a central challenge in machine learning.
We propose to enhance it directly through the underlying function of neural networks, rather than focusing on adjusting input data.
We put this theoretical framework into practice as $\textbf{PDE+}$ ($\textbf{PDE}$ with $\textbf{A}$daptive $\textbf{D}$istributional $\textbf{D}$iffusion).
arXiv Detail & Related papers (2023-05-25T08:23:26Z)
- Distributed NLI: Learning to Predict Human Opinion Distributions for Language Reasoning [76.17436599516074]
We introduce distributed NLI, a new NLU task whose goal is to predict the distribution of human judgements for natural language inference.
We show that models can capture the human judgement distribution by applying additional distribution estimation methods, namely Monte Carlo (MC) Dropout, Deep Ensemble, Re-Calibration, and Distribution Distillation (see the MC Dropout sketch after this list).
arXiv Detail & Related papers (2021-04-18T01:25:19Z)
- Mind the Trade-off: Debiasing NLU Models without Degrading the In-distribution Performance [70.31427277842239]
We introduce a novel debiasing method called confidence regularization.
It discourages models from exploiting biases while enabling them to receive enough incentive to learn from all the training examples.
We evaluate our method on three NLU tasks and show that, in contrast to its predecessors, it improves the performance on out-of-distribution datasets.
arXiv Detail & Related papers (2020-05-01T11:22:55Z)
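As referenced in the Distributed NLI entry above, one of the listed distribution-estimation methods is Monte Carlo (MC) Dropout: keep dropout active at inference, run several stochastic forward passes, and average the softmax outputs. A minimal generic sketch follows; the classifier architecture, feature dimensions, and number of passes are illustrative assumptions, not that paper's setup.

```python
import torch
import torch.nn as nn


class NLIClassifier(nn.Module):
    """Toy NLI head over precomputed sentence-pair features (illustrative only)."""

    def __init__(self, feature_dim=768, num_labels=3, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, feature_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(feature_dim, num_labels),
        )

    def forward(self, x):
        return self.net(x)


@torch.no_grad()
def mc_dropout_distribution(model, features, num_passes=20):
    """Average softmax outputs over several stochastic forward passes."""
    model.train()  # keep dropout active so each pass samples a different sub-network
    probs = torch.stack(
        [torch.softmax(model(features), dim=-1) for _ in range(num_passes)]
    )
    return probs.mean(dim=0)
```

Comparing the averaged label distribution against the empirical distribution of annotator judgements (for example with KL divergence) is the kind of evaluation that entry describes.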