Training Dynamics for Text Summarization Models
- URL: http://arxiv.org/abs/2110.08370v1
- Date: Fri, 15 Oct 2021 21:13:41 GMT
- Title: Training Dynamics for Text Summarization Models
- Authors: Tanya Goyal, Jiacheng Xu, Junyi Jessy Li, Greg Durrett
- Abstract summary: We analyze the training dynamics for generation models, focusing on news summarization.
Across different datasets (CNN/DM, XSum, MediaSum) and summary properties, we study what the model learns at different stages of its fine-tuning process.
We find that properties such as copy behavior are learnt earlier in the training process and these observations are robust across domains.
On the other hand, factual errors, such as hallucination of unsupported facts, are learnt in the later stages, and this behavior is more varied across domains.
- Score: 45.62439188988816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (e.g. BART) have shown impressive results when
fine-tuned on large summarization datasets. However, little is understood about
this fine-tuning process, including what knowledge is retained from
pre-training models or how content selection and generation strategies are
learnt across iterations. In this work, we analyze the training dynamics for
generation models, focusing on news summarization. Across different datasets
(CNN/DM, XSum, MediaSum) and summary properties, such as abstractiveness and
hallucination, we study what the model learns at different stages of its
fine-tuning process. We find that properties such as copy behavior are learnt
earlier in the training process and these observations are robust across
domains. On the other hand, factual errors, such as hallucination of
unsupported facts, are learnt in the later stages, and this behavior is more
varied across domains. Based on these observations, we explore complementary
approaches for modifying training: first, disregarding high-loss tokens that
are challenging to learn and second, disregarding low-loss tokens that are
learnt very quickly. This simple training modification allows us to configure
our model to achieve different goals, such as improving factuality or improving
abstractiveness.
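The loss-based training modification described above can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' exact implementation: it assumes a Hugging Face BART checkpoint and an illustrative drop fraction `frac`, computes per-token cross-entropy without reduction, and then masks out either the highest-loss tokens (hard to learn, e.g. unsupported facts) or the lowest-loss tokens (learnt very quickly, e.g. copied spans) before averaging the loss.

```python
# Hypothetical sketch of loss-based token masking for summarization fine-tuning.
# Model checkpoint, drop fraction, and selection strategy are illustrative only.
import torch
import torch.nn.functional as F
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")


def masked_summarization_loss(article, summary, mode="drop_high", frac=0.1):
    """Cross-entropy over summary tokens, ignoring a fraction of tokens
    chosen by their per-token loss (mode: 'drop_high' or 'drop_low')."""
    inputs = tokenizer(article, return_tensors="pt", truncation=True)
    labels = tokenizer(summary, return_tensors="pt", truncation=True).input_ids

    outputs = model(**inputs, labels=labels)
    logits = outputs.logits  # shape: (1, target_len, vocab_size)

    # Per-token negative log-likelihood, no reduction.
    token_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none"
    )

    # Select the `frac` highest- or lowest-loss tokens to drop from the objective.
    k = max(1, int(frac * token_loss.numel()))
    if mode == "drop_high":   # ignore hardest tokens (e.g. unsupported facts)
        _, drop_idx = token_loss.topk(k, largest=True)
    else:                     # ignore easiest tokens (e.g. copied spans)
        _, drop_idx = token_loss.topk(k, largest=False)

    keep_mask = torch.ones_like(token_loss)
    keep_mask[drop_idx] = 0.0
    return (token_loss * keep_mask).sum() / keep_mask.sum()
```

Under this sketch, dropping high-loss tokens would be expected to steer the model toward easier, better-supported content (improving factuality), while dropping low-loss tokens would be expected to discourage copy-heavy behavior (improving abstractiveness), in line with the goals stated in the abstract.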
Related papers
- Adaptation Odyssey in LLMs: Why Does Additional Pretraining Sometimes Fail to Improve? [19.34040322172224]
We show that training a model on a text domain could degrade its perplexity on the test portion of the same domain.
Our findings will guide us in determining when to adapt a model versus when to rely on its foundational capabilities.
arXiv Detail & Related papers (2024-10-08T00:37:16Z)
- EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training [79.96741042766524]
We reformulate the training curriculum as a soft-selection function.
We show that exposing the contents of natural images can be readily achieved by modulating the intensity of data augmentation.
The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective.
arXiv Detail & Related papers (2024-05-14T17:00:43Z)
- Unlearning Traces the Influential Training Data of Language Models [31.33791825286853]
This paper presents UnTrac, which traces the influence of a training dataset on the model's performance by unlearning it from the trained model.
We propose a more scalable approach, UnTrac-Inv, which unlearns a test dataset and evaluates the unlearned model on training datasets.
arXiv Detail & Related papers (2024-01-26T23:17:31Z)
- Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient based learning method, named Projected-Gradient Unlearning (PGU).
We provide empirical evidence to demonstrate that our unlearning method can produce models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Overcoming Catastrophic Forgetting beyond Continual Learning: Balanced Training for Neural Machine Translation [15.309573393914462]
Neural networks tend to forget the previously learned knowledge when learning multiple tasks sequentially from dynamic data distributions.
This problem is called catastrophic forgetting, which is a fundamental challenge in the continual learning of neural networks.
We propose Complementary Online Knowledge Distillation (COKD), which uses dynamically updated teacher models trained on specific data orders to iteratively provide complementary knowledge to the student model.
arXiv Detail & Related papers (2022-03-08T08:08:45Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- Efficient Learning of Model Weights via Changing Features During Training [0.0]
We propose a machine learning model, which dynamically changes the features during training.
Our main motivation is to update a small part of the model during the training process by replacing less descriptive features with new ones from a large pool.
arXiv Detail & Related papers (2020-02-21T12:38:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.