Pre-training Language Model as a Multi-perspective Course Learner
- URL: http://arxiv.org/abs/2305.03981v1
- Date: Sat, 6 May 2023 09:02:10 GMT
- Title: Pre-training Language Model as a Multi-perspective Course Learner
- Authors: Beiduo Chen, Shaohan Huang, Zihan Zhang, Wu Guo, Zhenhua Ling, Haizhen
Huang, Furu Wei, Weiwei Deng and Qi Zhang
- Abstract summary: This study proposes a multi-perspective course learning (MCL) method for sample-efficient pre-training.
In this study, three self-supervision courses are designed to alleviate inherent flaws of MLM and balance labels, two self-correction courses bridge the generator and discriminator, and a course soups trial tackles the "tug-of-war" dynamics.
Our method significantly improves ELECTRA's average performance by 2.8% and 3.2% absolute points respectively on GLUE and SQuAD 2.0 benchmarks.
- Score: 103.17674402415582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: ELECTRA, the generator-discriminator pre-training framework, has achieved
impressive semantic construction capability among various downstream tasks.
Despite the convincing performance, ELECTRA still faces the challenges of
monotonous training and deficient interaction. A generator trained only with
masked language modeling (MLM) leads to biased learning and label imbalance for
the discriminator, decreasing learning efficiency; the lack of an explicit
feedback loop from discriminator to generator leaves a chasm between the two
components, underutilizing the course learning. In this study, a
multi-perspective course learning (MCL) method is proposed to provide multiple
learning perspectives for sample-efficient pre-training and to fully leverage
the relationship between generator and discriminator. Concretely, three
self-supervision courses are designed to alleviate the inherent flaws of MLM
and balance the labels in a multi-perspective way. Besides, two self-correction
courses are proposed to bridge the chasm between the two encoders by creating a
"correction notebook" for secondary supervision. Moreover, a course soups trial
is conducted to solve the "tug-of-war" dynamics problem of MCL, yielding a
stronger pre-trained
model. Experimental results show that our method significantly improves
ELECTRA's average performance by 2.8% and 3.2% absolute points respectively on
GLUE and SQuAD 2.0 benchmarks, and overshadows recent advanced ELECTRA-style
models under the same settings. The pre-trained MCL model is available at
https://huggingface.co/McmanusChen/MCL-base.
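For orientation, the following is a minimal sketch of the generator-discriminator objective (MLM plus replaced token detection) that the MCL courses build on; the model interfaces, masking rate, and loss weighting are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def electra_style_step(generator, discriminator, input_ids,
                       mask_prob=0.15, mask_token_id=103):
    """One pre-training step of the generator-discriminator objective.

    `generator(ids)` is assumed to return token logits [batch, seq, vocab];
    `discriminator(ids)` is assumed to return per-token logits [batch, seq].
    """
    # Mask a random subset of positions for the generator's MLM course.
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    masked_ids = input_ids.masked_fill(mask, mask_token_id)

    # Generator: recover the original tokens at masked positions (MLM loss).
    gen_logits = generator(masked_ids)
    mlm_loss = F.cross_entropy(gen_logits[mask], input_ids[mask])

    # Build the corrupted sequence by sampling the generator's predictions.
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits[mask]).sample()
    corrupted = input_ids.clone()
    corrupted[mask] = sampled

    # Discriminator: replaced token detection (RTD) over every position.
    rtd_labels = (corrupted != input_ids).float()
    rtd_loss = F.binary_cross_entropy_with_logits(discriminator(corrupted), rtd_labels)

    # MCL layers additional self-supervision / self-correction courses on top
    # of this two-part objective; the weighting here is a generic placeholder.
    return mlm_loss + 50.0 * rtd_loss
```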
Related papers
- Light-Weight Fault Tolerant Attention for Large Language Model Training [14.178223242134166]
Large Language Models (LLMs) have demonstrated remarkable performance in various natural language processing tasks.
LLM training is susceptible to faults, particularly in the attention mechanism, a critical component of transformer-based LLMs.
We propose ATTNChecker, the first Algorithm-Based Fault Tolerance (ABFT) technique tailored for the attention mechanism in LLMs.
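As a hedged illustration of the checksum idea behind algorithm-based fault tolerance (a generic sketch, not ATTNChecker's actual scheme for attention), a matrix product can be verified against a redundant sum:

```python
import torch

def abft_matmul(a: torch.Tensor, b: torch.Tensor, tol: float = 1e-3) -> torch.Tensor:
    """Compute a @ b and verify it with a column-sum checksum.

    Relies on the identity sum_rows(a @ b) == sum_rows(a) @ b; a transient
    bit-flip corrupting the product will (with high probability) break it.
    """
    c = a @ b
    check_direct = c.sum(dim=0)          # column sums of the computed product
    check_encoded = a.sum(dim=0) @ b     # the same quantity via the checksum row
    if not torch.allclose(check_direct, check_encoded, atol=tol):
        raise RuntimeError("ABFT checksum mismatch: possible fault in the matmul")
    return c

# In an attention layer, the same check could wrap Q @ K^T and attn @ V.
```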
arXiv Detail & Related papers (2024-10-15T15:52:45Z) - LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement [93.38736019287224]
"LLMs-as-Instructors" framework autonomously enhances the training of smaller target models.
Inspired by the theory of "Learning from Errors", this framework employs an instructor LLM to meticulously analyze the specific errors within a target model.
Within this framework, we implement two strategies: "Learning from Error," which focuses solely on incorrect responses to tailor training data, and "Learning from Error by Contrast", which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors.
arXiv Detail & Related papers (2024-06-29T17:16:04Z) - Mitigating Social Biases in Language Models through Unlearning [16.166946020697203]
Mitigating bias in language models (LMs) has become a critical problem due to the widespread deployment of LMs.
We explore two unlearning methods: (1) Partitioned Contrastive Gradient Unlearning (PCGU) applied to decoder models and (2) Negation via Task Vector.
On LLaMA-2 7B, Negation via Task Vector reduces the bias score by 11.8%.
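A minimal sketch of the task-vector negation idea, assuming two compatible state dicts and an illustrative scaling factor; this is a generic picture of the technique, not the paper's exact procedure:

```python
import torch

def negate_task_vector(base_state: dict, biased_state: dict, alpha: float = 1.0) -> dict:
    """Return edited weights: theta_base - alpha * (theta_biased - theta_base).

    `biased_state` is assumed to come from fine-tuning the base model on the
    behavior to be unlearned; alpha is an illustrative scaling factor.
    """
    edited = {}
    for name, w_base in base_state.items():
        task_vector = biased_state[name] - w_base    # delta learned on the unwanted behavior
        edited[name] = w_base - alpha * task_vector  # negate it to suppress that behavior
    return edited

# Usage (hypothetical): model.load_state_dict(negate_task_vector(base_sd, biased_sd, 0.5))
```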
arXiv Detail & Related papers (2024-06-19T13:38:34Z) - MEND: Meta dEmonstratioN Distillation for Efficient and Effective
In-Context Learning [9.271196993624944]
Large language models (LLMs) make predictions for a given test input conditioned on a few input-output pairs (demonstrations).
Existing solutions attempt to distill lengthy demonstrations into compact vectors.
We present Meta dEmonstratioN Distillation (MEND), where a language model learns to distill any lengthy demonstrations into vectors without retraining for a new downstream task.
arXiv Detail & Related papers (2024-03-11T17:03:04Z) - Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning [45.25602203155762]
Self-Supervised Contrastive Learning has proven effective in deriving high-quality representations from unlabeled data.
A major challenge that hinders both unimodal and multimodal contrastive learning is feature suppression.
We propose a novel model-agnostic Multistage Contrastive Learning framework.
arXiv Detail & Related papers (2024-02-19T04:13:33Z) - On Task Performance and Model Calibration with Supervised and
Self-Ensembled In-Context Learning [71.44986275228747]
In-context learning (ICL) has become an efficient approach propelled by the recent advancements in large language models (LLMs).
However, both paradigms are prone to the critical problem of overconfidence (i.e., miscalibration).
arXiv Detail & Related papers (2023-12-21T11:55:10Z) - Learning in Imperfect Environment: Multi-Label Classification with
Long-Tailed Distribution and Partial Labels [53.68653940062605]
We introduce a novel task, Partial labeling and Long-Tailed Multi-Label Classification (PLT-MLC).
We find that most LT-MLC and PL-MLC approaches fail to solve the PLT-MLC problem.
We propose an end-to-end learning framework: COrrection → ModificatIon → balanCe.
arXiv Detail & Related papers (2023-04-20T20:05:08Z) - CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but the further adaptation of CLIP on downstream tasks undesirably degrades OOD performances.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z) - Improving Rare Word Recognition with LM-aware MWER Training [50.241159623691885]
We introduce LMs into the training of hybrid autoregressive transducer (HAT) models within the discriminative training framework.
For the shallow fusion setup, we use LMs during both hypotheses generation and loss computation, and the LM-aware MWER-trained model achieves 10% relative improvement.
For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner.
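A rough sketch of shallow-fusion rescoring with a single global LM weight (the paper's rescoring setup instead learns per-token fusion weights, which this simplification omits):

```python
from typing import List, Tuple

def rescore_nbest(nbest: List[Tuple[str, float, float]], lam: float = 0.3) -> str:
    """Each hypothesis is (text, asr_logprob, lm_logprob); pick the best fused score.

    Fused score: log p_ASR(y|x) + lam * log p_LM(y).
    """
    return max(nbest, key=lambda h: h[1] + lam * h[2])[0]

# Example: rescore_nbest([("he read", -4.1, -2.0), ("he red", -3.9, -5.5)]) -> "he read"
```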
arXiv Detail & Related papers (2022-04-15T17:19:41Z) - On Minimum Word Error Rate Training of the Hybrid Autoregressive
Transducer [40.63693071222628]
We study minimum word error rate (MWER) training of the Hybrid Autoregressive Transducer (HAT).
From experiments with around 30,000 hours of training data, we show that MWER training can improve the accuracy of HAT models.
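A minimal sketch of the MWER objective over an n-best list, assuming hypothesis scores and word-error counts are already computed; sequence-level normalization details from HAT training are omitted:

```python
import torch

def mwer_loss(hyp_logprobs: torch.Tensor, word_errors: torch.Tensor) -> torch.Tensor:
    """Expected word errors over an n-best list.

    hyp_logprobs: [n] model log-scores of the n-best hypotheses for one utterance.
    word_errors:  [n] edit distances of each hypothesis to the reference transcript.
    """
    posteriors = torch.softmax(hyp_logprobs, dim=0)   # renormalize over the n-best list
    errors = word_errors.float()
    # Subtracting the mean error is a common variance-reduction baseline.
    return torch.sum(posteriors * (errors - errors.mean()))
```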
arXiv Detail & Related papers (2020-10-23T21:16:30Z)