Related papers: Accurate Knowledge Distillation with n-best Reranking

Accurate Knowledge Distillation with n-best Reranking

URL: http://arxiv.org/abs/2305.12057v4
Date: Wed, 12 Jun 2024 18:28:01 GMT
Title: Accurate Knowledge Distillation with n-best Reranking
Authors: Hendra Setiawan,
Abstract summary: We propose utilizing n-best reranking to enhance Sequence-Level Knowledge Distillation (Kim and Rush, 2016) We leverage a diverse set of models with different inductive biases, objective functions or architectures, including some publicly-available large language models, to pick the highest-quality hypotheses as labels. Our results demonstrate that utilizing pseudo-labels generated by our n-best reranker leads to a significantly more accurate student model.
Score: 2.9526110883017433
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose utilizing n-best reranking to enhance Sequence-Level Knowledge Distillation (Kim and Rush, 2016) where we extract pseudo-labels for student model's training data from top n-best hypotheses and leverage a diverse set of models with different inductive biases, objective functions or architectures, including some publicly-available large language models, to pick the highest-quality hypotheses as labels. The effectiveness of our proposal is validated through experiments on the WMT'21 German-English and Chinese-English translation tasks. Our results demonstrate that utilizing pseudo-labels generated by our n-best reranker leads to a significantly more accurate student model. In fact, our best student model achieves comparable accuracy to a large translation model from (Tran et al., 2021) with 4.7 billion parameters, while having two orders of magnitude fewer parameters.

Related papers

FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition [12.125413756152833]
We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts.<n>Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotate them with multilingual LLMs.<n>Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero shot transfer settings.
arXiv Detail & Related papers (2025-12-15T20:36:39Z)
Logits-Based Finetuning [48.18151583153572]
We propose a logits-based fine-tuning framework that integrates the strengths of supervised learning and knowledge distillation.<n>Our approach constructs enriched training targets by combining teacher logits with ground truth labels, preserving both correctness and linguistic diversity.
arXiv Detail & Related papers (2025-05-30T10:57:09Z)
Matryoshka Model Learning for Improved Elastic Student Models [62.154536258259384]
MatTA is a framework for training multiple accurate Student models using a novel Teacher-TA-Student recipe.<n>We demonstrate our method on GPT-2 Medium, a public model, and achieve relative improvements of over 24% on SAT Math and over 10% on the LAMBADA benchmark.
arXiv Detail & Related papers (2025-05-29T10:54:58Z)
Language Models are Few-Shot Graders [0.12289361708127876]
We present an ASAG pipeline leveraging state-of-the-art LLMs. We compare the grading performance of three OpenAI models: GPT-4, GPT-4o, and o1-preview. Our findings indicate that providing graded examples enhances grading accuracy, with RAG-based selection outperforming random selection.
arXiv Detail & Related papers (2025-02-18T23:38:21Z)
Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors. We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor. Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios. We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples. Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators [45.49880507108965]
"GenTranslate" builds upon large language models to generate better results from diverse translation versions in N-best list. Our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result.
arXiv Detail & Related papers (2024-02-10T07:20:49Z)
Uncertainty-aware Parameter-Efficient Self-training for Semi-supervised Language Understanding [38.11411155621616]
We study self-training as one of the predominant semi-supervised learning approaches. We present UPET, a novel Uncertainty-aware self-Training framework. We show that UPET achieves a substantial improvement in terms of performance and efficiency.
arXiv Detail & Related papers (2023-10-19T02:18:29Z)
A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models [87.7086269902562]
We show that subword-based models might still be the most practical choice in many settings. We encourage future work in tokenizer-free methods to consider these factors when designing and evaluating new models.
arXiv Detail & Related papers (2022-10-13T15:47:09Z)
A Lagrangian Duality Approach to Active Learning [119.36233726867992]
We consider the batch active learning problem, where only a subset of the training data is labeled. We formulate the learning problem using constrained optimization, where each constraint bounds the performance of the model on labeled samples. We show, via numerical experiments, that our proposed approach performs similarly to or better than state-of-the-art active learning methods.
arXiv Detail & Related papers (2022-02-08T19:18:49Z)
An Application of Pseudo-Log-Likelihoods to Natural Language Scoring [5.382454613390483]
A language model with relatively few parameters and training steps can outperform it on a recent large data set. We produce some absolute state-of-the-art results for common sense reasoning in binary choice tasks. We argue that robustness of the smaller model ought to be understood in terms of compositionality.
arXiv Detail & Related papers (2022-01-23T22:00:54Z)
Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs) We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z)
Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models. We also observe span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
A Systematic Evaluation of Transfer Learning and Pseudo-labeling with BERT-based Ranking Models [2.0498977512661267]
We evaluate transferability of BERT-based neural ranking models across five English datasets. Each of our collections has a substantial number of queries, which enables a full-shot evaluation mode. We find that training on pseudo-labels can produce a competitive or better model compared to transfer learning.
arXiv Detail & Related papers (2021-03-04T21:08:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.