Related papers: Targeted Error Correction in Knowledge Distillation: Small Language Models Surpass GPT

Targeted Error Correction in Knowledge Distillation: Small Language Models Surpass GPT

URL: http://arxiv.org/abs/2511.03005v1
Date: Tue, 04 Nov 2025 21:17:49 GMT
Title: Targeted Error Correction in Knowledge Distillation: Small Language Models Surpass GPT
Authors: Hee-Jin Lee, Zhen Guo, Luchao Jin, Morteza Moazami Goudarzi,
Abstract summary: We introduce an Analyze-Revise-Finetune (ARF) pipeline that enables smaller open-source language models (LLMs) to surpass substantially larger proprietary models in customer service summarization tasks.<n>The pipeline first analyzes and categorizes common errors in summaries produced by a teacher model (GPT-3.5), then performs a targeted revision using a compact editor model (Llama 3.1 70B) to generate high-quality, refined training data.<n>Fine-tuning a smaller student model (Llama 3.1 8B) on this refined data resulted in superior summarization performance compared to GPT-3.5.
Score: 9.303322147751908
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We introduce an Analyze-Revise-Finetune (ARF) pipeline that enables smaller open-source language models (LLMs) to surpass substantially larger proprietary models in customer service summarization tasks. The pipeline first analyzes and categorizes common errors in summaries produced by a teacher model (GPT-3.5), then performs a targeted revision using a compact editor model (Llama 3.1 70B) to generate high-quality, refined training data. Fine-tuning a smaller student model (Llama 3.1 8B) on this refined data resulted in superior summarization performance compared to GPT-3.5. The ARF pipeline improves cost efficiency and data privacy while maintaining competitive accuracy, illustrating a generalizable framework for enhancing open-source LLMs across diverse downstream applications.

Related papers

Data Distribution as a Lever for Guiding Optimizers Toward Superior Generalization in LLMs [60.68927774057402]
We show, for the first time, that a lower simplicity bias induces a better generalization.<n>Motivated by this insight, we demonstrate that the training data distribution by upsampling or augmenting examples learned later in training similarly reduces SB and leads to improved generalization.<n>Our strategy improves the performance of multiple language models including Phi2-2.7B, Llama3.2-1B, Gemma3-1B-PT, Qwen3-0.6B-Base-achieving relative accuracy gains up to 18% when fine-tuned with AdamW and Muon.
arXiv Detail & Related papers (2026-01-31T07:40:36Z)
One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning [54.580646706013965]
Reward models (RMs) play a critical role in aligning large language models with human preferences.<n>We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios.<n>To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling.
arXiv Detail & Related papers (2025-10-30T06:08:27Z)
Learning from Generalization Patterns: An Evaluation-Driven Approach to Enhanced Data Augmentation for Fine-Tuning Small Language Models [16.470481192733676]
PaDA-Agent is an evaluation-driven approach that streamlines the data augmentation process for SLMs.<n>Our experimental results demonstrate significant improvements over state-of-the-art LLM-based data augmentation approaches for Llama 3.2 1B Instruct model fine-tuning.
arXiv Detail & Related papers (2025-10-20T22:36:46Z)
CustomIR: Unsupervised Fine-Tuning of Dense Embeddings for Known Document Corpora [0.0]
CustomIR is a framework for unsupervised adaptation of language embedding models to domain-specific corpora.<n>Our experiments show that CustomIR consistently improves retrieval effectiveness with small models gaining up to 2.3 points in Recall@10.<n>These results highlight that targeted synthetic fine-tuning offers a scalable and cost-efficient strategy for increasing domain-specific performance.
arXiv Detail & Related papers (2025-09-30T00:25:47Z)
SeaPO: Strategic Error Amplification for Robust Preference Optimization of Large Language Models [25.689746306171276]
We introduce SeaPO, a Strategic Error Amplification method that leverages three error types commonly occurring in large language models.<n>We show that SeaPO significantly improved overall model performance, particularly in terms of truthfulness.<n>Injecting the most common error types improves performance in related tasks, while a mix of error types leads to a broader performance enhancement.
arXiv Detail & Related papers (2025-09-29T13:42:41Z)
Enhancement Report Approval Prediction: A Comparative Study of Large Language Models [10.243182983724585]
Enhancement reports (ERs) serve as a critical communication channel between users and developers, capturing valuable suggestions for software improvement.<n>To address this challenge, enhancement report approval prediction (ERAP) has emerged as a research focus.<n>Recent advancements in large language models (LLM) present new opportunities for enhancing prediction accuracy.
arXiv Detail & Related papers (2025-06-18T03:08:04Z)
KARE-RAG: Knowledge-Aware Refinement and Enhancement for RAG [63.82127103851471]
Retrieval-Augmented Generation (RAG) enables large language models to access broader knowledge sources.<n>We demonstrate that enhancing generative models' capacity to process noisy content is equally critical for robust performance.<n>We present KARE-RAG, which improves knowledge utilization through three key innovations.
arXiv Detail & Related papers (2025-06-03T06:31:17Z)
Language Models are Few-Shot Graders [0.12289361708127876]
We present an ASAG pipeline leveraging state-of-the-art LLMs.<n>We compare the grading performance of three OpenAI models: GPT-4, GPT-4o, and o1-preview.<n>Our findings indicate that providing graded examples enhances grading accuracy, with RAG-based selection outperforming random selection.
arXiv Detail & Related papers (2025-02-18T23:38:21Z)
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data. We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.<n>LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.<n>Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
Federated Learning with Projected Trajectory Regularization [65.6266768678291]
Federated learning enables joint training of machine learning models from distributed clients without sharing their local data. One key challenge in federated learning is to handle non-identically distributed data across the clients. We propose a novel federated learning framework with projected trajectory regularization (FedPTR) for tackling the data issue.
arXiv Detail & Related papers (2023-12-22T02:12:08Z)
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM. We find that rejection samples from multiple models push LLaMA-7B to an accuracy of 49.3% on GSM8K which outperforms the supervised fine-tuning (SFT) accuracy of 35.9% significantly.
arXiv Detail & Related papers (2023-08-03T15:34:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.