Knowledge Distillation of Domain-adapted LLMs for Question-Answering in Telecom
- URL: http://arxiv.org/abs/2504.20000v1
- Date: Mon, 28 Apr 2025 17:19:25 GMT
- Title: Knowledge Distillation of Domain-adapted LLMs for Question-Answering in Telecom
- Authors: Rishika Sen, Sujoy Roychowdhury, Sumit Soman, H. G. Ranjani, Srikhetra Mohanty,
- Abstract summary: Knowledge Distillation (KD) is one of approaches to reduce the size of Large Language Models (LLMs)<n>For domain-specific tasks, it is not clear if teacher or student model, or both, must be considered for domain adaptation.<n>We design experiments to study the impact of vocabulary (same and different) and KD algorithms (vanilla KD and Dual Space KD, DSKD) on the distilled model.
- Score: 0.6897286554827872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge Distillation (KD) is one of the approaches to reduce the size of Large Language Models (LLMs). A LLM with smaller number of model parameters (student) is trained to mimic the performance of a LLM of a larger size (teacher model) on a specific task. For domain-specific tasks, it is not clear if teacher or student model, or both, must be considered for domain adaptation. In this work, we study this problem from perspective of telecom domain Question-Answering (QA) task. We systematically experiment with Supervised Fine-tuning (SFT) of teacher only, SFT of student only and SFT of both prior to KD. We design experiments to study the impact of vocabulary (same and different) and KD algorithms (vanilla KD and Dual Space KD, DSKD) on the distilled model. Multi-faceted evaluation of the distillation using 14 different metrics (N-gram, embedding and LLM-based metrics) is considered. Experimental results show that SFT of teacher improves performance of distilled model when both models have same vocabulary, irrespective of algorithm and metrics. Overall, SFT of both teacher and student results in better performance across all metrics, although the statistical significance of the same depends on the vocabulary of the teacher models.
Related papers
- Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking [56.46309219272326]
For large language models (LLMs), classification via supervised fine-tuning (SFT) predicts ''yes'' (resp. ''no'') token for relevant (resp. irrelevant) pairs.<n>This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference?<n>We conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking the universal multimodal retrieval (UMR) as the experimental playground.
arXiv Detail & Related papers (2025-10-16T16:02:27Z) - Distillation of Large Language Models via Concrete Score Matching [28.320219993420434]
Large language models (LLMs) deliver remarkable performance but are costly to deploy, motivating knowledge distillation (KD) for efficient inference.<n>We propose a discrete score-matching objective that overcomes both softmax-induced smoothing and restrictions on the optimal solution set.<n> Experiments show that CSD consistently surpasses recent KD objectives, achieves favorable fidelity-diversity trade-offs, and yields complementary gains when combined with on-policy techniques.
arXiv Detail & Related papers (2025-09-30T06:21:28Z) - An Empirical Study of Knowledge Distillation for Code Understanding Tasks [19.64130505527951]
Knowledge distillation (KD) addresses limitations by transferring knowledge from large teacher models to compact student models.<n>This paper systematically investigates the effectiveness and usage of KD in code understanding tasks.
arXiv Detail & Related papers (2025-08-21T10:24:48Z) - Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
Self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representation to distill from task-relevant representations only.
Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference.
Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - Mentor-KD: Making Small Language Models Better Multi-step Reasoners [15.159415340059388]
We propose Mentor-KD, which effectively distills the multi-step reasoning capability of LLMs to smaller LMs.
We exploit a mentor, intermediate-sized task-specific fine-tuned model, to augment additional CoT annotations.
We conduct extensive experiments and confirm Mentor-KD's effectiveness across various models and complex reasoning tasks.
arXiv Detail & Related papers (2024-10-11T17:53:27Z) - Learning from Committee: Reasoning Distillation from a Mixture of Teachers with Peer-Review [11.756344944226495]
We introduce a novel Fault-Aware DistIllation via Peer-Review (FAIR) approach.<n>Instead of merely obtaining rationales from teachers, our method asks teachers to identify and explain the student's mistakes.
arXiv Detail & Related papers (2024-10-04T17:59:41Z) - Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
By evaluating different benchmarks and proper strategy, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z) - Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs)<n>We re-formulate KD of LLMs into two stages: first optimizing and objective consisting of implicit reward and reverse KL divergence.<n>We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z) - Knowledge Distillation of LLM for Automatic Scoring of Science Education Assessments [4.541309099803903]
This study proposes a method for knowledge distillation (KD) of fine-tuned Large Language Models (LLMs)
We specifically target the challenge of deploying these models on resource-constrained devices.
arXiv Detail & Related papers (2023-12-26T01:24:25Z) - MiniLLM: Knowledge Distillation of Large Language Models [112.93051247165089]
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs)
We propose a KD approach that distills LLMs into smaller language models.
Our method is scalable for different model families with 120M to 13B parameters.
arXiv Detail & Related papers (2023-06-14T14:44:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.