PROD: Progressive Distillation for Dense Retrieval
- URL: http://arxiv.org/abs/2209.13335v3
- Date: Sat, 24 Jun 2023 10:04:14 GMT
- Title: PROD: Progressive Distillation for Dense Retrieval
- Authors: Zhenghao Lin, Yeyun Gong, Xiao Liu, Hang Zhang, Chen Lin, Anlei Dong,
Jian Jiao, Jingwen Lu, Daxin Jiang, Rangan Majumder, Nan Duan
- Abstract summary: A better teacher model often results in a worse student via distillation due to the non-negligible gap between teacher and student.
We propose PROD, a PROgressive Distillation method, for dense retrieval.
- Score: 65.83300173604384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is an effective way to transfer knowledge from a
strong teacher to an efficient student model. Ideally, we expect that the better
the teacher is, the better the student will be. However, this expectation does not
always hold: a better teacher model often produces a worse student via distillation
due to the non-negligible gap between teacher and student. To bridge this gap, we
propose PROD, a PROgressive Distillation method, for dense retrieval. PROD consists
of teacher progressive distillation and data progressive distillation, which
gradually improve the student. We conduct extensive experiments on five widely used
benchmarks, MS MARCO Passage, TREC Passage 19, TREC Document 19, MS MARCO Document,
and Natural Questions, where PROD achieves state-of-the-art performance among
distillation methods for dense retrieval. The code and models will be released.
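The abstract describes PROD only at a high level. As a rough illustration, here is a minimal sketch of how a teacher progression and a data progression could be combined in a distillation loop; it is not the authors' released code, and the teacher list, the data stages, and the temperature-scaled KL distillation loss are assumptions made for illustration.

```python
# Minimal sketch of progressive distillation (not the official PROD code).
# Assumptions: a list of progressively stronger teachers, a list of
# progressively harder data stages, and a temperature-scaled KL loss.
import torch
import torch.nn.functional as F


def distill_step(student, teacher, batch, optimizer, temperature=2.0):
    """One distillation step: the student mimics the teacher's soft scores."""
    with torch.no_grad():
        teacher_scores = teacher(batch)   # e.g. query-passage relevance scores
    student_scores = student(batch)
    loss = F.kl_div(
        F.log_softmax(student_scores / temperature, dim=-1),
        F.softmax(teacher_scores / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def progressive_distillation(student, teachers, data_stages, optimizer):
    """Pair each progressively stronger teacher with a progressively harder
    data stage so the student is never asked to close too large a gap at once."""
    for teacher, stage in zip(teachers, data_stages):
        teacher.eval()
        for batch in stage:
            distill_step(student, teacher, batch, optimizer)
    return student
```

The point of the staging is that each teacher is only slightly stronger than the current student and each data stage only slightly harder than the last, matching the abstract's goal of gradually improving the student rather than bridging the full teacher-student gap in one step.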
Related papers
- Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods.
Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions.
Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z)
- Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation [52.53446712834569]
Learning Good Teacher Matters (LGTM) is an efficient training technique for incorporating distillation influence into the teacher's learning process.
Our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.
arXiv Detail & Related papers (2023-05-16T17:50:09Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that such a standard distillation paradigm incurs a serious bias issue: popular items are recommended more heavily after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z)
- Unified and Effective Ensemble Knowledge Distillation [92.67156911466397]
Ensemble knowledge distillation can extract knowledge from multiple teacher models and encode it into a single student model.
Many existing methods learn and distill the student model on labeled data only.
We propose a unified and effective ensemble knowledge distillation method that distills a single student model from an ensemble of teacher models on both labeled and unlabeled data (a minimal sketch of this setup appears after this list).
arXiv Detail & Related papers (2022-04-01T16:15:39Z)
- Controlling the Quality of Distillation in Response-Based Network Compression [0.0]
The performance of a compressed network is governed by the quality of distillation.
For a given teacher-student pair, the quality of distillation can be improved by finding the sweet spot between batch size and number of epochs while training the teacher.
arXiv Detail & Related papers (2021-12-19T02:53:51Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Channel Distillation: Channel-Wise Attention for Knowledge Distillation [3.6269274596116476]
We propose a new distillation method, which contains two transfer distillation strategies and a loss decay strategy.
First, Channel Distillation (CD) transfers the channel information from the teacher to the student.
Second, Guided Knowledge Distillation (GKD) only enables the student to mimic the correct output of the teacher.
arXiv Detail & Related papers (2020-06-02T14:59:50Z)
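Several of the entries above are variations on the same basic distillation objective. As a companion to the ensemble knowledge distillation entry, the following is a hedged sketch of distilling a single student from an ensemble of teachers on both labeled and unlabeled data; averaging the teachers' soft labels and the fixed loss weighting are illustrative assumptions, not the cited paper's exact formulation.

```python
# Illustrative sketch of ensemble knowledge distillation on labeled and
# unlabeled data (teacher soft labels are simply averaged here; this is
# an assumption, not the cited paper's exact method).
import torch
import torch.nn.functional as F


def ensemble_soft_labels(teachers, inputs, temperature=2.0):
    """Average the teachers' temperature-softened predictions."""
    with torch.no_grad():
        probs = [F.softmax(t(inputs) / temperature, dim=-1) for t in teachers]
    return torch.stack(probs).mean(dim=0)


def ensemble_kd_loss(student, teachers, labeled, unlabeled,
                     temperature=2.0, alpha=0.5):
    """Cross-entropy on labeled data plus a KL distillation term computed
    over both labeled and unlabeled inputs against the averaged teachers."""
    x_labeled, y_labeled = labeled
    x_unlabeled = unlabeled

    # Supervised term from the ground-truth labels.
    ce = F.cross_entropy(student(x_labeled), y_labeled)

    # Distillation term over all available inputs, labeled or not.
    x_all = torch.cat([x_labeled, x_unlabeled], dim=0)
    soft_targets = ensemble_soft_labels(teachers, x_all, temperature)
    kd = F.kl_div(
        F.log_softmax(student(x_all) / temperature, dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * temperature ** 2

    return alpha * ce + (1 - alpha) * kd
```

The same skeleton can be adapted to other entries in the list, for example by restricting the distillation term to examples the teacher predicts correctly, which is roughly the guided variant mentioned under Channel Distillation.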
This list is automatically generated from the titles and abstracts of the papers on this site.