Improved Knowledge Distillation via Full Kernel Matrix Transfer
- URL: http://arxiv.org/abs/2009.14416v2
- Date: Tue, 29 Mar 2022 18:14:55 GMT
- Title: Improved Knowledge Distillation via Full Kernel Matrix Transfer
- Authors: Qi Qian, Hao Li, Juhua Hu
- Abstract summary: Knowledge distillation is an effective way for model compression in deep learning.
We decompose the original full matrix with the Nyström method.
Compared with the full matrix, the size of the partial matrix is linear in the number of examples.
- Score: 21.533095275253466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is an effective way for model compression in deep
learning. Given a large model (i.e., teacher model), it aims to improve the
performance of a compact model (i.e., student model) by transferring the
information from the teacher. Various information for distillation has been
studied. Recently, a number of works propose to transfer the pairwise
similarity between examples to distill relative information. However, most
efforts are devoted to developing different similarity measurements, while only
a small matrix over the examples within a mini-batch is transferred at each
iteration, which is inefficient for optimizing the pairwise similarity over the
whole data set. In this work, we aim to transfer the full similarity
matrix effectively. The main challenge is the size of the full matrix, which is
quadratic in the number of examples. To address this challenge, we decompose
the original full matrix with the Nyström method. By selecting appropriate
landmark points, our theoretical analysis indicates that the loss for transfer
can be further simplified. Concretely, we find that the difference between the
original full kernel matrices of the teacher and the student can be well
bounded by the difference between the corresponding partial matrices, which
consist only of similarities between the original examples and the landmark
points. Compared with the
full matrix, the size of the partial matrix is linear in the number of
examples, which improves the efficiency of optimization significantly. The
empirical study on benchmark data sets demonstrates the effectiveness of the
proposed algorithm. Code is available at https://github.com/idstcv/KDA.
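To make the transfer concrete: the sketch below is our own minimal illustration (not the KDA implementation; variable names like embed_t and embed_s are placeholders, and the uniform landmark sampling stands in for the paper's selection strategy). It compares the full kernel-matrix gap between teacher and student with the gap between the partial, examples-by-landmarks matrices under a linear kernel.

```python
import numpy as np

def linear_kernel(A, B):
    """Pairwise similarities between rows of A and rows of B."""
    return A @ B.T

rng = np.random.default_rng(0)
n, d, m = 1000, 64, 32                      # examples, embedding dim, landmarks

# Stand-ins for teacher/student embeddings of the same n examples.
embed_t = rng.standard_normal((n, d))
embed_s = embed_t + 0.1 * rng.standard_normal((n, d))

# Full kernel matrices: n x n, quadratic in the number of examples.
K_t = linear_kernel(embed_t, embed_t)
K_s = linear_kernel(embed_s, embed_s)
full_gap = np.linalg.norm(K_t - K_s)

# Partial matrices: n x m similarities to the m landmark points only.
idx = rng.choice(n, size=m, replace=False)  # placeholder: uniform sampling
E_t = linear_kernel(embed_t, embed_t[idx])
E_s = linear_kernel(embed_s, embed_s[idx])
partial_gap = np.linalg.norm(E_t - E_s)

print(f"full gap {full_gap:.2f} vs partial gap {partial_gap:.2f}")
```

In training, the squared partial gap would serve as the distillation term added to the usual task loss; how the landmarks are selected is exactly where the paper's theoretical analysis enters, and the uniform choice above is only a stand-in.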
Related papers
- Optimal Matrix-Mimetic Tensor Algebras via Variable Projection [0.0]
Matrix mimeticity arises from interpreting tensors as operators that can be multiplied, factorized, and analyzed analogously to matrices.
We learn optimal linear mappings and corresponding tensor representations without relying on prior knowledge of the data.
We provide original theory on the uniqueness of the transformation and a convergence analysis of our variable-projection-based algorithm.
arXiv Detail & Related papers (2024-06-11T04:52:23Z)
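For readers new to matrix mimeticity, the tensor-tensor product it rests on is short to write down. The paper's contribution is learning the mode-3 transform M via variable projection; the sketch below (ours) fixes M to a random orthogonal matrix purely as a placeholder.

```python
import numpy as np

def mode3(A, M):
    """Mode-3 product: mix the frontal slices of A with the matrix M."""
    return np.einsum('ik,pqk->pqi', M, A)

def star_m(A, B, M):
    """Matrix-mimetic (star-M) tensor-tensor product: transform along
    mode 3, multiply frontal slices as matrices, transform back."""
    C_hat = np.einsum('pqk,qrk->prk', mode3(A, M), mode3(B, M))
    return mode3(C_hat, np.linalg.inv(M))

rng = np.random.default_rng(0)
p, q, r, n = 4, 5, 3, 6
A = rng.standard_normal((p, q, n))
B = rng.standard_normal((q, r, n))
M = np.linalg.qr(rng.standard_normal((n, n)))[0]  # placeholder invertible M

C = star_m(A, B, M)   # shape (p, r, n); composes like a matrix product
print(C.shape)
```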
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
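A heavily simplified toy of the distribution-matching idea (ours: it matches only the mean of fixed random ReLU features, while real condensation methods match embeddings of randomly initialized networks over augmentations): a small synthetic set S is optimized so its feature statistics agree with the real data's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_syn, feat = 500, 32, 10, 128

X = rng.standard_normal((n, d)) + 1.0            # "real" data (toy)
S = rng.standard_normal((n_syn, d))              # synthetic set to learn
W = rng.standard_normal((d, feat)) / np.sqrt(d)  # fixed random feature map

mu_real = np.maximum(X @ W, 0).mean(axis=0)      # mean ReLU embedding

lr = 0.1
for _ in range(500):
    H = S @ W
    diff = np.maximum(H, 0).mean(axis=0) - mu_real
    # Gradient of ||mu_syn - mu_real||^2 w.r.t. S (ReLU features).
    S -= lr * 2.0 / n_syn * ((H > 0) * diff) @ W.T

print(np.linalg.norm(np.maximum(S @ W, 0).mean(axis=0) - mu_real))
```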
- Multiresolution kernel matrix algebra [0.0]
We show that the compression of kernel matrices by means of samplets produces optimally sparse matrices in a certain S-format.
The inverse of a kernel matrix (if it exists) is compressible in the S-format as well.
The matrix algebra is justified mathematically by pseudodifferential calculus.
arXiv Detail & Related papers (2022-11-21T17:50:22Z)
- Doubly Deformable Aggregation of Covariance Matrices for Few-shot Segmentation [25.387090319723715]
Training semantic segmentation models with few annotated samples has great potential in various real-world applications.
For the few-shot segmentation task, the main challenge is how to accurately measure the semantic correspondence between the support and query samples.
We propose to aggregate the learnable covariance matrices with a deformable 4D Transformer to effectively predict the segmentation map.
arXiv Detail & Related papers (2022-07-30T20:41:38Z)
- Asymmetric Scalable Cross-modal Hashing [51.309905690367835]
Cross-modal hashing is a successful method for solving the large-scale multimedia retrieval problem.
We propose a novel Asymmetric Scalable Cross-Modal Hashing (ASCMH) to address these issues.
Our ASCMH outperforms the state-of-the-art cross-modal hashing methods in terms of accuracy and efficiency.
arXiv Detail & Related papers (2022-07-26T04:38:47Z)
- Matrix Completion via Non-Convex Relaxation and Adaptive Correlation Learning [90.8576971748142]
We develop a novel surrogate that can be optimized via closed-form solutions.
We further exploit upperwise correlations for completion, yielding an adaptive correlation learning model.
arXiv Detail & Related papers (2022-03-04T08:50:50Z)
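The paper's non-convex surrogate and its closed-form updates are specific to that work; for orientation, here is the classical convex baseline such methods depart from, Soft-Impute, whose update is also closed-form (an SVD plus singular-value shrinkage). This is our sketch of the problem setting, not the paper's algorithm.

```python
import numpy as np

def soft_impute(X_obs, mask, lam=0.5, iters=200):
    """Nuclear-norm matrix completion (Soft-Impute): fill the missing
    entries with the current estimate, then soft-threshold the
    singular values. Each update is a closed-form SVD step."""
    Z = np.zeros_like(X_obs)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(np.where(mask, X_obs, Z),
                                 full_matrices=False)
        Z = (U * np.maximum(s - lam, 0.0)) @ Vt
    return Z

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 40))  # rank 5
mask = rng.random(A.shape) < 0.5          # observe half of the entries
A_hat = soft_impute(np.where(mask, A, 0.0), mask)
print(np.linalg.norm((A_hat - A)[~mask]) / np.linalg.norm(A[~mask]))
```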
- Fast Differentiable Matrix Square Root and Inverse Square Root [65.67315418971688]
We propose two more efficient variants to compute the differentiable matrix square root and the inverse square root.
For the forward propagation, one method uses a Matrix Taylor Polynomial (MTP), and the other uses Matrix Padé Approximants (MPA).
A series of numerical tests show that both methods yield considerable speed-up compared with the SVD or the NS iteration.
arXiv Detail & Related papers (2022-01-29T10:00:35Z)
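The NS iteration named above as a baseline is easy to sketch and shows why matmul-only square roots suit differentiable pipelines; the paper's MTP and MPA variants replace the iteration with truncated Taylor or Padé expansions. A minimal NumPy version (ours):

```python
import numpy as np

def sqrtm_newton_schulz(A, iters=20):
    """Coupled Newton-Schulz iteration for an SPD matrix A.
    Returns (A^{1/2}, A^{-1/2}) using matrix products only."""
    norm = np.linalg.norm(A)              # scale so the iteration converges
    Y, Z = A / norm, np.eye(A.shape[0])
    I = np.eye(A.shape[0])
    for _ in range(iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return Y * np.sqrt(norm), Z / np.sqrt(norm)

rng = np.random.default_rng(0)
B = rng.standard_normal((8, 8))
A = B @ B.T + 8 * np.eye(8)               # symmetric positive definite
S, S_inv = sqrtm_newton_schulz(A)
print(np.linalg.norm(S @ S - A), np.linalg.norm(S_inv @ S - np.eye(8)))
```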
- Sublinear Time Approximation of Text Similarity Matrices [50.73398637380375]
We introduce a generalization of the popular Nyström method to the indefinite setting.
Our algorithm can be applied to any similarity matrix and runs in sublinear time in the size of the matrix.
We show that our method, along with a simple variant of CUR decomposition, performs very well in approximating a variety of similarity matrices.
arXiv Detail & Related papers (2021-12-17T17:04:34Z)
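For reference, the classical Nyström formula that the paper generalizes approximates K from m sampled columns C and the corresponding m x m block W as K ≈ C pinv(W) C^T. The sketch below (ours; uniform sampling, and the full K is formed only to measure the error) uses a tanh similarity, which is indefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 50
X = rng.standard_normal((n, 10))
K = np.tanh(X @ X.T)                        # an indefinite similarity matrix

idx = rng.choice(n, size=m, replace=False)  # placeholder: uniform landmarks
C = K[:, idx]                               # n x m: the only columns accessed
W = K[np.ix_(idx, idx)]                     # m x m landmark block

# Classical Nystrom reconstruction; pinv keeps it defined for indefinite W.
K_approx = C @ np.linalg.pinv(W) @ C.T
print(np.linalg.norm(K - K_approx) / np.linalg.norm(K))
```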
- Test Set Sizing Via Random Matrix Theory [91.3755431537592]
This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression.
It defines "ideal" as satisfying the integrity metric, i.e. the empirical model error is the actual measurement noise.
This paper is the first to solve for the training and test size for any model in a way that is truly optimal.
arXiv Detail & Related papers (2021-12-11T13:18:33Z)
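A brute-force illustration (ours) of the quantities the paper characterizes in closed form: as the training fraction grows, the OLS test error approaches the noise floor sigma^2, while the shrinking test set makes the empirical estimate itself less reliable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 400, 40, 0.5
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)
y = X @ w + sigma * rng.standard_normal(n)   # noisy linear observations

for frac in (0.5, 0.7, 0.9, 0.95):
    k = int(n * frac)                        # train on k, test on the rest
    w_hat = np.linalg.lstsq(X[:k], y[:k], rcond=None)[0]
    mse = np.mean((X[k:] @ w_hat - y[k:]) ** 2)
    print(f"train frac {frac:.2f}: test MSE {mse:.3f} (noise {sigma**2:.3f})")
```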
- Statistical limits of dictionary learning: random matrix theory and the spectral replica method [28.54289139061295]
We consider increasingly complex models of matrix denoising and dictionary learning in the Bayes-optimal setting.
We introduce a novel combination of the replica method from statistical mechanics with random matrix theory, coined the spectral replica method.
arXiv Detail & Related papers (2021-09-14T12:02:32Z)