On the Efficacy of Small Self-Supervised Contrastive Models without
Distillation Signals
- URL: http://arxiv.org/abs/2107.14762v1
- Date: Fri, 30 Jul 2021 17:10:05 GMT
- Title: On the Efficacy of Small Self-Supervised Contrastive Models without
Distillation Signals
- Authors: Haizhou Shi, Youcai Zhang, Siliang Tang, Wenjie Zhu, Yaqian Li,
Yandong Guo, Yueting Zhuang
- Abstract summary: Small models perform quite poorly under the paradigm of self-supervised contrastive learning.
Existing methods usually adopt a large off-the-shelf model to transfer knowledge to the small one via knowledge distillation.
Despite their effectiveness, distillation-based methods may not be suitable for some resource-restricted scenarios.
- Score: 44.209171209780365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is a consensus that small models perform quite poorly under the paradigm
of self-supervised contrastive learning. Existing methods usually adopt a large
off-the-shelf model to transfer knowledge to the small one via knowledge
distillation. Despite their effectiveness, distillation-based methods may not
be suitable for some resource-restricted scenarios due to the huge
computational expenses of deploying a large model. In this paper, we study the
issue of training self-supervised small models without distillation signals. We
first evaluate the representation spaces of the small models and make two
non-negligible observations: (i) small models can complete the pretext task
without overfitting despite their limited capacity; (ii) small models universally
suffer from the problem of over-clustering. Then we verify multiple assumptions that
are considered to alleviate the over-clustering phenomenon. Finally, we combine
the validated techniques and improve the baselines of five small architectures
by considerable margins, which indicates that training small self-supervised
contrastive models is feasible even without distillation signals.
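The abstract does not spell out how over-clustering is quantified. Below is a minimal sketch of one plausible diagnostic, assuming features from a small contrastive encoder are clustered with k-means and compared against ground-truth labels; the synthetic data, cluster counts, and metrics are illustrative assumptions, not the paper's protocol.

```python
# Hypothetical over-clustering diagnostic: cluster SSL features with more
# centroids than true classes and check how each class spreads across clusters.
# Random features stand in for a small encoder's outputs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
num_classes = 10
features = rng.normal(size=(2000, 128))           # placeholder for encoder outputs
labels = rng.integers(0, num_classes, size=2000)  # ground-truth class labels

# Fit k-means with deliberately more clusters than classes.
kmeans = KMeans(n_clusters=4 * num_classes, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(features)

# If samples of one class are split across many clusters (low NMI, many
# clusters per class), the representation would be read as "over-clustered".
nmi = normalized_mutual_info_score(labels, cluster_ids)
clusters_per_class = np.mean(
    [len(np.unique(cluster_ids[labels == c])) for c in range(num_classes)]
)
print(f"NMI: {nmi:.3f}, avg clusters per class: {clusters_per_class:.1f}")
```

In this toy setup, a low NMI together with many clusters per class would indicate the over-clustering behaviour the abstract describes.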
Related papers
- Small Models Struggle to Learn from Strong Reasoners [14.895026967556088]
Small models do not consistently benefit from long chain-of-thought reasoning or distillation from larger models.
We propose Mix Distillation, a strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models.
Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone.
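Mix Distillation, as summarized above, is a data-composition strategy. The sketch below shows one way a mixed pool of long- and short-CoT examples could be assembled; the ratio, function name, and placeholder data are assumptions for illustration, not the paper's recipe.

```python
# Hypothetical sketch of a Mix Distillation-style training set: combine
# long and short chain-of-thought (CoT) examples at a chosen ratio.
import random

def mix_cot_examples(long_cot, short_cot, long_fraction=0.3, total=1000, seed=0):
    """Sample a shuffled training list drawing from both CoT pools."""
    rng = random.Random(seed)
    n_long = int(total * long_fraction)
    n_short = total - n_long
    mixed = rng.choices(long_cot, k=n_long) + rng.choices(short_cot, k=n_short)
    rng.shuffle(mixed)
    return mixed

# Toy usage with placeholder strings standing in for (prompt, CoT) pairs.
long_pool = [f"long_cot_{i}" for i in range(50)]
short_pool = [f"short_cot_{i}" for i in range(50)]
train_set = mix_cot_examples(long_pool, short_pool)
print(len(train_set), train_set[:3])
```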
arXiv Detail & Related papers (2025-02-17T18:56:15Z)
- LoRE-Merging: Exploring Low-Rank Estimation For Large Language Model Merging [10.33844295243509]
We propose a unified framework for model merging based on low-rank estimation of task vectors without the need for access to the base model, named LoRE-Merging.
Our approach is motivated by the observation that task vectors from fine-tuned models frequently exhibit a limited number of dominant singular values, making low-rank estimations less prone to interference.
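The summary points to task vectors having a limited number of dominant singular values. The sketch below illustrates a generic truncated-SVD low-rank estimate of a task-vector-like matrix; it shows the underlying idea only and is not LoRE-Merging's actual optimization (which, per the abstract, avoids access to the base model).

```python
# Illustrative low-rank estimation of a task-vector-like matrix via truncated SVD.
import numpy as np

def low_rank_estimate(task_vector, rank):
    """Keep only the top-`rank` singular directions of a 2-D task vector."""
    u, s, vt = np.linalg.svd(task_vector, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]

rng = np.random.default_rng(0)
# Synthetic task vector with a few dominant singular values plus noise.
tv = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 256)) \
     + 0.01 * rng.normal(size=(256, 256))
approx = low_rank_estimate(tv, rank=8)
rel_err = np.linalg.norm(tv - approx) / np.linalg.norm(tv)
print(f"relative error of rank-8 estimate: {rel_err:.4f}")
```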
arXiv Detail & Related papers (2025-02-15T10:18:46Z)
- Adversarial Transferability in Deep Denoising Models: Theoretical Insights and Robustness Enhancement via Out-of-Distribution Typical Set Sampling [6.189440665620872]
Deep learning-based image denoising models demonstrate remarkable performance, but their lack of robustness analysis remains a significant concern.
A major issue is that these models are susceptible to adversarial attacks, where small, carefully crafted perturbations to input data can cause them to fail.
We propose a novel adversarial defense method: the Out-of-Distribution Typical Set Sampling Training strategy.
arXiv Detail & Related papers (2024-12-08T13:47:57Z)
- On the Surprising Efficacy of Distillation as an Alternative to Pre-Training Small Models [7.062887337934677]
We propose that small models may not need to absorb the cost of pre-training to reap its benefits.
We observe that, when distilled on a task from a pre-trained model, a small model can match or surpass the performance it would achieve if it were pre-trained and then fine-tuned on that task.
arXiv Detail & Related papers (2024-04-04T07:38:11Z)
- Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
arXiv Detail & Related papers (2023-10-26T17:59:46Z)
- Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers [66.36045164286854]
We analyze a set of existing bias features and demonstrate that there is no single model that works best in all cases.
By choosing an appropriate bias model, we can obtain a better robustness result than baselines with a more sophisticated model design.
arXiv Detail & Related papers (2022-10-28T17:52:10Z)
- Predicting on the Edge: Identifying Where a Larger Model Does Better [61.793778186198864]
We show that large models have the largest improvement on examples where the small model is most uncertain.
We show that a switcher model which defers examples to a larger model when a small model is uncertain can achieve striking improvements in performance and resource usage.
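The deferral rule itself is not detailed in the summary. Below is a minimal sketch of one common choice, entropy thresholding on the small model's softmax output; the threshold and synthetic probabilities are illustrative assumptions, not the paper's method.

```python
# Hypothetical switcher rule: defer an example to the larger model when the
# small model's predictive entropy exceeds a threshold.
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy of each row of class probabilities."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def route_to_large(small_probs, entropy_threshold=1.0):
    """Boolean mask: True means send this example to the large model."""
    return predictive_entropy(small_probs) > entropy_threshold

# Toy usage with random logits standing in for the small model's outputs.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(route_to_large(probs))
```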
arXiv Detail & Related papers (2022-02-15T18:53:14Z)
- When in Doubt, Summon the Titans: Efficient Inference with Large Models [80.2673230098021]
We propose a two-stage framework based on distillation that realizes the modelling benefits of large models.
We use the large teacher models to guide the lightweight student models so that they need only make correct predictions on a subset of "easy" examples.
Our proposed use of distillation to only handle easy instances allows for a more aggressive trade-off in the student size, thereby reducing the amortized cost of inference.
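As a rough illustration of restricting the student to "easy" instances, the sketch below filters examples by teacher confidence before distillation; the confidence threshold and synthetic logits are assumptions, not the paper's criterion.

```python
# Hypothetical selection of "easy" examples by teacher confidence; the student
# would be distilled only on this subset, with harder inputs left to the teacher.
import numpy as np

def select_easy(teacher_probs, confidence_threshold=0.9):
    """Mask of examples the teacher predicts with high confidence."""
    return teacher_probs.max(axis=-1) >= confidence_threshold

rng = np.random.default_rng(0)
teacher_logits = 3.0 * rng.normal(size=(8, 10))
teacher_probs = np.exp(teacher_logits) / np.exp(teacher_logits).sum(axis=-1, keepdims=True)
easy_mask = select_easy(teacher_probs)
print(f"{easy_mask.sum()} of {len(easy_mask)} examples kept for student distillation")
```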
arXiv Detail & Related papers (2021-10-19T22:56:49Z)
- Learning Diverse Representations for Fast Adaptation to Distribution Shift [78.83747601814669]
We present a method for learning multiple models, incorporating an objective that pressures each to learn a distinct way to solve the task.
We demonstrate our framework's ability to facilitate rapid adaptation to distribution shift.
arXiv Detail & Related papers (2020-06-12T12:23:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.