Related papers: To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DRO

To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DRO

URL: http://arxiv.org/abs/2404.04575v3
Date: Sun, 16 Jun 2024 12:43:39 GMT
Title: To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DRO
Authors: Zi-Hao Qiu, Siqi Guo, Mao Xu, Tuo Zhao, Lijun Zhang, Tianbao Yang,
Abstract summary: We present a principled framework for learning a small yet generalizable temperature prediction network (TempNet) to improve LFMs. Our experiments on LLMs and CLIP models demonstrate that TempNet greatly improves the performance of existing solutions or models.
Score: 68.69840111477367
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The temperature parameter plays a profound role during training and/or inference with large foundation models (LFMs) such as large language models (LLMs) and CLIP models. Particularly, it adjusts the logits in the softmax function in LLMs, which is crucial for next token generation, and it scales the similarities in the contrastive loss for training CLIP models. A significant question remains: Is it viable to learn a neural network to predict a personalized temperature of any input data for enhancing LFMs"? In this paper, we present a principled framework for learning a small yet generalizable temperature prediction network (TempNet) to improve LFMs. Our solution is composed of a novel learning framework with a robust loss underpinned by constrained distributionally robust optimization (DRO), and a properly designed TempNet with theoretical inspiration. TempNet can be trained together with a large foundation model from scratch or learned separately given a pretrained foundation model. It is not only useful for predicting personalized temperature to promote the training of LFMs but also generalizable and transferable to new tasks. Our experiments on LLMs and CLIP models demonstrate that TempNet greatly improves the performance of existing solutions or models, e.g. Table 1. The code to reproduce the experimental results in this paper can be found at https://github.com/zhqiu/TempNet.

Related papers

Boosting Large Language Models with Mask Fine-Tuning [60.56962908455601]
We introduce Mask Fine-Tuning (MFT) to show that properly breaking the integrity of the model can surprisingly lead to improved performance. Experiments show that MFT gains a consistent performance boost across various domains and backbones.
arXiv Detail & Related papers (2025-03-27T20:17:57Z)
TabDPT: Scaling Tabular Foundation Models on Real Data [20.00390825519329]
We propose an approach to combine ICL-based retrieval with self supervised learning to train foundation models.<n>We show that incorporating real data during the pre-training phase can lead to significantly faster training and better generalization to unseen data.<n>Our resulting model, TabDPT, achieves top performance on both regression (CTR23) and classification (CC18) benchmarks.
arXiv Detail & Related papers (2024-10-23T18:00:00Z)
Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Large Language Model (LLM) pretraining traditionally relies on autoregressive language modeling on randomly sampled data blocks from web-scale datasets. We take inspiration from human learning techniques like spaced repetition to hypothesize that random data sampling for LLMs leads to high training cost and low quality models which tend to forget data. In order to effectively commit web-scale information to long-term memory, we propose the LFR (Learn, Focus, and Review) pedagogy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
Mitigating Noise Detriment in Differentially Private Federated Learning with Model Pre-training [27.1846697092374]
Pre-training exploits public datasets to pre-train an advanced machine learning model. We are the first to explore how model pre-training can mitigate noise detriment in differentially private federated learning.
arXiv Detail & Related papers (2024-08-18T13:48:10Z)
Save It All: Enabling Full Parameter Tuning for Federated Large Language Models via Cycle Block Gradient Descent [15.463595798992621]
Large language models (LLMs) have revolutionized the deep learning paradigm, yielding impressive results across a wide array of tasks. Existing solutions make the unrealistic assumption that the entire model is exchanged for training. We introduce a novel method for the efficient training and fine-tuning of LLMs in FL, with minimal resource consumption.
arXiv Detail & Related papers (2024-06-17T03:49:44Z)
A Survey on Efficient Federated Learning Methods for Foundation Model Training [62.473245910234304]
Federated Learning (FL) has become an established technique to facilitate privacy-preserving collaborative training across a multitude of clients. In the wake of Foundation Models (FM), the reality is different for many deep learning applications. We discuss the benefits and drawbacks of parameter-efficient fine-tuning (PEFT) for FL applications.
arXiv Detail & Related papers (2024-01-09T10:22:23Z)
Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training [58.20089993899729]
This paper proposes TempBalance, a straightforward yet effective layerwise learning rate method. We show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization. We also show that TempBalance outperforms a number of state-of-the-art metrics and schedulers.
arXiv Detail & Related papers (2023-12-01T05:38:17Z)
Federated Learning over Hierarchical Wireless Networks: Training Latency Minimization via Submodel Partitioning [15.311309249848739]
Hierarchical independent submodel training (HIST) is a new FL methodology that aims to address these issues in hierarchical cloud-edge-client networks. We demonstrate how HIST can be augmented with over-the-air computation (AirComp) to further enhance the efficiency of the model aggregation over the edge cells.
arXiv Detail & Related papers (2023-10-27T04:42:59Z)
CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM. For learning methods, we explore the claim of a "free lunch" hypothesis. For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
Ensemble Distillation for Robust Model Fusion in Federated Learning [72.61259487233214]
Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model. In most of the current training schemes the central model is refined by averaging the parameters of the server model and the updated parameters from the client side. We propose ensemble distillation for model fusion, i.e. training the central classifier through unlabeled data on the outputs of the models from the clients.
arXiv Detail & Related papers (2020-06-12T14:49:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.