HARE: HumAn pRiors, a key to small language model Efficiency
- URL: http://arxiv.org/abs/2406.11410v2
- Date: Tue, 18 Jun 2024 11:59:03 GMT
- Title: HARE: HumAn pRiors, a key to small language model Efficiency
- Authors: Lingyun Zhang, Bin Jin, Gaojian Ge, Lunhui Liu, Xuewen Shen, Mingyong Wu, Houqian Zhang, Yongneng Jiang, Shiqi Chen, Shi Pu
- Abstract summary: Human priors play a crucial role in efficiently utilizing data in deep learning.
Existing Small Language Models mainly rely on web-scraped large-scale training data.
We propose a principle to leverage human priors for data construction.
- Score: 6.253561984966316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human priors play a crucial role in efficiently utilizing data in deep learning. However, with the development of large language models (LLMs), there is an increasing emphasis on scaling both model size and data volume, which often diminishes the importance of human priors in data construction. Influenced by these trends, existing Small Language Models (SLMs) mainly rely on web-scraped large-scale training data, neglecting the proper incorporation of human priors. This oversight limits the training efficiency of language models in resource-constrained settings. In this paper, we propose a principle to leverage human priors for data construction. This principle emphasizes achieving high-performance SLMs by training on a concise dataset that accommodates both semantic diversity and data quality consistency, while avoiding benchmark data leakage. Following this principle, we train an SLM named HARE-1.1B. Extensive experiments on large-scale benchmark datasets demonstrate that HARE-1.1B performs favorably against state-of-the-art SLMs, validating the effectiveness of the proposed principle. Additionally, this provides new insights into efficient language model training in resource-constrained environments from the view of human priors.
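As an illustration of how such a principle might be operationalized, here is a hedged sketch of a curation pipeline, not the authors' actual method: the `quality_score` function, the 0.5 threshold, and the 13-gram leakage check are all assumptions.

```python
import hashlib
from typing import Callable, Iterable

def ngrams(text: str, n: int = 13) -> set:
    """Word-level n-grams; n-gram overlap is a common leakage test."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def curate(corpus: Iterable[str],
           quality_score: Callable[[str], float],  # assumed external scorer
           benchmarks: Iterable[str],
           min_quality: float = 0.5,
           n: int = 13) -> list:
    """Keep documents that are unique, above a quality floor, and free
    of n-gram overlap with benchmark data."""
    bench_grams = set()
    for b in benchmarks:
        bench_grams |= ngrams(b, n)

    seen, kept = set(), []
    for doc in corpus:
        h = hashlib.md5(doc.strip().lower().encode()).hexdigest()
        if h in seen:                          # exact-duplicate filter
            continue
        seen.add(h)
        if quality_score(doc) < min_quality:   # quality-consistency gate
            continue
        if ngrams(doc, n) & bench_grams:       # benchmark-leakage gate
            continue
        kept.append(doc)
    return kept
```

Semantic diversity, the remaining requirement of the principle, would need something like topic clustering over the kept documents, which this sketch omits.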
Related papers
- Training Data for Large Language Model [2.1178416840822027]
ChatGPT surpassed previous models in both parameter count and the scale of its pretraining corpus.
It achieved revolutionary performance improvements through fine-tuning on vast amounts of high-quality, human-annotated data.
This paper summarizes the current state of pretraining and fine-tuning data for training large-scale language models.
arXiv Detail & Related papers (2024-11-12T11:09:58Z) - Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
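As a rough sketch of the idea (a first-order approximation, not necessarily the paper's exact estimator): each sample's contribution at a step can be credited as the learning rate times the dot product between its gradient and the validation-loss gradient, accumulated across a single run.

```python
import numpy as np

def in_run_contribution(grad_samples: np.ndarray,
                        grad_val: np.ndarray,
                        lr: float) -> np.ndarray:
    """First-order per-step credit for each training sample: the
    validation-loss reduction one SGD step on that sample would cause,
    i.e. lr * <per-sample gradient, validation gradient>."""
    return lr * grad_samples @ grad_val

# Accumulated over the steps of one training run (toy stand-in gradients):
rng = np.random.default_rng(0)
scores = np.zeros(4)
for step in range(10):
    g = rng.normal(size=(4, 8))    # per-sample gradients at this step
    g_val = rng.normal(size=8)     # validation-loss gradient at this step
    scores += in_run_contribution(g, g_val, lr=0.1)
print(scores)  # larger score -> larger estimated contribution
```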
arXiv Detail & Related papers (2024-06-16T17:09:24Z) - Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF [79.98542868281471]
Reinforcement Learning from Human Feedback (RLHF) is a technique that aligns language models closely with human-centric values.
It is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective.
This paper delves into these issues, leveraging theoretical insights to design an improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS).
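One reading of the IDS idea, as a hedged sketch (the interpolation rule and the coefficient `beta` below are assumptions, not the paper's exact algorithm): after each epoch, replace the hard labels with a mixture of the old labels and the model's current predictions, damping overfitting to noisy preference labels.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iterative_data_smoothing(X, y, epochs=50, lr=0.1, beta=0.9):
    """Toy logistic reward model trained on soft labels that are
    smoothed toward the model's own predictions each epoch."""
    w = np.zeros(X.shape[1])
    y_soft = y.astype(float)
    for _ in range(epochs):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y_soft) / len(y)    # gradient step on soft labels
        y_soft = beta * y_soft + (1 - beta) * p  # smooth labels toward predictions
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 5))
y = (X[:, 0] > 0).astype(float)
w = iterative_data_smoothing(X, y)
```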
arXiv Detail & Related papers (2024-01-29T17:43:42Z) - Self-Influence Guided Data Reweighting for Language Model Pre-training [46.57714637505164]
Language Models (LMs) pre-trained with self-supervision on large text corpora have become the default starting point for developing models for various NLP tasks.
All data samples in the corpus are treated with equal importance during LM pre-training.
Because data samples vary in relevance and quality, assigning them all equal importance may not be the optimal choice.
We propose PRESENCE, a method that jointly performs sample reweighting and pre-training, using self-influence (SI) scores as an indicator of sample importance.
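A minimal sketch of SI-based reweighting, assuming a TracIn-style proxy (squared per-sample gradient norm) for self-influence; PRESENCE's actual scoring and its joint reweighting schedule differ in the details.

```python
import numpy as np

def self_influence_weights(per_sample_grads: np.ndarray,
                           temperature: float = 1.0) -> np.ndarray:
    """Proxy SI score = squared per-sample gradient norm, turned into
    sampling weights via a softmax. Both choices are illustrative."""
    si = (per_sample_grads ** 2).sum(axis=1)
    z = (si - si.max()) / temperature        # stabilized softmax
    w = np.exp(z)
    return w / w.sum()

grads = np.random.default_rng(2).normal(size=(16, 32))
weights = self_influence_weights(grads)
# During pre-training: weighted_loss = (weights * per_sample_losses).sum()
```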
arXiv Detail & Related papers (2023-11-02T01:00:46Z) - Federated Learning for Early Dropout Prediction on Healthy Ageing Applications [0.0]
We present a federated machine learning (FML) approach that minimizes privacy concerns and enables distributed training, without transferring individual data.
Our results show that data selection and class imbalance handling techniques significantly improve the predictive accuracy of models trained under FML.
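The summary does not name the aggregation scheme; federated averaging (FedAvg) is the standard baseline, so the sketch below assumes it. Clients train locally, and only model parameters, never raw data, reach the server.

```python
import numpy as np

def fed_avg(client_params: list, client_sizes: list) -> np.ndarray:
    """Server step: average client model parameters weighted by local
    dataset size; individual data never leaves the clients."""
    total = sum(client_sizes)
    return sum(p * (n / total) for p, n in zip(client_params, client_sizes))

# Each client trains locally, then sends only its parameters:
clients = [np.random.default_rng(i).normal(size=10) for i in range(3)]
sizes = [120, 80, 200]
global_model = fed_avg(clients, sizes)
```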
arXiv Detail & Related papers (2023-09-08T13:17:06Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We first investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that dataset size, model parameters, and training objectives all play significant roles.
arXiv Detail & Related papers (2023-05-22T17:02:15Z) - INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models [40.54353850357839]
We show how we can employ submodular optimization to select highly representative subsets of the training corpora.
We show that the resulting models achieve up to $\sim 99\%$ of the performance of fully-trained models.
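A standard way to select such subsets is greedy maximization of a facility-location objective over pairwise similarities; the sketch below shows that general technique, not INGENIOUS's exact objective or feature space.

```python
import numpy as np

def greedy_facility_location(sim: np.ndarray, k: int) -> list:
    """Greedily pick k items maximizing the facility-location objective
    f(S) = sum_i max_{j in S} sim[i, j]. The objective is submodular,
    so greedy selection carries a (1 - 1/e) approximation guarantee."""
    n = sim.shape[0]
    selected = []
    best_cover = np.zeros(n)              # current max similarity per item
    for _ in range(k):
        gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf         # never re-pick an item
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected

# sim[i, j] = similarity between corpus items, e.g. cosine of embeddings
emb = np.random.default_rng(3).normal(size=(100, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
subset = greedy_facility_location(emb @ emb.T, k=10)
```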
arXiv Detail & Related papers (2023-05-11T09:24:41Z) - Measuring Causal Effects of Data Statistics on Language Model's `Factual' Predictions [59.284907093349425]
Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models.
We provide a language, grounded in a causal framework, for describing how training data influences predictions.
Our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone.
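As a toy stand-in for such observational estimation (the column names and the stratified-adjustment estimator below are illustrative assumptions, not the paper's framework): treat a training-data statistic as the treatment and adjust for a confounder by stratification.

```python
import numpy as np
import pandas as pd

def adjusted_ate(df: pd.DataFrame, treat: str, outcome: str,
                 confounder: str) -> float:
    """Backdoor-adjusted average treatment effect from observational
    rows: compute the effect within each confounder stratum, then
    average the strata weighted by their size."""
    effects, weights = [], []
    for _, g in df.groupby(confounder):
        t1 = g.loc[g[treat] == 1, outcome]
        t0 = g.loc[g[treat] == 0, outcome]
        if len(t1) and len(t0):
            effects.append(t1.mean() - t0.mean())
            weights.append(len(g))
    return float(np.average(effects, weights=weights))

# Hypothetical setup: does high subject-object co-occurrence in the
# training data (treatment) raise factual accuracy (outcome), adjusting
# for term frequency (confounder)?
rng = np.random.default_rng(4)
df = pd.DataFrame({
    "high_cooccurrence": rng.integers(0, 2, 500),
    "freq_bin": rng.integers(0, 5, 500),
})
df["model_correct"] = ((0.3 + 0.2 * df["high_cooccurrence"]
                        + 0.05 * df["freq_bin"]) > rng.random(500)).astype(int)
print(adjusted_ate(df, "high_cooccurrence", "model_correct", "freq_bin"))
```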
arXiv Detail & Related papers (2022-07-28T17:36:24Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity makes it possible to reduce the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
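For concreteness, one common formulation of Feature Density divides the number of unique features by the total feature count; the sketch below assumes that normalization (the paper's exact definition may differ) and uses scikit-learn for feature extraction.

```python
from sklearn.feature_extraction.text import CountVectorizer

def feature_density(docs: list, ngram_range=(1, 1)) -> float:
    """FD = number of unique features / total feature occurrences
    (one plausible normalization, assumed here)."""
    X = CountVectorizer(ngram_range=ngram_range).fit_transform(docs)
    return float(X.shape[1] / X.sum())

docs = ["you are awful", "have a nice day", "awful awful day"]
print(feature_density(docs))   # higher FD suggests a more complex dataset
```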
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining [51.19885385587916]
We conduct studies on semi-supervised learning in the task of text classification under the context of large-scale LM pretraining.
Our work marks an initial step in understanding the behavior of semi-supervised learning models under the context of large-scale pretraining.
arXiv Detail & Related papers (2020-11-17T13:39:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.