HARE: HumAn pRiors, a key to small language model Efficiency
- URL: http://arxiv.org/abs/2406.11410v2
- Date: Tue, 18 Jun 2024 11:59:03 GMT
- Title: HARE: HumAn pRiors, a key to small language model Efficiency
- Authors: Lingyun Zhang, Bin Jin, Gaojian Ge, Lunhui Liu, Xuewen Shen, Mingyong Wu, Houqian Zhang, Yongneng Jiang, Shiqi Chen, Shi Pu
- Abstract summary: Human priors play a crucial role in efficiently utilizing data in deep learning.
Existing Small Language Models mainly rely on web-scraped large-scale training data.
We propose a principle to leverage human priors for data construction.
- Score: 6.253561984966316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human priors play a crucial role in efficiently utilizing data in deep learning. However, with the development of large language models (LLMs), there is an increasing emphasis on scaling both model size and data volume, which often diminishes the importance of human priors in data construction. Influenced by these trends, existing Small Language Models (SLMs) mainly rely on web-scraped large-scale training data, neglecting the proper incorporation of human priors. This oversight limits the training efficiency of language models in resource-constrained settings. In this paper, we propose a principle to leverage human priors for data construction. This principle emphasizes achieving high-performance SLMs by training on a concise dataset that accommodates both semantic diversity and data quality consistency, while avoiding benchmark data leakage. Following this principle, we train an SLM named HARE-1.1B. Extensive experiments on large-scale benchmark datasets demonstrate that HARE-1.1B performs favorably against state-of-the-art SLMs, validating the effectiveness of the proposed principle. Additionally, this provides new insights into efficient language model training in resource-constrained environments from the view of human priors.
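As an illustration of how such a principle might be operationalized, here is a hedged sketch of a curation pipeline, not the authors' actual method: the `quality_score` function, the 0.5 threshold, and the 13-gram leakage check are all assumptions.

```python
import hashlib
from typing import Callable, Iterable

def ngrams(text: str, n: int = 13) -> set:
    """Word-level n-grams; n-gram overlap is a common leakage test."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def curate(corpus: Iterable[str],
           quality_score: Callable[[str], float],  # assumed external scorer
           benchmarks: Iterable[str],
           min_quality: float = 0.5,
           n: int = 13) -> list:
    """Keep documents that are unique, above a quality floor, and free
    of n-gram overlap with benchmark data."""
    bench_grams = set()
    for b in benchmarks:
        bench_grams |= ngrams(b, n)

    seen, kept = set(), []
    for doc in corpus:
        h = hashlib.md5(doc.strip().lower().encode()).hexdigest()
        if h in seen:                          # exact-duplicate filter
            continue
        seen.add(h)
        if quality_score(doc) < min_quality:   # quality-consistency gate
            continue
        if ngrams(doc, n) & bench_grams:       # benchmark-leakage gate
            continue
        kept.append(doc)
    return kept
```

Semantic diversity, the remaining requirement of the principle, would need something like topic clustering over the kept documents, which this sketch omits.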
Related papers
- Training Data for Large Language Model [2.1178416840822027]
ChatGPT surpassed previous models in both parameter count and the scale of its pretraining corpus.
It achieved revolutionary performance improvements through fine-tuning on vast amounts of high-quality, human-annotated data.
This paper summarizes the current state of pretraining and fine-tuning data for training large-scale language models.
arXiv Detail & Related papers (2024-11-12T11:09:58Z) - Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
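As a rough sketch of the idea (a first-order approximation, not necessarily the paper's exact estimator): each sample's contribution at a step can be credited as the learning rate times the dot product between its gradient and the validation-loss gradient, accumulated across a single run.

```python
import numpy as np

def in_run_contribution(grad_samples: np.ndarray,
                        grad_val: np.ndarray,
                        lr: float) -> np.ndarray:
    """First-order per-step credit for each training sample: the
    validation-loss reduction one SGD step on that sample would cause,
    i.e. lr * <per-sample gradient, validation gradient>."""
    return lr * grad_samples @ grad_val

# Accumulated over the steps of one training run (toy stand-in gradients):
rng = np.random.default_rng(0)
scores = np.zeros(4)
for step in range(10):
    g = rng.normal(size=(4, 8))    # per-sample gradients at this step
    g_val = rng.normal(size=8)     # validation-loss gradient at this step
    scores += in_run_contribution(g, g_val, lr=0.1)
print(scores)  # larger score -> larger estimated contribution
```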
arXiv Detail & Related papers (2024-06-16T17:09:24Z) - Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF [79.98542868281471]
Reinforcement Learning from Human Feedback (RLHF) is a technique that aligns language models closely with human-centric values.
It is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective.
This paper delves into these issues, leveraging theoretical insights to design an improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS).
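One reading of the IDS idea, as a hedged sketch (the interpolation rule and the coefficient `beta` below are assumptions, not the paper's exact algorithm): after each epoch, replace the hard labels with a mixture of the old labels and the model's current predictions, damping overfitting to noisy preference labels.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iterative_data_smoothing(X, y, epochs=50, lr=0.1, beta=0.9):
    """Toy logistic reward model trained on soft labels that are
    smoothed toward the model's own predictions each epoch."""
    w = np.zeros(X.shape[1])
    y_soft = y.astype(float)
    for _ in range(epochs):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y_soft) / len(y)    # gradient step on soft labels
        y_soft = beta * y_soft + (1 - beta) * p  # smooth labels toward predictions
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 5))
y = (X[:, 0] > 0).astype(float)
w = iterative_data_smoothing(X, y)
```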
arXiv Detail & Related papers (2024-01-29T17:43:42Z) - Self-Influence Guided Data Reweighting for Language Model Pre-training [46.57714637505164]
Language Models (LMs) pre-trained with self-supervision on large text corpora have become the default starting point for developing models for various NLP tasks.
All data samples in the corpus are treated with equal importance during LM pre-training.
Because data samples vary in relevance and quality, assigning them all equal importance may not be the optimal choice.
We propose PRESENCE, a method that jointly performs sample reweighting and pre-training, using self-influence (SI) scores as an indicator of sample importance.
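A minimal sketch of SI-based reweighting, assuming a TracIn-style proxy (squared per-sample gradient norm) for self-influence; PRESENCE's actual scoring and its joint reweighting schedule differ in the details.

```python
import numpy as np

def self_influence_weights(per_sample_grads: np.ndarray,
                           temperature: float = 1.0) -> np.ndarray:
    """Proxy SI score = squared per-sample gradient norm, turned into
    sampling weights via a softmax. Both choices are illustrative."""
    si = (per_sample_grads ** 2).sum(axis=1)
    z = (si - si.max()) / temperature        # stabilized softmax
    w = np.exp(z)
    return w / w.sum()

grads = np.random.default_rng(2).normal(size=(16, 32))
weights = self_influence_weights(grads)
# During pre-training: weighted_loss = (weights * per_sample_losses).sum()
```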
arXiv Detail & Related papers (2023-11-02T01:00:46Z) - Federated Learning for Early Dropout Prediction on Healthy Ageing Applications [0.0]
We present a federated machine learning (FML) approach that minimizes privacy concerns and enables distributed training, without transferring individual data.
Our results show that data selection and class imbalance handling techniques significantly improve the predictive accuracy of models trained under FML.
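The summary does not name the aggregation scheme; federated averaging (FedAvg) is the standard baseline, so the sketch below assumes it. Clients train locally, and only model parameters, never raw data, reach the server.

```python
import numpy as np

def fed_avg(client_params: list, client_sizes: list) -> np.ndarray:
    """Server step: average client model parameters weighted by local
    dataset size; individual data never leaves the clients."""
    total = sum(client_sizes)
    return sum(p * (n / total) for p, n in zip(client_params, client_sizes))

# Each client trains locally, then sends only its parameters:
clients = [np.random.default_rng(i).normal(size=10) for i in range(3)]
sizes = [120, 80, 200]
global_model = fed_avg(clients, sizes)
```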
arXiv Detail & Related papers (2023-09-08T13:17:06Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We first investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that dataset size, model parameters, and training objectives all play significant roles.
arXiv Detail & Related papers (2023-05-22T17:02:15Z) - INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models [40.54353850357839]
We show how we can employ submodular optimization to select highly representative subsets of the training corpora.
We show that the resulting models achieve up to $\sim 99\%$ of the performance of fully-trained models.
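A standard way to select such subsets is greedy maximization of a facility-location objective over pairwise similarities; the sketch below shows that general technique, not INGENIOUS's exact objective or feature space.

```python
import numpy as np

def greedy_facility_location(sim: np.ndarray, k: int) -> list:
    """Greedily pick k items maximizing the facility-location objective
    f(S) = sum_i max_{j in S} sim[i, j]. The objective is submodular,
    so greedy selection carries a (1 - 1/e) approximation guarantee."""
    n = sim.shape[0]
    selected = []
    best_cover = np.zeros(n)              # current max similarity per item
    for _ in range(k):
        gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf         # never re-pick an item
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected

# sim[i, j] = similarity between corpus items, e.g. cosine of embeddings
emb = np.random.default_rng(3).normal(size=(100, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
subset = greedy_facility_location(emb @ emb.T, k=10)
```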
arXiv Detail & Related papers (2023-05-11T09:24:41Z) - Measuring Causal Effects of Data Statistics on Language Model's `Factual' Predictions [59.284907093349425]
Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models.
We provide a language, grounded in a causal framework, for describing how training data influences predictions.
Our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone.
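As a toy stand-in for such observational estimation (the column names and the stratified-adjustment estimator below are illustrative assumptions, not the paper's framework): treat a training-data statistic as the treatment and adjust for a confounder by stratification.

```python
import numpy as np
import pandas as pd

def adjusted_ate(df: pd.DataFrame, treat: str, outcome: str,
                 confounder: str) -> float:
    """Backdoor-adjusted average treatment effect from observational
    rows: compute the effect within each confounder stratum, then
    average the strata weighted by their size."""
    effects, weights = [], []
    for _, g in df.groupby(confounder):
        t1 = g.loc[g[treat] == 1, outcome]
        t0 = g.loc[g[treat] == 0, outcome]
        if len(t1) and len(t0):
            effects.append(t1.mean() - t0.mean())
            weights.append(len(g))
    return float(np.average(effects, weights=weights))

# Hypothetical setup: does high subject-object co-occurrence in the
# training data (treatment) raise factual accuracy (outcome), adjusting
# for term frequency (confounder)?
rng = np.random.default_rng(4)
df = pd.DataFrame({
    "high_cooccurrence": rng.integers(0, 2, 500),
    "freq_bin": rng.integers(0, 5, 500),
})
df["model_correct"] = ((0.3 + 0.2 * df["high_cooccurrence"]
                        + 0.05 * df["freq_bin"]) > rng.random(500)).astype(int)
print(adjusted_ate(df, "high_cooccurrence", "model_correct", "freq_bin"))
```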
arXiv Detail & Related papers (2022-07-28T17:36:24Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity makes it possible to reduce the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
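For concreteness, one common formulation of Feature Density divides the number of unique features by the total feature count; the sketch below assumes that normalization (the paper's exact definition may differ) and uses scikit-learn for feature extraction.

```python
from sklearn.feature_extraction.text import CountVectorizer

def feature_density(docs: list, ngram_range=(1, 1)) -> float:
    """FD = number of unique features / total feature occurrences
    (one plausible normalization, assumed here)."""
    X = CountVectorizer(ngram_range=ngram_range).fit_transform(docs)
    return float(X.shape[1] / X.sum())

docs = ["you are awful", "have a nice day", "awful awful day"]
print(feature_density(docs))   # higher FD suggests a more complex dataset
```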
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining [51.19885385587916]
We conduct studies on semi-supervised learning in the task of text classification under the context of large-scale LM pretraining.
Our work marks an initial step in understanding the behavior of semi-supervised learning models under the context of large-scale pretraining.
arXiv Detail & Related papers (2020-11-17T13:39:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.