ChapGTP, ILLC's Attempt at Raising a BabyLM: Improving Data Efficiency
by Automatic Task Formation
- URL: http://arxiv.org/abs/2310.11282v1
- Date: Tue, 17 Oct 2023 14:06:06 GMT
- Title: ChapGTP, ILLC's Attempt at Raising a BabyLM: Improving Data Efficiency
by Automatic Task Formation
- Authors: Jaap Jumelet, Michael Hanna, Marianne de Heer Kloots, Anna Langedijk,
Charlotte Pouw, Oskar van der Wal
- Abstract summary: We present the submission of the ILLC at the University of Amsterdam to the BabyLM challenge (Warstadt et al., 2023).
Our final model, ChapGTP, is a masked language model that was trained for 200 epochs, aided by a novel data augmentation technique called Automatic Task Formation.
We discuss in detail the performance of this model on the three evaluation suites: BLiMP, (Super)GLUE, and MSGS.
- Score: 5.472046616411226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the submission of the ILLC at the University of Amsterdam to the
BabyLM challenge (Warstadt et al., 2023), in the strict-small track. Our final
model, ChapGTP, is a masked language model that was trained for 200 epochs,
aided by a novel data augmentation technique called Automatic Task Formation.
We discuss in detail the performance of this model on the three evaluation
suites: BLiMP, (Super)GLUE, and MSGS. Furthermore, we present a wide range of
methods that were ultimately not included in the model, but may serve as
inspiration for training LMs in low-resource settings.
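As a point of reference for the setup described above, the sketch below shows a generic masked-language-model training run on a small corpus with HuggingFace Transformers. The architecture, tokenizer, corpus file, and hyperparameters are placeholders, and the Automatic Task Formation augmentation itself is not reproduced here.

```python
# Minimal sketch of masked-language-model pre-training on a small corpus.
# Assumptions: HuggingFace Transformers/Datasets, a RoBERTa-style model trained
# from scratch, and a placeholder "babylm_train.txt"; NOT the authors' exact setup.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          RobertaConfig, RobertaForMaskedLM, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # assumption: reuse an off-the-shelf tokenizer
model = RobertaForMaskedLM(RobertaConfig())                # randomly initialised, trained from scratch

# One text segment per line; the strict-small track provides roughly 10M words.
dataset = load_dataset("text", data_files={"train": "babylm_train.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# Dynamic masking: 15% of tokens are replaced by <mask> on the fly.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="mlm-sketch", num_train_epochs=200,
                         per_device_train_batch_size=32, learning_rate=1e-4)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```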
Related papers
- A surprisal oracle for when every layer counts [2.5716627278119444]
Active Curriculum Language Modeling (ACLM) is a learner-directed approach to training a language model.
We propose an updated ACLM process for the BabyLM 2024 task.
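The summary does not spell out how the surprisal oracle orders the data; a common realisation, sketched below, scores each sentence by its mean token surprisal under a small causal LM and presents low-surprisal sentences first. The scoring model and the easy-to-hard ordering are assumptions, not the authors' exact ACLM recipe.

```python
# Minimal sketch of surprisal-based curriculum ordering (not the exact ACLM procedure).
# Assumption: sentences are scored with an off-the-shelf GPT-2 and sorted easy-to-hard.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def mean_surprisal(sentence: str) -> float:
    """Average negative log-probability (in nats) per token under the scoring LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean next-token cross-entropy.
        loss = model(ids, labels=ids).loss
    return loss.item()

corpus = ["The cat sat on the mat.",
          "Colorless green ideas sleep furiously.",
          "The committee adjourned sine die amid procedural wrangling."]

# Easy-to-hard curriculum: present low-surprisal sentences first.
curriculum = sorted(corpus, key=mean_surprisal)
print(curriculum)
```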
arXiv Detail & Related papers (2024-12-04T07:53:45Z)
- Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways [14.480574407610424]
We present Lil-Bevo, our submission to the BabyLM Challenge.
Our baseline models performed above chance, but far below the performance levels of larger LLMs trained on more data.
Our targeted Masked Language Modeling augmentation did not seem to improve model performance in general, but did seem to help on some of the specific BLiMP tasks that we were targeting.
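Targeted masking of this kind can be sketched as masking tokens from a phenomenon-related word list preferentially rather than uniformly at random; the word list and masking rates below are illustrative assumptions, not Lil-Bevo's actual procedure.

```python
# Illustrative sketch of targeted masking: tokens from a phenomenon-related word
# list (here, a few anaphor/NPI items) are masked preferentially instead of
# masking uniformly at random. The word list and rates are assumptions.
import random

random.seed(0)
TARGET_WORDS = {"himself", "herself", "themselves", "any", "ever"}

def targeted_mask(sentence: str, target_rate: float = 0.8, base_rate: float = 0.1) -> str:
    masked = []
    for token in sentence.split():
        rate = target_rate if token.lower() in TARGET_WORDS else base_rate
        masked.append("<mask>" if random.random() < rate else token)
    return " ".join(masked)

print(targeted_mask("The senator convinced himself that nobody would ever notice ."))
```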
arXiv Detail & Related papers (2023-10-26T17:13:07Z)
- BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS).
We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting.
Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
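Few-shot evaluation of this kind typically amounts to prepending a handful of complex-to-simple demonstrations to each test sentence. The sketch below builds such a prompt; the demonstrations, instruction wording, and stand-in model are assumptions rather than the BLESS protocol.

```python
# Minimal sketch of few-shot prompting for sentence simplification.
# The demonstrations, instruction wording, and model are placeholders; BLESS's
# exact prompts and decoding settings are not reproduced here.
from transformers import pipeline

FEW_SHOT = [
    ("The physician administered the medication intravenously.",
     "The doctor gave the medicine through a vein."),
    ("The legislation was enacted subsequent to prolonged deliberation.",
     "The law was passed after a long discussion."),
]

def build_prompt(sentence: str) -> str:
    demos = "\n".join(f"Complex: {c}\nSimple: {s}" for c, s in FEW_SHOT)
    return f"Rewrite each sentence in simpler language.\n{demos}\nComplex: {sentence}\nSimple:"

generator = pipeline("text-generation", model="gpt2")  # stand-in for a large LLM
output = generator(build_prompt("The committee postponed the deliberations indefinitely."),
                   max_new_tokens=30, do_sample=False)
print(output[0]["generated_text"])
```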
arXiv Detail & Related papers (2023-10-24T12:18:17Z)
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
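The second technique can be illustrated with a small sketch: per-domain sampling weights are recomputed from how far each domain's current loss lags behind a reference loss, so lagging domains are sampled more often. The softmax-style update and temperature below are simplified stand-ins for the paper's procedure.

```python
# Simplified sketch of dynamic batch loading: domains whose loss lags furthest
# behind a reference loss get sampled more often in the next batch.
# The softmax-style update is an illustrative stand-in, not the paper's exact rule.
import numpy as np

rng = np.random.default_rng(0)
domains = ["web", "books", "code", "wiki"]
reference_loss = np.array([2.1, 2.3, 1.8, 2.0])   # target losses per domain (assumed given)

def update_sampling_weights(current_loss: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Turn per-domain loss gaps into a sampling distribution over domains."""
    gap = np.maximum(current_loss - reference_loss, 0.0)   # only domains still behind matter
    logits = gap / temperature
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

current_loss = np.array([2.6, 2.4, 1.9, 2.5])      # measured during training
weights = update_sampling_weights(current_loss)

# Compose the next batch of 16 examples according to the updated weights.
next_batch_domains = rng.choice(domains, size=16, p=weights)
print(dict(zip(domains, weights.round(3))), next_batch_domains[:5])
```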
arXiv Detail & Related papers (2023-10-10T15:13:30Z)
- Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty [0.0]
We trained an ensemble consisting of a GPT-2 and small LLaMA models on a developmentally-plausible, 10M-word BabyLM dataset.
We distilled it into a small, 58M-parameter LLaMA model, which outperforms both of its teachers as well as a comparable model trained without distillation.
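A minimal sketch of distilling a student from an ensemble of teachers: teacher logits are averaged, softened with a temperature, and matched by the student through a KL term mixed with the ordinary cross-entropy loss. The toy shapes, temperature, and mixing weight are placeholders, not the submission's actual configuration.

```python
# Minimal sketch of knowledge distillation from an ensemble of teachers.
# Temperature T, mixing weight alpha, and the toy tensor shapes are placeholders;
# the actual Baby Llama configuration is not reproduced here.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, labels, T=2.0, alpha=0.5):
    # Average the teachers' softened probability distributions (the ensemble signal).
    teacher_probs = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list]).mean(dim=0)
    # KL between the softened teacher ensemble and the softened student predictions.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  teacher_probs, reduction="batchmean") * (T * T)
    # Ordinary cross-entropy against the ground-truth next tokens.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    return alpha * kd + (1.0 - alpha) * ce

# Toy shapes: batch of 2 sequences, 5 positions, vocabulary of 100 tokens.
student_logits = torch.randn(2, 5, 100, requires_grad=True)
teacher_logits = [torch.randn(2, 5, 100) for _ in range(2)]   # two teachers, conceptually
labels = torch.randint(0, 100, (2, 5))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```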
arXiv Detail & Related papers (2023-08-03T20:20:01Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
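As a generic illustration of structured pruning (not LLM-Pruner's coupled-structure method), the sketch below scores each attention head of a small Transformer by the norm of its value-projection slice and removes the lowest-scoring ones; the model, the scoring heuristic, and the 10% budget are assumptions.

```python
# Generic structured-pruning sketch: rank attention heads by a simple weight-norm
# heuristic and remove the weakest ones. This is NOT LLM-Pruner's method; the
# model, importance score, and 10% budget are assumptions.
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
cfg = model.config
head_dim = cfg.hidden_size // cfg.num_attention_heads

# Importance of head h in layer l: L2 norm of its slice of the value projection.
scores = []
for l, layer in enumerate(model.encoder.layer):
    value_w = layer.attention.self.value.weight          # shape (hidden, hidden)
    for h in range(cfg.num_attention_heads):
        head_slice = value_w[h * head_dim:(h + 1) * head_dim, :]
        scores.append((head_slice.norm().item(), l, h))

# Prune the 10% lowest-scoring heads.
scores.sort()
budget = len(scores) // 10
heads_to_prune = {}
for _, l, h in scores[:budget]:
    heads_to_prune.setdefault(l, []).append(h)

model.prune_heads(heads_to_prune)                        # physically removes the head weights
print({l: sorted(hs) for l, hs in heads_to_prune.items()})
```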
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
- ZhichunRoad at Amazon KDD Cup 2022: MultiTask Pre-Training for E-Commerce Product Search [4.220439000486713]
We propose a robust multilingual model to improve the quality of search results.
In the pre-training stage, we adopt an MLM task, a classification task, and a contrastive learning task.
In the fine-tuning stage, we use confident learning, an exponential moving average method (EMA), adversarial training (FGM), and a regularized dropout strategy (R-Drop).
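Of these, R-Drop is the easiest to show in isolation: each input is passed through the network twice with dropout active, and a symmetric KL term pulls the two predictive distributions together on top of the usual cross-entropy. The toy classifier and the alpha coefficient below are placeholders.

```python
# Minimal sketch of R-Drop regularisation on a toy classifier: two stochastic
# forward passes per example, symmetric KL between them added to cross-entropy.
# The architecture and alpha coefficient are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 4))
x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))

model.train()                       # dropout must be active so the two passes differ
logits1, logits2 = model(x), model(x)

ce = F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)
kl = 0.5 * (
    F.kl_div(F.log_softmax(logits1, dim=-1), F.softmax(logits2, dim=-1), reduction="batchmean")
    + F.kl_div(F.log_softmax(logits2, dim=-1), F.softmax(logits1, dim=-1), reduction="batchmean")
)
alpha = 1.0
loss = ce + alpha * kl
loss.backward()
```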
arXiv Detail & Related papers (2023-01-31T07:31:34Z)
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
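In spirit, the recipe has a small auxiliary model fill in masked positions and the main model learn to undo those substitutions. The sketch below shows one corruption-and-denoise step; the model choices and 15% mask rate are assumptions, and the full METRO objective is considerably more involved.

```python
# Simplified sketch of denoising training with model-generated corruptions:
# an auxiliary MLM fills masked slots, and the main model learns to recover the
# original tokens. Model choices and the mask rate are assumptions; the full
# METRO recipe is not reproduced here.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilroberta-base")
aux = AutoModelForMaskedLM.from_pretrained("distilroberta-base").eval()  # auxiliary generator
main = AutoModelForMaskedLM.from_pretrained("roberta-base")              # model being pretrained

batch = tok(["The quick brown fox jumps over the lazy dog."], return_tensors="pt")
labels = batch["input_ids"].clone()

# Mask 15% of non-special tokens (force at least one position in this toy example).
special = torch.tensor(
    tok.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True)).bool()
mask = (torch.rand(labels.shape) < 0.15) & ~special.unsqueeze(0)
if not mask.any():
    mask[0, 1] = True
corrupted = labels.clone()
corrupted[mask] = tok.mask_token_id

# The auxiliary model proposes plausible replacements (the "model-generated signal").
with torch.no_grad():
    proposals = aux(input_ids=corrupted, attention_mask=batch["attention_mask"]).logits.argmax(-1)
corrupted[mask] = proposals[mask]

# The main model is trained to denoise: predict the originals at corrupted positions only.
labels[~mask] = -100
loss = main(input_ids=corrupted, attention_mask=batch["attention_mask"], labels=labels).loss
loss.backward()
```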
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
- The USYD-JD Speech Translation System for IWSLT 2021 [85.64797317290349]
This paper describes the University of Sydney & JD's joint submission to the IWSLT 2021 low-resource speech translation task.
We trained our models with the officially provided ASR and MT datasets.
To achieve better translation performance, we explored the most recent effective strategies, including back translation, knowledge distillation, multi-feature reranking and transductive finetuning.
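Back-translation, for instance, turns monolingual target-side text into synthetic parallel data by running it through a reverse-direction model. The sketch below does this with off-the-shelf MarianMT checkpoints for a generic English-German pair; the language pair and models are assumptions, not the task's actual setup.

```python
# Minimal sketch of back-translation: monolingual English target sentences are
# translated into German with a reverse-direction model, yielding synthetic
# German->English training pairs. The en/de pair and checkpoints are assumptions.
from transformers import MarianMTModel, MarianTokenizer

reverse_name = "Helsinki-NLP/opus-mt-en-de"          # target -> source direction
tok = MarianTokenizer.from_pretrained(reverse_name)
reverse_model = MarianMTModel.from_pretrained(reverse_name)

monolingual_targets = [
    "The committee will meet again next week.",
    "Heavy rain is expected along the coast tomorrow.",
]

inputs = tok(monolingual_targets, return_tensors="pt", padding=True)
generated = reverse_model.generate(**inputs, max_new_tokens=64, num_beams=4)
synthetic_sources = tok.batch_decode(generated, skip_special_tokens=True)

# Synthetic parallel corpus: (machine-translated source, original human target).
for src, tgt in zip(synthetic_sources, monolingual_targets):
    print(f"{src}\t{tgt}")
```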
arXiv Detail & Related papers (2021-07-24T09:53:34Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trained MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
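One way to probe this claim is to destroy word order while preserving co-occurrence statistics, for example by shuffling the tokens of each training sentence before pre-training. The sketch below applies such a perturbation; whitespace tokenisation and whole-sentence shuffling are simplifying assumptions relative to the paper's controlled manipulations.

```python
# Small sketch of a word-order-destroying corpus perturbation: tokens are shuffled
# within each sentence, preserving unigram co-occurrence statistics but not syntax.
# Whitespace tokenisation and whole-sentence shuffling are simplifying assumptions.
import random

random.seed(0)

def shuffle_sentence(sentence: str) -> str:
    tokens = sentence.split()
    random.shuffle(tokens)
    return " ".join(tokens)

corpus = [
    "the children were reading stories in the garden",
    "a surprisingly small model can learn these patterns",
]
shuffled_corpus = [shuffle_sentence(s) for s in corpus]
for original, shuffled in zip(corpus, shuffled_corpus):
    print(f"{original}  ->  {shuffled}")
```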
arXiv Detail & Related papers (2021-04-14T06:30:36Z)