Related papers: LIMA: Less Is More for Alignment

LIMA: Less Is More for Alignment

URL: http://arxiv.org/abs/2305.11206v1
Date: Thu, 18 May 2023 17:45:22 GMT
Title: LIMA: Less Is More for Alignment
Authors: Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, Omer Levy
Abstract summary: We train LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases.
Score: 112.93890201395477
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback. Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.

Related papers

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora [84.03928547166873]
Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. The BabyLM Challenge is a communal effort in which participants compete to optimize language model training on a fixed data budget.
arXiv Detail & Related papers (2025-04-10T23:22:43Z)
LIMO: Less is More for Reasoning [23.312893016642096]
We demonstrate that sophisticated mathematical reasoning can emerge with only a few examples.<n>Our model, LIMO, achieves 63.3% accuracy on AIME24 and 95.6% on MATH500.<n>LIMO exhibits strong out-of-distribution generalization, achieving a 45.8% absolute improvement across diverse benchmarks.
arXiv Detail & Related papers (2025-02-05T17:23:45Z)
Training Dynamics of a 1.7B LLaMa Model: A Data-Efficient Approach [10.39475177812483]
We share insights gained from training DMaS-LLaMa-Lite on approximately 20 billion tokens of data. We chronicle the full training trajectory, documenting how evolving validation loss levels and downstream benchmarks reflect transitions from incoherent text to fluent, contextually grounded output. By detailing these experiences and offering training logs, checkpoints, and sample outputs, we aim to guide future researchers and practitioners in refining their pretraining strategies.
arXiv Detail & Related papers (2024-12-17T21:15:52Z)
What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy. By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress. LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset. Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
InternLM2 Technical Report [159.70692271378581]
This paper introduces InternLM2, an open-source Large Language Models (LLMs) that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-03-26T00:53:24Z)
An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales. We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training. Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
Investigating Pre-trained Language Models on Cross-Domain Datasets, a Step Closer to General AI [0.8889304968879164]
We investigate the ability of pre-trained language models to generalize to different non-language tasks. The four pre-trained models that we used, T5, BART, BERT, and GPT-2 achieve outstanding results.
arXiv Detail & Related papers (2023-06-21T11:55:17Z)
UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks [60.46473247205654]
Using large-scale unsupervised unimodal models as pre-training can enhance the zero-shot performance of image-text pair models. Our experiments show that unimodal pre-training outperforms state-of-the-art CLIP-based models.
arXiv Detail & Related papers (2023-06-07T18:26:22Z)
Instruction Tuned Models are Quick Learners [20.771930945083994]
In this work, we demonstrate the sample efficiency of instruction tuned models over various tasks. In the STL setting, instruction tuned models equipped with 25% of the downstream train data surpass the SOTA performance on the downstream tasks. In the MTL setting, an instruction tuned model trained on only 6% of downstream training data achieve SOTA, while using 100% of the training data results in a 3.69% points improvement.
arXiv Detail & Related papers (2023-05-17T22:30:01Z)
Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training [18.640076155697415]
We present a study of a curriculum learning based approach, which helps improve the pre-training convergence speed of autoregressive models. Our evaluations demonstrate that curriculum learning enables training GPT-2 models with 8x larger batch size and 4x larger learning rate.
arXiv Detail & Related papers (2021-08-13T06:32:53Z)
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [25.430130072811075]
We propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models. It fuses auto-regressive network and auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks. We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph.
arXiv Detail & Related papers (2021-07-05T16:54:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.