Related papers: FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation

URL: http://arxiv.org/abs/2407.07093v1
Date: Tue, 9 Jul 2024 17:59:48 GMT
Title: FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation
Authors: Liqun Ma, Mingjie Sun, Zhiqiang Shen
Abstract summary: This work presents a Fully BInarized Large Language Model (FBI-LLM). It demonstrates for the first time how to train a large-scale binary language model from scratch.
Score: 32.01836613286288
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This work presents a Fully BInarized Large Language Model (FBI-LLM), demonstrating for the first time how to train a large-scale binary language model from scratch (not a partially binary or ternary LLM like BitNet b1.58) to match the performance of its full-precision counterparts (e.g., FP16 or BF16) in transformer-based LLMs. It achieves this by employing an autoregressive distillation (AD) loss while maintaining the same model dimensions (130M, 1.3B, 7B) and training data volume as regular LLM pretraining, delivering competitive results in terms of perplexity and task-specific effectiveness. Intriguingly, by analyzing the training trajectory, we find that pretrained weights are not necessary for training binarized LLMs from scratch. This research encourages a new computational framework and may facilitate the future design of specialized hardware tailored for fully 1-bit LLMs. We make all models, code, and the training dataset fully accessible and transparent to support further research (Code: https://github.com/LiqunMa/FBI-LLM ; Model: https://huggingface.co/LiqunMa/).

Related papers

GPT Carry-On: Training Foundation Model for Customization Could Be Simple, Scalable and Affordable [1.79487674052027]
We propose a framework to take full advantage of existing large language foundation models (LLM). We train an additional branch of transformer blocks on the final-layer embedding of pretrained LLMs, which is the base; a carry-on module then merges the base models to compose a customized LLM. As the base model doesn't need to update parameters, we are able to outsource most computation of the training job to inference nodes.
arXiv Detail & Related papers (2025-04-10T07:15:40Z)
Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations. Post-training pruning methods are proposed to prune LLMs in one-shot without retraining. Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z)
The Future of Large Language Model Pre-training is Federated [15.237418036900582]
We propose a scalable deployment system called Photon to enable the investigation and development of this new training paradigm for LLM pre-training. We show that Photon can be used by organizations interested in collaborating with their private data sources and computational resources for pre-training LLMs with billions of parameters. We further show the effectiveness of the federated training scales with model size and present our approach for training billion-scale federated LLMs using limited resources.
arXiv Detail & Related papers (2024-05-17T15:27:52Z)
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits [129.6765656933016]
We introduce a 1-bit Large Language Models (LLMs) variant, namely BitNet b1.58. The 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs. It enables a new paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
arXiv Detail & Related papers (2024-02-27T18:56:19Z)
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models. It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN). At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach to fine-tune large language models (LLMs). Inspired by Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs. Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and opens new avenues to scale the great potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z)
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM. For learning methods, we explore the claim of a "free lunch" hypothesis. For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework [10.656788279434798]
We propose a simple and efficient learning framework, TLM, that does not rely on large-scale pretraining. On eight classification datasets in four domains, TLM achieves results better than or similar to pretrained language models.
arXiv Detail & Related papers (2021-11-07T17:13:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.