Matmul or No Matmal in the Era of 1-bit LLMs
- URL: http://arxiv.org/abs/2408.11939v2
- Date: Wed, 28 Aug 2024 19:51:04 GMT
- Title: Matmul or No Matmal in the Era of 1-bit LLMs
- Authors: Jinendra Malekar, Mohammed E. Elbtity, Ramtin Zand,
- Abstract summary: 1-bit large language models (LLMs) have attracted considerable attention and opened up new research opportunities.
However, 1-bit LLMs only improve a fraction of models by applying extreme quantization to the projection layers.
In this work, we present an adaptation of Amdahl's Law tailored for the 1-bit LLM context.
- Score: 0.48212500317840945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of 1-bit large language models (LLMs) has attracted considerable attention and opened up new research opportunities. However, 1-bit LLMs only improve a fraction of models by applying extreme quantization to the projection layers while leaving attention heads unchanged. Therefore, to avoid fundamentally wrong choices of goals in future research, it is crucial to understand the actual improvements in computation and memory usage that 1-bit LLMs can deliver. In this work, we present an adaptation of Amdahl's Law tailored for the 1-bit LLM context, which illustrates how partial improvements in 1-bit LLMs impact overall model performance. Through extensive experiments, we uncover key nuances across different model architectures and hardware configurations, offering a roadmap for future research in the era of 1-bit LLMs.
Related papers
- Performance Law of Large Language Models [58.32539851241063]
Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources.
Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.
arXiv Detail & Related papers (2024-08-19T11:09:12Z) - An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs [54.91212829143966]
This study explores LLaMA3's capabilities when quantized to low bit-width.
We evaluate 10 existing post-training quantization and LoRA-finetuning methods of LLaMA3 on 1-8 bits and diverse datasets.
Our experimental results indicate that LLaMA3 still suffers non-negligent degradation in linguistic and visual contexts.
arXiv Detail & Related papers (2024-04-22T10:03:03Z) - The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits [129.6765656933016]
We introduce a 1-bit Large Language Models (LLMs) variant, namely BitNet b1.58.
The 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs.
It enables a new paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
arXiv Detail & Related papers (2024-02-27T18:56:19Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward [29.81212051279456]
Recent advancements in model compression and system-level optimization methods aim to enhance LLM inference.
This survey offers an overview of these methods, emphasizing recent developments.
arXiv Detail & Related papers (2024-02-02T06:29:34Z) - OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models [44.848642930797155]
We release OpenMoE, a series of fully open-sourced and reproducible decoder-only Mixture-of-Experts (MoE) based large language models (LLMs)
Our investigation confirms that MoE-based LLMs can offer a more favorable cost-effectiveness trade-off than dense LLMs.
We find that routing decisions in MoE models are predominantly based on token IDs, with minimal context relevance.
arXiv Detail & Related papers (2024-01-29T12:05:02Z) - Knowledge Fusion of Large Language Models [73.28202188100646]
This paper introduces the notion of knowledge fusion for large language models (LLMs)
We externalize their collective knowledge and unique strengths, thereby elevating the capabilities of the target model beyond those of any individual source LLM.
Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation.
arXiv Detail & Related papers (2024-01-19T05:02:46Z) - Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.