LLMCO2: Advancing Accurate Carbon Footprint Prediction for LLM Inferences
- URL: http://arxiv.org/abs/2410.02950v1
- Date: Thu, 3 Oct 2024 19:48:45 GMT
- Title: LLMCO2: Advancing Accurate Carbon Footprint Prediction for LLM Inferences
- Authors: Zhenxiao Fu, Fan Chen, Shan Zhou, Haitong Li, Lei Jiang
- Abstract summary: Estimating the carbon footprint of large language model (LLM) inferences is more complex than estimating that of training.
LLMCO2 is a graph neural network (GNN)-based model that greatly improves the accuracy of LLM inference carbon footprint predictions.
- Score: 7.137654106298203
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Throughout its lifecycle, a large language model (LLM) generates a substantially larger carbon footprint during inference than during training. LLM inference requests vary in batch size, prompt length, and number of generated tokens, while cloud providers employ different GPU types and quantities to meet diverse service-level objectives for accuracy and latency. It is crucial for both users and cloud providers to have a tool that quickly and accurately estimates the carbon impact of LLM inferences based on a combination of inference request and hardware configurations before execution. Estimating the carbon footprint of LLM inferences is more complex than doing so for training, due to lower and highly variable model FLOPS utilization, which renders previous equation-based models inaccurate. Additionally, existing machine learning (ML) prediction methods either lack accuracy or demand extensive training data, as they inadequately handle the distinct prefill and decode phases, overlook hardware-specific features, and inefficiently sample uncommon inference configurations. We introduce LLMCO2, a graph neural network (GNN)-based model that greatly improves the accuracy of LLM inference carbon footprint predictions compared to previous methods.
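For intuition, here is a minimal sketch of the equation-based style of estimator the abstract argues breaks down for inference: it converts request FLOPs into runtime via an assumed model FLOPS utilization (MFU), then into energy and grams of CO2eq. Every constant below (peak throughput, power draw, MFU, PUE, grid intensity) is an illustrative assumption, not a value from the paper.

```python
# Illustrative equation-based carbon estimator for one LLM inference request.
# All constants are assumptions for this sketch, not numbers from the paper.

def inference_flops(n_params: float, prompt_len: int, gen_len: int, batch: int) -> float:
    """Rough ~2*N FLOPs per processed token (prefill + decode)."""
    tokens = batch * (prompt_len + gen_len)
    return 2.0 * n_params * tokens

def carbon_grams(flops: float,
                 peak_flops: float = 312e12,  # assumed A100 BF16 peak
                 mfu: float = 0.10,           # assumed utilization; highly variable
                 gpu_power_w: float = 400.0,  # assumed board power
                 pue: float = 1.2,            # assumed datacenter overhead
                 grid_gco2_per_kwh: float = 400.0) -> float:
    seconds = flops / (peak_flops * mfu)
    kwh = gpu_power_w * pue * seconds / 3.6e6  # W*s -> kWh
    return kwh * grid_gco2_per_kwh

f = inference_flops(n_params=7e9, prompt_len=512, gen_len=256, batch=8)
print(f"{carbon_grams(f):.3f} gCO2eq")
```

Because the true MFU differs sharply between the prefill and decode phases and across batch sizes and GPUs, a single fixed mfu makes this style of model inaccurate; that gap is what the GNN-based predictor targets.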
Related papers
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific losses and downstream performance.
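As a rough, synthetic-data illustration of that second step (shapes, domains, and values are invented, not taken from the paper), one could fit such a mapping with scikit-learn:

```python
# Hypothetical sketch: map per-domain pre-training losses to a downstream
# metric with a two-layer network. Data here is synthetic, for illustration.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# 40 (model, checkpoint) points, 3 domain losses each (e.g. web, code, books)
domain_losses = rng.uniform(1.5, 4.0, size=(40, 3))
# Pretend downstream accuracy falls non-linearly as losses rise
downstream = 1.0 / (1.0 + np.exp(domain_losses.mean(axis=1) - 2.5))

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
model.fit(domain_losses, downstream)
print(model.predict([[2.0, 2.2, 1.9]]))  # predicted downstream score
```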
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
- Getting the most out of your tokenizer for pre-training and domain adaptation [26.427537023771844]
We show that the size, pre-tokenization regular expression, and training data of a tokenizer can significantly impact the model's generation speed.
We specialize the tokenizer of a pre-trained LLM to obtain large gains in generation speed and effective context size.
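A quick way to see the effect (a sketch: the model names are arbitrary public examples and sample.txt is any representative text file you supply): count tokens per character under two tokenizers. Fewer tokens per character means the same text generates faster and more of it fits in a fixed context window.

```python
# Sketch: compare how densely two tokenizers encode the same text.
# "gpt2" and "bert-base-uncased" are arbitrary public examples.
from transformers import AutoTokenizer

text = open("sample.txt").read()  # any representative corpus slice
for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok(text)["input_ids"])
    print(f"{name}: {n_tokens} tokens, {len(text) / n_tokens:.2f} chars/token")
```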
arXiv Detail & Related papers (2024-02-01T21:49:34Z)
- LLMCarbon: Modeling the end-to-end Carbon Footprint of Large Language Models [7.132822974156601]
The carbon footprint of large language models (LLMs) is a significant concern, encompassing emissions from their training, inference, experimentation, and storage processes.
We introduce LLMCarbon, an end-to-end carbon footprint projection model designed for both dense and MoE LLMs.
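In that spirit, a back-of-the-envelope training-side projection looks like the following (a sketch with assumed constants, not LLMCarbon's calibrated model; for an MoE LLM one would count activated rather than total parameters):

```python
# Illustrative projection of *training* carbon from model and data size.
# Constants are assumptions, not LLMCarbon's calibrated values.

def training_carbon_kg(n_params: float, n_tokens: float,
                       peak_flops: float = 312e12,  # assumed per-GPU peak
                       mfu: float = 0.40,           # training MFU is far steadier
                       gpu_power_w: float = 400.0,
                       pue: float = 1.1,
                       grid_gco2_per_kwh: float = 400.0) -> float:
    # For MoE, pass the *activated* parameter count per token as n_params.
    flops = 6.0 * n_params * n_tokens            # standard ~6*N*D estimate
    gpu_seconds = flops / (peak_flops * mfu)
    kwh = gpu_power_w * pue * gpu_seconds / 3.6e6
    return kwh * grid_gco2_per_kwh / 1000.0

print(f"{training_carbon_kg(7e9, 1e12):.0f} kgCO2eq")  # 7B model, 1T tokens
```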
arXiv Detail & Related papers (2023-09-25T14:50:04Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
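The bookkeeping behind such an equal-compute protocol is simple; a sketch with invented throughput numbers (not the paper's measurements):

```python
# Sketch of the equal-compute comparison: fix a budget in accelerator hours,
# then each model's token budget follows from its measured throughput.
budget_hours = 96.0

models = {
    "feed-forward (GPT-2 style)": 21_000,    # assumed tokens/sec on a reference GPU
    "novel LSTM (10x throughput)": 210_000,  # assumed tokens/sec
}
for name, tok_per_s in models.items():
    tokens = tok_per_s * budget_hours * 3600
    print(f"{name}: {tokens / 1e9:.1f}B tokens within the {budget_hours}h budget")
```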
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales [65.01417261415833]
We present an approach to predict the pre-training loss based on our observations that Maximal Update Parametrization (muP) enables accurate fitting of scaling laws.
With around 14% of the one-time pre-training cost, we can accurately forecast the loss for models up to 52B.
Our goal with nanoLM is to empower researchers with limited resources to reach meaningful conclusions on large models.
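A sketch of the loss-forecasting step (synthetic points; the coefficients and the 52B target are illustrative, and muP's contribution is that hyperparameters transfer across scales, which is what makes small-model fits trustworthy):

```python
# Sketch: fit a power law on losses from small proxy models, then extrapolate.
# All data below is synthetic; coefficients are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

n_params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])      # small proxy models
loss = power_law(n_params, 8.0, 0.08, 1.7) \
       + 0.01 * np.random.default_rng(1).standard_normal(5)

(a, b, c), _ = curve_fit(power_law, n_params, loss, p0=(5.0, 0.1, 1.0),
                         maxfev=20000)
print(f"forecast loss at 52B params: {power_law(52e9, a, b, c):.3f}")
```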
arXiv Detail & Related papers (2023-04-14T00:45:01Z)
- Sequential Learning Of Neural Networks for Prequential MDL [18.475866691786695]
We evaluate approaches for computing prequential description lengths for image classification datasets with neural networks.
Considering the computational cost, we find that online learning with rehearsal performs favorably.
We present description lengths for a suite of image classification datasets that improve upon previously reported results by large margins.
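Conceptually, a prequential description length encodes each data block with the model trained only on earlier blocks, then updates the model. A runnable sketch (SGDClassifier and the synthetic data stand in for the paper's neural networks and image datasets):

```python
# Sketch of a prequential (online) description length on synthetic data.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = SGDClassifier(loss="log_loss", random_state=0)
classes = np.array([0, 1])
bits = 0.0
for Xb, yb in zip(np.array_split(X, 10), np.array_split(y, 10)):
    if hasattr(model, "coef_"):               # model has seen earlier blocks
        p = model.predict_proba(Xb)[np.arange(len(yb)), yb]
        bits += -np.log2(np.clip(p, 1e-12, 1.0)).sum()
    else:
        bits += len(yb) * 1.0                 # uniform code for the first block
    model.partial_fit(Xb, yb, classes=classes)  # online update (no rehearsal here)
print(f"prequential codelength: {bits:.0f} bits")
```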
arXiv Detail & Related papers (2022-10-14T16:30:23Z)
- LEAPER: Modeling Cloud FPGA-based Systems via Transfer Learning [13.565689665335697]
We propose LEAPER, a transfer learning-based approach for FPGA-based systems that adapts an existing ML-based model to a new, unknown environment.
Results show that our approach delivers, on average, 85% accuracy when we use our transferred model for prediction in a cloud environment with 5-shot learning.
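Not LEAPER's actual method, but a minimal flavor of few-shot model transfer: train a base performance model on the source environment, then adapt it to a new environment with five labeled points via a learned correction (all data below is synthetic):

```python
# Sketch: 5-shot adaptation of a performance model via residual correction.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
w = np.array([3, 1, 0, 2, 0.5, 1])
X_src = rng.uniform(size=(500, 6))            # design features (assumed)
y_src = X_src @ w + rng.normal(0, 0.1, 500)

base = GradientBoostingRegressor().fit(X_src, y_src)

X_tgt = rng.uniform(size=(5, 6))              # the 5 "shots" from the new environment
y_tgt = 1.3 * (X_tgt @ w) + 0.4               # affinely shifted target environment

corr = LinearRegression().fit(base.predict(X_tgt).reshape(-1, 1), y_tgt)
predict = lambda X: corr.predict(base.predict(X).reshape(-1, 1))
print(predict(rng.uniform(size=(3, 6))))      # adapted predictions
```

A linear correction on the base model's outputs is enough here because the target environment is an affine shift of the source; real cross-environment gaps are messier, which is why LEAPER learns the transfer.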
arXiv Detail & Related papers (2022-08-22T21:25:56Z)
- CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques that address the efficiency of PLM pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
- Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613]
We take one of the simplest inference methods, truncated max-product belief propagation, and add what is necessary to make it a proper component of a deep learning model.
This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs).
The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, optical flow and semantic segmentation.
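For reference, the underlying primitive is min-sum message passing (max-product in negative-log space). A sketch on a 1-D chain with a truncated linear pairwise penalty, with random costs standing in for CNN unaries (this illustrates the primitive, not the paper's learned layer):

```python
# Min-sum belief propagation on a 1-D chain of n variables with L labels.
import numpy as np

def chain_bp(unary, lam=1.0, tau=3.0):
    n, L = unary.shape
    d = np.abs(np.arange(L)[:, None] - np.arange(L)[None, :])
    pair = lam * np.minimum(d, tau)        # truncated linear pairwise penalty
    fwd = np.zeros((n, L))
    bwd = np.zeros((n, L))
    for i in range(1, n):                  # left-to-right messages
        fwd[i] = np.min(unary[i - 1] + fwd[i - 1] + pair, axis=1)
    for i in range(n - 2, -1, -1):         # right-to-left messages
        bwd[i] = np.min(unary[i + 1] + bwd[i + 1] + pair, axis=1)
    beliefs = unary + fwd + bwd
    return beliefs.argmin(axis=1)          # per-variable labeling

print(chain_bp(np.random.default_rng(0).uniform(size=(16, 8))))
```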
arXiv Detail & Related papers (2020-03-13T13:11:35Z)