An Inquiry into Datacenter TCO for LLM Inference with FP8
- URL: http://arxiv.org/abs/2502.01070v3
- Date: Tue, 29 Apr 2025 10:17:15 GMT
- Title: An Inquiry into Datacenter TCO for LLM Inference with FP8
- Authors: Jiwoo Kim, Joonhyung Lee, Gunho Park, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee, Youngjoo Lee
- Abstract summary: We analyze the computational characteristics and constraints of large language model (LLM) inference from a TCO perspective. We present a generalizable framework that enables CSPs to compare and select AI accelerators according to diverse operational requirements.
- Score: 7.910301381209274
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) continue to scale, their inference demands present significant challenges, particularly due to the high power consumption of AI accelerators in datacenters. These facilities require specialized cooling and power management systems, substantially increasing the total cost of ownership (TCO) for cloud service providers (CSPs). In this work, we analyze the computational characteristics and constraints of LLM inference from a TCO perspective, focusing on two representative accelerators: the Gaudi 2 and NVIDIA H100. We present a generalizable framework that enables CSPs to compare and select AI accelerators according to diverse operational requirements. Using this model, we analyze the impact of FP8 precision and LLM inference workload characteristics as key factors influencing TCO. We investigate FP8 quantization, which is gaining adoption in LLM training, as a technique to improve inference throughput while maintaining cost efficiency. Furthermore, our analysis of LLM inference workloads reveals that performance on thin GEMMs, which dominate the decode phase, can have a greater impact than theoretical hardware peak performance. By studying the interaction between power consumption, quantization strategies, and hardware architecture, we offer insights that support informed deployment decisions and guide future accelerator designs to improve the TCO of LLM inference.
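To make the abstract's framing concrete, here is a minimal sketch of the kind of TCO comparison the paper describes: amortized accelerator cost plus energy cost, divided by tokens served. All class names, parameter values, and the FP8 speedup factor below are illustrative assumptions, not figures from the paper; in practice the throughput input should be measured decode throughput, which, as the abstract notes, is governed by thin-GEMM performance rather than peak FLOPS.

```python
from dataclasses import dataclass

@dataclass
class Accelerator:
    name: str
    capex_usd: float        # purchase price per card (assumed)
    power_kw: float         # sustained board power during inference (assumed)
    tokens_per_sec: float   # measured decode throughput for the target model (assumed)

def cost_per_million_tokens(acc: Accelerator,
                            lifetime_years: float = 4.0,
                            utilization: float = 0.6,
                            usd_per_kwh: float = 0.10,
                            pue: float = 1.3) -> float:
    """Amortized capex plus energy cost per one million generated tokens."""
    active_seconds = lifetime_years * 365 * 24 * 3600 * utilization
    tokens_served = acc.tokens_per_sec * active_seconds
    energy_kwh = acc.power_kw * pue * (active_seconds / 3600)
    total_cost = acc.capex_usd + energy_kwh * usd_per_kwh
    return total_cost / tokens_served * 1e6

# Hypothetical comparison: FP8 modeled purely as a decode-throughput multiplier.
bf16 = Accelerator("accel-bf16", capex_usd=30_000, power_kw=0.7, tokens_per_sec=2_000)
fp8  = Accelerator("accel-fp8",  capex_usd=30_000, power_kw=0.7, tokens_per_sec=2_000 * 1.4)
for a in (bf16, fp8):
    print(f"{a.name}: ${cost_per_million_tokens(a):.3f} per 1M tokens")
```

A fuller model would add cooling, networking, host servers, and software amortization, but the structure stays the same: anything that raises effective tokens per second or lowers power per token lowers TCO per token.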
Related papers
- Deploying Large AI Models on Resource-Limited Devices with Split Federated Learning [39.73152182572741]
This paper proposes a novel framework named Quantized Split Federated Fine-Tuning Large AI Model (SFLAM).
By partitioning the training load between edge devices and servers, SFLAM can facilitate the operation of large models on devices.
SFLAM incorporates quantization management, power control, and bandwidth allocation strategies to enhance training efficiency.
arXiv Detail & Related papers (2025-04-12T07:55:11Z) - Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency [6.306413686006502]
We conduct a comprehensive analysis of 28 quantized Large Language Models (LLMs) from the Ollama library.
We evaluate energy efficiency, inference performance, and output accuracy across multiple quantization levels and task types.
Our findings reveal the trade-offs between energy efficiency, inference speed, and accuracy in different quantization settings.
arXiv Detail & Related papers (2025-04-04T11:29:30Z) - Cost-Optimal Grouped-Query Attention for Long-Context LLMs [64.90662568387683]
Building effective Transformer-based large language models (LLMs) has recently become a research focus.
We compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost.
Our studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs (a KV-cache sizing sketch after this list illustrates the memory side of this trade-off).
arXiv Detail & Related papers (2025-03-12T17:50:42Z) - DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks.
We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge.
Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z) - Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z) - Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings [1.5749416770494706]
Large language models (LLMs) have shown significant improvements in many natural language processing (NLP) tasks.
LLMs are resource-intensive, requiring extensive computational resources both during training and inference.
As their adoption accelerates, the sustainability of LLMs has become a critical issue.
arXiv Detail & Related papers (2025-01-14T16:02:33Z) - Federated Fine-Tuning of LLMs: Framework Comparison and Research Directions [59.5243730853157]
Federated learning (FL) provides a privacy-preserving solution for fine-tuning pre-trained large language models (LLMs) using distributed private datasets.
This article conducts a comparative analysis of three advanced federated LLM (FedLLM) frameworks that integrate knowledge distillation (KD) and split learning (SL) to mitigate the memory, communication, and computation burdens of federated fine-tuning.
arXiv Detail & Related papers (2025-01-08T11:37:06Z) - eFedLLM: Efficient LLM Inference Based on Federated Learning [1.6179784294541053]
Large Language Models (LLMs) herald a transformative era in artificial intelligence (AI).
This paper introduces an effective approach that enhances the operational efficiency and affordability of LLM inference.
arXiv Detail & Related papers (2024-11-24T22:50:02Z) - "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, as well as for asynchronous deployment on mid-tier architectures; a sketch of W4A16-style weight-only quantization appears after this list.
arXiv Detail & Related papers (2024-11-04T18:21:59Z) - Scaling FP8 training to trillion-token LLMs [26.195547788434908]
We train large language models using FP8 precision on datasets up to 2 trillion tokens.
We uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations.
We introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering the function's behavior.
arXiv Detail & Related papers (2024-09-19T07:15:58Z) - Interpreting and Improving Large Language Models in Arithmetic Calculation [72.19753146621429]
Large language models (LLMs) have demonstrated remarkable potential across numerous applications.
In this work, we delve into uncovering a specific mechanism by which LLMs execute calculations.
We investigate the potential benefits of selectively fine-tuning these essential heads/MLPs to boost the LLMs' computational performance.
arXiv Detail & Related papers (2024-09-03T07:01:46Z) - FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, retaining up to 85% of its performance while delivering over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z) - Assessing Economic Viability: A Comparative Analysis of Total Cost of Ownership for Domain-Adapted Large Language Models versus State-of-the-art Counterparts in Chip Design Coding Assistance [10.364901568556435]
This paper presents a comparative analysis of total cost of ownership (TCO) and performance between domain-adapted large language models (LLM) and state-of-the-art (SoTA) LLMs.
arXiv Detail & Related papers (2024-04-12T23:37:56Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - FP8-LM: Training FP8 Large Language Models [47.17804713425323]
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models.
Experiment results show that, during the training of GPT-175B model on H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
arXiv Detail & Related papers (2023-10-27T17:59:51Z) - ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats [25.543571445739936]
This study explores the viability of floating-point (FP) quantization for large language models (LLMs).
For LLMs, FP8 activation quantization consistently outperforms its integer (INT8) equivalent, with the performance edge becoming more noticeable in models with more than one billion parameters.
For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
arXiv Detail & Related papers (2023-07-19T06:58:03Z) - FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings: E4M3 (4-bit exponent, 3-bit mantissa) and E5M2 (5-bit exponent, 2-bit mantissa).
E4M3's dynamic range is extended by not representing infinities and by reserving only a single mantissa bit pattern for NaN; the dynamic-range sketch after this list makes the resulting limits concrete.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
arXiv Detail & Related papers (2022-09-12T17:39:55Z) - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for the feed-forward and attention projection layers in transformers; a minimal absmax-quantization sketch after this list shows the core idea.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
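Referring back to the "Cost-Optimal Grouped-Query Attention for Long-Context LLMs" entry above: the memory side of the reported trade-off follows from the standard KV-cache size formula, sketched below. The model dimensions are illustrative assumptions, not values from that paper.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Keys and values cached per token: 2 * n_kv_heads * head_dim elements per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 7B-class model: 32 layers, head_dim = 128, 128k-token context, FP16 cache.
for n_kv in (32, 8, 4):  # full multi-head attention vs. grouped-query configurations
    gib = kv_cache_bytes(n_layers=32, n_kv_heads=n_kv, head_dim=128, seq_len=128_000) / 2**30
    print(f"{n_kv:2d} KV heads -> {gib:6.1f} GiB of KV cache")
```

Fewer KV heads shrink the cache (and the memory traffic of decode-phase attention) proportionally, which is why head configuration matters as much as parameter count for long-context serving cost.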
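For the quantization-format comparison in the "Give Me BF16 or Give Me Death?" entry, the sketch below simulates the weight side of a W4A16 scheme: symmetric group-wise absmax rounding of weights to 4-bit integers, with activations left in 16-bit floating point. Group size and tensor shapes are illustrative assumptions.

```python
import numpy as np

def quantize_w4_groupwise(w: np.ndarray, group_size: int = 128):
    """Symmetric 4-bit absmax quantization, one scale per group along the last axis."""
    rows, cols = w.shape
    assert cols % group_size == 0
    groups = w.reshape(rows, cols // group_size, group_size)
    scale = np.abs(groups).max(axis=-1, keepdims=True) / 7.0  # map the group absmax to +/-7
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 1024)).astype(np.float32)
q, s = quantize_w4_groupwise(w)
w_hat = dequantize(q, s, w.shape)
print("mean |weight error|:", float(np.abs(w - w_hat).mean()))
# At inference time the dequantized weights multiply 16-bit activations,
# which is what the "A16" half of W4A16 refers to.
```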
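The dynamic-range sketch promised in the "FP8 Formats for Deep Learning" entry: the limits of E4M3 and E5M2 follow directly from their bit layouts and the special-value conventions described there (E4M3 gives up infinities and keeps a single NaN pattern; E5M2 stays IEEE-like). The helper below is a simplified derivation, not an implementation of any particular library's FP8 type.

```python
def fp8_limits(exp_bits: int, man_bits: int, ieee_special_values: bool):
    """Largest finite value and smallest subnormal for a simple FP8 encoding.

    ieee_special_values=True  -> top exponent code reserved for inf/NaN (E5M2).
    ieee_special_values=False -> no infinities; at the top exponent only the
                                 all-ones mantissa is NaN (E4M3).
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_special_values:
        max_exp = (2 ** exp_bits - 2) - bias       # top code unusable for finite values
        max_mantissa = 2 - 2 ** (-man_bits)        # 1.11...1
    else:
        max_exp = (2 ** exp_bits - 1) - bias       # top code still encodes finite values
        max_mantissa = 2 - 2 ** (-(man_bits - 1))  # 1.11...0 (all-ones is the NaN pattern)
    largest_finite = max_mantissa * 2.0 ** max_exp
    smallest_subnormal = 2.0 ** (1 - bias - man_bits)
    return largest_finite, smallest_subnormal

for name, e, m, ieee in (("E4M3", 4, 3, False), ("E5M2", 5, 2, True)):
    hi, lo = fp8_limits(e, m, ieee)
    print(f"{name}: max finite = {hi}, smallest subnormal = {lo}")
# Expected: E4M3 reaches 448 with finer precision; E5M2 reaches 57344 with a wider range.
```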
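Finally, the absmax-quantization sketch referenced in the LLM.int8() entry: vector-wise symmetric Int8 quantization of activations and weights, an integer matmul with 32-bit accumulation, and a rescale back to floating point. This shows only the core mechanism; the outlier decomposition that LLM.int8() adds for large models is omitted, and all tensor shapes are illustrative.

```python
import numpy as np

def absmax_quantize(x: np.ndarray, axis: int):
    """Symmetric Int8 quantization with one scale per row (axis=1) or column (axis=0)."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a: np.ndarray, w: np.ndarray) -> np.ndarray:
    qa, sa = absmax_quantize(a, axis=1)              # one scale per activation row
    qw, sw = absmax_quantize(w, axis=0)              # one scale per weight column
    acc = qa.astype(np.int32) @ qw.astype(np.int32)  # integer accumulation
    return acc * (sa * sw)                           # rescale to floating point

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 512)).astype(np.float32)     # a "thin" decode-phase activation batch
w = rng.normal(size=(512, 256)).astype(np.float32)
err = np.abs(a @ w - int8_matmul(a, w)).mean()
print("mean |error| vs. FP32 matmul:", float(err))
```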