VecTrans: LLM Transformation Framework for Better Auto-vectorization on High-performance CPU
- URL: http://arxiv.org/abs/2503.19449v1
- Date: Tue, 25 Mar 2025 08:39:35 GMT
- Title: VecTrans: LLM Transformation Framework for Better Auto-vectorization on High-performance CPU
- Authors: Zhongchun Zheng, Long Cheng, Lu Li, Rodrigo C. O. Rocha, Tianyi Liu, Wei Wei, Xianwei Zhang, Yaoqing Gao
- Abstract summary: VecTrans is a framework that leverages large language models to enhance compiler-based code vectorization. Of the 50 TSVC functions that Clang, GCC, and BiShengCompiler fail to auto-vectorize, VecTrans successfully vectorizes 23 (46%) and achieves an average speedup of 2.02x.
- Score: 17.263612093919885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated great capabilities in code generation, yet their effective application in compiler optimizations remains an open challenge due to issues such as hallucinations and a lack of domain-specific reasoning. Vectorization, a crucial optimization for enhancing code performance, often fails because of the compiler's inability to recognize complex code patterns, which commonly require extensive empirical expertise. LLMs, with their ability to capture intricate patterns, thus provide a promising solution to this challenge. This paper presents VecTrans, a novel framework that leverages LLMs to enhance compiler-based code vectorization. VecTrans first employs compiler analysis to identify potentially vectorizable code regions. It then utilizes an LLM to refactor these regions into patterns that are more amenable to the compiler's auto-vectorization. To ensure semantic correctness, VecTrans further integrates a hybrid validation mechanism at the intermediate representation (IR) level. With these components, VecTrans combines the adaptability of LLMs with the precision of compiler vectorization, thereby effectively opening up vectorization opportunities. Experimental results show that among all 50 TSVC functions unvectorizable by Clang, GCC, and BiShengCompiler, VecTrans successfully vectorizes 23 cases (46%) and achieves an average speedup of 2.02x, greatly surpassing state-of-the-art performance.
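To make the workflow concrete, the sketch below shows the kind of refactoring the abstract describes, on a TSVC-style loop: the original early-exit loop typically defeats Clang's and GCC's auto-vectorizers, while an equivalent split form exposes a countable loop the compiler can vectorize. The functions and the specific transformation are illustrative assumptions in the spirit of the paper, not code taken from VecTrans.

```c
/* Hypothetical TSVC-style kernel (illustrative, not from the paper):
 * the data-dependent break makes the trip count unknown up front,
 * so Clang/GCC typically refuse to vectorize this loop. */
void scale_until_negative(float *a, const float *b, int n) {
    for (int i = 0; i < n; i++) {
        if (b[i] < 0.0f)
            break;               /* early exit blocks vectorization */
        a[i] = 2.0f * b[i];
    }
}

/* A refactoring an LLM might propose: split the loop so the search
 * stays scalar while the update becomes a countable loop that the
 * compiler's auto-vectorizer handles. Semantics are preserved. */
void scale_until_negative_refactored(float *a, const float *b, int n) {
    int cut = n;
    for (int i = 0; i < n; i++) {        /* scalar search for cut-off */
        if (b[i] < 0.0f) { cut = i; break; }
    }
    for (int i = 0; i < cut; i++) {      /* trivially vectorizable */
        a[i] = 2.0f * b[i];
    }
}
```

A hybrid IR-level validation step, as described in the abstract, would then confirm that the refactored function is semantically equivalent to the original before the compiler's auto-vectorizer is applied.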
Related papers
- SymRTLO: Enhancing RTL Code Optimization with LLMs and Neuron-Inspired Symbolic Reasoning [18.40402135952776]
This paper presents SymRTLO, a novel neuron-symbolic RTL optimization framework.
A symbolic module is proposed for analyzing and optimizing finite state machine (FSM) logic.
Experiments on the RTL-Rewriter benchmark with Synopsys Design Compiler and Yosys show that SymRTLO improves power, performance, and area (PPA) by up to 43.9%, 62.5%, and 51.1%, respectively.
arXiv Detail & Related papers (2025-04-14T16:15:55Z) - Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs [0.8217552831952]
Large language models (LLMs) have transformed the way we think about language understanding and generation. Group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. We present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions.
arXiv Detail & Related papers (2024-12-23T03:44:29Z) - Vector-ICL: In-context Learning with Continuous Vector Representations [75.96920867382859]
Large language models (LLMs) have shown remarkable in-context learning capabilities on textual data. We explore whether these capabilities can be extended to continuous vectors from diverse domains, obtained from black-box pretrained encoders. In particular, we find that pretraining projectors with general language modeling objectives enables Vector-ICL.
arXiv Detail & Related papers (2024-10-08T02:25:38Z) - LLM-Aided Compilation for Tensor Accelerators [6.709490736813537]
We discuss how large language models (LLMs) could be leveraged to build a compiler for hardware accelerators.
Specifically, we demonstrate the ability of GPT-4 to achieve high pass rates in translating code to the Gemmini accelerator.
We also propose a 2-phase workflow for utilizing LLMs to generate hardware-optimized code.
arXiv Detail & Related papers (2024-08-06T19:10:25Z) - Should AI Optimize Your Code? A Comparative Study of Classical Optimizing Compilers Versus Current Large Language Models [0.0]
Large Language Models (LLMs) raise intriguing questions about the potential of AI-driven approaches to revolutionize code optimization.
This work aims to answer an essential question for the compiler community: "Can AI-driven models revolutionize the way we approach code optimization?"
We present a comparative analysis between three classical optimizing compilers and two recent large language models.
arXiv Detail & Related papers (2024-06-17T23:26:41Z) - LLM-Vectorizer: LLM-based Verified Loop Vectorizer [12.048697450464935]
Large language models (LLMs) can generate vectorized code from scalar programs that process individual array elements.
LLMs are capable of producing high-performance vectorized code with run-time speedups ranging from 1.1x to 9.4x.
Our approach is able to verify 38.2% of vectorizations as correct on the TSVC benchmark dataset.
arXiv Detail & Related papers (2024-06-07T07:04:26Z) - Make Every Move Count: LLM-based High-Quality RTL Code Generation Using MCTS [20.135906487081453]
We present an automated transformer decoding algorithm that integrates Monte Carlo tree-search for lookahead.
For the largest design generated by the state-of-the-art LLM (16-bit adder), our technique can achieve a 31.8% improvement in the area-delay product.
arXiv Detail & Related papers (2024-02-05T18:47:04Z) - StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks.
FGO provides Fine-Grained Optimization by masking the unexecuted code segments, so the model is optimized only on code that actually executes.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding.
We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z) - Transpiling RTL Pseudo-code of the POWER Instruction Set Architecture to C for Real-time Performance Analysis on Cavatools Simulator [0.0]
This paper presents a transpiler framework for converting RTL pseudo code of the POWER Instruction Set Architecture (ISA) to C code.
The transpiler ensures compatibility with the Cavatools simulator by generating C code that adheres to its requirements.
The proposed framework facilitates the seamless integration of RTL pseudo code into the Cavatools ecosystem, enabling comprehensive performance analysis and optimization of Power ISA-based code.
arXiv Detail & Related papers (2023-06-14T18:53:14Z) - Inference with Reference: Lossless Acceleration of Large Language Models [97.04200102556551]
LLMA is an accelerator to speed up Large Language Model (LLM) inference with references.
It is motivated by the observation that abundant identical text spans exist between an LLM's decoding output and a reference that is available in many real-world scenarios.
arXiv Detail & Related papers (2023-04-10T09:55:14Z) - Factorizers for Distributed Sparse Block Codes [45.29870215671697]
We propose a fast and highly accurate method for factorizing distributed sparse block codes (SBCs).
Our iterative factorizer introduces a threshold-based nonlinear activation, conditional random sampling, and an $\ell_\infty$-based similarity metric.
We demonstrate the feasibility of our method on four deep CNN architectures over CIFAR-100, ImageNet-1K, and RAVEN datasets.
arXiv Detail & Related papers (2023-04-10T09:55:14Z) - Enabling Retargetable Optimizing Compilers for Quantum Accelerators via a Multi-Level Intermediate Representation [78.8942067357231]
We present a multi-level quantum-classical intermediate representation (IR) that enables an optimizing, retargetable, ahead-of-time compiler.
We support the entire gate-based OpenQASM 3 language and provide custom extensions for common quantum programming patterns and improved syntax.
Our work results in compile times that are 1000x faster than standard Pythonic approaches, and 5-10x faster than comparative standalone quantum language compilers.
arXiv Detail & Related papers (2021-09-01T17:29:47Z) - QTRAN++: Improved Value Transformation for Cooperative Multi-Agent Reinforcement Learning [70.382101956278]
QTRAN is a reinforcement learning algorithm capable of learning the largest class of joint-action value functions.
Despite its strong theoretical guarantee, it has shown poor empirical performance in complex environments.
We propose a substantially improved version, coined QTRAN++.
arXiv Detail & Related papers (2020-06-22T05:08:36Z)