TimelyHLS: LLM-Based Timing-Aware and Architecture-Specific FPGA HLS Optimization
- URL: http://arxiv.org/abs/2507.17962v1
- Date: Wed, 23 Jul 2025 22:08:15 GMT
- Title: TimelyHLS: LLM-Based Timing-Aware and Architecture-Specific FPGA HLS Optimization
- Authors: Nowfel Mashnoor, Mohammad Akyash, Hadi Kamali, Kimia Azar
- Abstract summary: TimelyHLS generates HLS code annotated with both timing-critical and design-specific pragmas. TimelyHLS consistently achieves timing closure and functional correctness across platforms.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving timing closure and design-specific optimizations in FPGA-targeted High-Level Synthesis (HLS) remains a significant challenge due to the complex interaction between architectural constraints, resource utilization, and the absence of automated support for platform-specific pragmas. In this work, we propose TimelyHLS, a novel framework integrating Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to automatically generate and iteratively refine HLS code optimized for FPGA-specific timing and performance requirements. TimelyHLS is driven by a structured architectural knowledge base containing FPGA-specific features, synthesis directives, and pragma templates. Given a kernel, TimelyHLS generates HLS code annotated with both timing-critical and design-specific pragmas. The synthesized RTL is then evaluated using commercial toolchains, and simulation correctness is verified against reference outputs via custom testbenches. TimelyHLS iteratively incorporates synthesis logs and performance reports into the LLM engine for refinement in the presence of functional discrepancies. Experimental results across 10 FPGA architectures and diverse benchmarks show that TimelyHLS reduces the need for manual tuning by up to 70%, while achieving up to 4x latency speedup (e.g., 3.85x for Matrix Multiplication, 3.7x for Bitonic Sort) and over 50% area savings in certain cases (e.g., 57% FF reduction in Viterbi). TimelyHLS consistently achieves timing closure and functional correctness across platforms, highlighting the effectiveness of LLM-driven, architecture-aware synthesis in automating FPGA design.
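For intuition, the sketch below illustrates the kind of pragma-annotated HLS kernel the abstract describes, using matrix multiplication since it is one of the reported benchmarks. This is a minimal, hand-written example with Vitis-style directives (PIPELINE, ARRAY_PARTITION, UNROLL); the kernel name, sizes, and specific pragma choices are assumptions for illustration, not output produced by TimelyHLS.

```cpp
// Illustrative sketch only: a small matrix-multiplication kernel annotated with
// timing-critical, Vitis-style HLS pragmas. Sizes and pragma choices are
// assumptions, not the paper's actual generated code.
#define N 16

void matmul(const int A[N][N], const int B[N][N], int C[N][N]) {
    // Partition the reduction dimension so the unrolled inner loop can read
    // all operands in the same cycle.
    #pragma HLS ARRAY_PARTITION variable=A complete dim=2
    #pragma HLS ARRAY_PARTITION variable=B complete dim=1

Row:
    for (int i = 0; i < N; ++i) {
Col:
        for (int j = 0; j < N; ++j) {
            #pragma HLS PIPELINE II=1   // target an initiation interval of 1 cycle
            int acc = 0;
Product:
            for (int k = 0; k < N; ++k) {
                #pragma HLS UNROLL      // fully unroll the reduction for parallel MACs
                acc += A[i][k] * B[k][j];
            }
            C[i][j] = acc;
        }
    }
}
```

In the workflow the abstract outlines, a kernel like this would be synthesized with a commercial toolchain, checked against reference outputs via a testbench, and the resulting timing and utilization reports fed back to the LLM so it can revise the pragma choices for the target FPGA architecture.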
Related papers
- QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution [53.13952833016505]
We propose a low-bit quantization model for real-world video super-resolution (VSR). We use a calibration dataset to measure both spatial and temporal complexity for each layer. We refine the FP and low-bit branches to achieve simultaneous optimization.
arXiv Detail & Related papers (2025-08-06T14:35:59Z) - Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z) - LIFT: LLM-Based Pragma Insertion for HLS via GNN Supervised Fine-Tuning [38.679497621876926]
LIFT is a large language model (LLM)-based coding assistant for HLS that automatically generates performance-critical pragmas. We fine-tune the LLM by tightly integrating and supervising the training process with a graph neural network (GNN).
arXiv Detail & Related papers (2025-04-29T21:42:59Z) - HLS-Eval: A Benchmark and Framework for Evaluating LLMs on High-Level Synthesis Design Tasks [4.71707720395444]
We introduce HLS-Eval, the first complete benchmark and evaluation framework for HLS-driven design. The benchmark includes 94 unique designs drawn from standard HLS benchmarks and novel sources. Beyond the benchmark, HLS-Eval offers a modular Python framework for automated, parallel evaluation of both local and hosted LLMs.
arXiv Detail & Related papers (2025-04-16T17:30:36Z) - SymRTLO: Enhancing RTL Code Optimization with LLMs and Neuron-Inspired Symbolic Reasoning [18.40402135952776]
This paper presents SymRTLO, a novel neuron-symbolic RTL optimization framework. A symbolic module is proposed for analyzing and optimizing finite state machine (FSM) logic. Experiments on the RTL-Rewriter benchmark with Synopsys Design Compiler and Yosys show that SymRTLO improves power, performance, and area (PPA) by up to 43.9%, 62.5%, and 51.1%, respectively.
arXiv Detail & Related papers (2025-04-14T16:15:55Z) - TuRTLe: A Unified Evaluation of LLMs for RTL Generation [0.6010802600885173]
We propose TuRTLe, a unified evaluation framework designed to assess LLMs across key RTL generation tasks. We benchmark a diverse set of open LLMs and analyze their strengths and weaknesses in EDA-specific tasks. Our results show that reasoning-based models, such as DeepSeek R1, consistently outperform others across multiple evaluation criteria.
arXiv Detail & Related papers (2025-03-31T07:43:12Z) - LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs. We introduce LLM2, a novel framework that combines an LLM with a process-based verifier. In LLM2, the LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z) - Unlocking Real-Time Fluorescence Lifetime Imaging: Multi-Pixel Parallelism for FPGA-Accelerated Processing [2.369919866595525]
We propose a method to achieve real-time FLI using an FPGA-based hardware accelerator.
We implement a GRU-based sequence-to-sequence (Seq2Seq) model on an FPGA board compatible with time-resolved cameras.
By integrating a GRU-based Seq2Seq model and its compressed version, called Seq2SeqLite, we were able to process multiple pixels in parallel, reducing latency compared to sequential processing.
arXiv Detail & Related papers (2024-10-09T18:24:23Z) - AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
arXiv Detail & Related papers (2024-07-17T18:38:48Z) - Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding.
We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z) - LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics [45.666822327616046]
This work presents a novel reconfigurable architecture for Low Latency Graph Neural Network (LL-GNN) designs for particle detectors.
The LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.
arXiv Detail & Related papers (2022-09-28T12:55:35Z) - VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer [121.85581713299918]
We propose VAQF, a framework that builds inference accelerators on FPGA platforms for quantized Vision Transformers (ViTs).
Given the model structure and the desired frame rate, VAQF will automatically output the required quantization precision for activations.
This is the first time quantization has been incorporated into ViT acceleration on FPGAs.
arXiv Detail & Related papers (2022-01-17T20:27:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.