HLSTester: Efficient Testing of Behavioral Discrepancies with LLMs for High-Level Synthesis
- URL: http://arxiv.org/abs/2504.14641v1
- Date: Sun, 20 Apr 2025 14:45:01 GMT
- Title: HLSTester: Efficient Testing of Behavioral Discrepancies with LLMs for High-Level Synthesis
- Authors: Kangwei Xu, Bing Li, Grace Li Zhang, Ulf Schlichtmann,
- Abstract summary: Existing methods for testing behavioral discrepancies in high-level synthesis (HLS) are still immature.<n>We propose HLSTester, a large language model (LLM) aided testing framework that efficiently detects behavioral discrepancies in HLS.<n>To mitigate hallucinations in LLMs, the testbenches for original C/C++ programs are leveraged to guide LLMs in generating HLS-compatible testbenches.
- Score: 4.544714097232974
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In high-level synthesis (HLS), C/C++ programs with synthesis directives are used to generate circuits for FPGA implementations. However, hardware-specific and platform-dependent characteristics in these implementations can introduce behavioral discrepancies between the original C/C++ programs and the circuits after high-level synthesis. Existing methods for testing behavioral discrepancies in HLS are still immature, and the testing workflow requires significant human efforts. To address this challenge, we propose HLSTester, a large language model (LLM) aided testing framework that efficiently detects behavioral discrepancies in HLS. To mitigate hallucinations in LLMs and enhance prompt quality, the testbenches for original C/C++ programs are leveraged to guide LLMs in generating HLS-compatible testbenches, effectively eliminating certain traditional C/C++ constructs that are incompatible with HLS tools. Key variables are pinpointed through a backward slicing technique in both C/C++ and HLS programs to monitor their runtime spectra, enabling an in-depth analysis of the discrepancy symptoms. To reduce test time, a testing input generation mechanism is introduced to integrate dynamic mutation with insights from an LLM-based progressive reasoning chain. In addition, repetitive hardware testing is skipped by a redundancy-aware filtering technique for the generated test inputs. Experimental results demonstrate that the proposed LLM-aided testing framework significantly accelerates the testing workflow while achieving higher testbench simulation pass rates compared with the traditional method and the direct use of LLMs on the same HLS programs.
Related papers
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.<n>We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space [82.75174050101108]
We introduce LatentSeek, a framework that enhances reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space.<n>LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024.<n>Results show that LatentSeek consistently outperforms strong baselines.
arXiv Detail & Related papers (2025-05-19T16:26:02Z) - HLS-Eval: A Benchmark and Framework for Evaluating LLMs on High-Level Synthesis Design Tasks [4.71707720395444]
We introduce HLS-Eval, the first complete benchmark and evaluation framework for HLS-driven design.<n>The benchmark includes 94 unique designs drawn from standard HLS benchmarks and novel sources.<n>Beyond the benchmark, HLS-Eval offers a modular Python framework for automated, parallel evaluation of both local and hosted LLMs.
arXiv Detail & Related papers (2025-04-16T17:30:36Z) - Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis [14.458529723566379]
Large language models (LLMs) can be employed for programming languages such as Python and C++.<n>This paper explores leveraging LLMs to generate High-Level Synthesis (HLS)-based hardware design.
arXiv Detail & Related papers (2025-02-19T17:53:59Z) - Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning [59.25951947621526]
We propose an approach which can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers.
We release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+), and analyzed synthetic verification methods with standard, reasoning-based, and reward-based LLMs.
Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances the verification accuracy.
arXiv Detail & Related papers (2025-02-19T15:32:11Z) - Unit Testing Past vs. Present: Examining LLMs' Impact on Defect Detection and Efficiency [2.4936576553283283]
The integration of Large Language Models (LLMs) into software engineering has shown potential to enhance productivity.<n>This paper investigates whether LLM support improves defect detection effectiveness during unit testing.
arXiv Detail & Related papers (2025-02-13T22:27:55Z) - C2HLSC: Leveraging Large Language Models to Bridge the Software-to-Hardware Design Gap [20.528640971583112]
This paper investigates Large Language Models for automatically C code into HLS-compatible formats.<n>We present a case study using an LLM to rewrite C code for NIST 800-22 tests randomness, a QuickSort algorithm, and AES-128 into HLS-synthesizable C.
arXiv Detail & Related papers (2024-11-29T19:22:52Z) - Toward Automated Validation of Language Model Synthesized Test Cases using Semantic Entropy [0.5057850174013127]
Modern Large Language Model (LLM)-based programming agents often rely on test execution feedback to refine their generated code.<n>This paper introduces VALTEST, a novel framework that leverages semantic entropy to automatically validate test cases generated by LLMs.<n>Experiments show that VALTEST boosts test validity by up to 29% and improves code generation performance, as evidenced by significant increases in pass@1 scores.
arXiv Detail & Related papers (2024-11-13T00:07:32Z) - SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists [59.08999823652293]
We propose SYNTHEVAL to generate a wide range of test types for a comprehensive evaluation of NLP models.
In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the taskspecific models consistently exhibit.
We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks.
arXiv Detail & Related papers (2024-08-30T17:41:30Z) - Are LLMs Any Good for High-Level Synthesis? [1.3927943269211591]
Large Language Models (LLMs) can streamline or replace the High-Level Synthesis (HLS) process.
LLMs can understand natural language specifications and translate C code or natural language specifications.
This study aims to illuminate the role of LLMs in HLS, identifying promising directions for optimized hardware design in applications such as AI acceleration, embedded systems, and high-performance computing.
arXiv Detail & Related papers (2024-08-19T21:40:28Z) - Case2Code: Scalable Synthetic Data for Code Generation [105.89741089673575]
Large Language Models (LLMs) have shown outstanding breakthroughs in code generation.<n>Recent work improves code LLMs by training on synthetic data generated by some powerful LLMs.<n>We propose a textbfCase2Code task by exploiting the expressiveness and correctness of programs.
arXiv Detail & Related papers (2024-07-17T11:35:00Z) - Large Language Models as Test Case Generators: Performance Evaluation and Enhancement [3.5398126682962587]
We study how well Large Language Models can generate high-quality test cases.
We propose a multi-agent framework called emphTestChain that decouples the generation of test inputs and test outputs.
Our results indicate that TestChain outperforms the baseline by a large margin.
arXiv Detail & Related papers (2024-04-20T10:27:01Z) - Hyperparameters in Continual Learning: A Reality Check [53.30082523545212]
Continual learning (CL) aims to train a model on a sequence of tasks while balancing the trade-off between plasticity (learning new tasks) and stability (retaining prior knowledge)
arXiv Detail & Related papers (2024-03-14T03:13:01Z) - Hint-before-Solving Prompting: Guiding LLMs to Effectively Utilize
Encoded Knowledge [85.17343729885003]
We introduce Hint-before-Solving Prompting (HSP), which guides the model to generate hints for solving the problem.
HSP can effectively improve the accuracy of reasoning tasks.
We build the HSPMATH dataset based on HSP and fine-tuned Llemma-7B, reaching 64.3 accuracy.
arXiv Detail & Related papers (2024-02-22T05:58:03Z) - StepCoder: Improve Code Generation with Reinforcement Learning from
Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks.
FGO only optimize the model by masking the unexecuted code segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.