Related papers: LLM-FSM: Scaling Large Language Models for Finite-State Reasoning in RTL Code Generation

URL: http://arxiv.org/abs/2602.07032v1
Date: Tue, 03 Feb 2026 04:48:26 GMT
Title: LLM-FSM: Scaling Large Language Models for Finite-State Reasoning in RTL Code Generation
Authors: Yuheng Wu, Berk Gokmen, Zhouhua Xie, Peijing Li, Caroline Trippel, Priyanka Raina, Thierry Tambe
Abstract summary: We present LLM-FSM, a benchmark that evaluates how well large language models (LLMs) can recover finite-state machine (FSM) behavior. Unlike prior specification-to-RTL benchmarks that rely on manually constructed examples, LLM-FSM is built through a fully automated pipeline.
Score: 3.4714122723537333
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Finite-state reasoning, the ability to understand and implement state-dependent behavior, is central to hardware design. In this paper, we present LLM-FSM, a benchmark that evaluates how well large language models (LLMs) can recover finite-state machine (FSM) behavior from natural-language specifications and translate it into correct register-transfer-level (RTL) implementations. Unlike prior specification-to-RTL benchmarks that rely on manually constructed examples, LLM-FSM is built through a fully automated pipeline. LLM-FSM first constructs FSMs with configurable state counts and constrained transition structures. It then prompts LLMs to express each FSM in a structured YAML format with an application context, and to further convert that YAML into a natural-language (NL) specification. From the same YAML, our pipeline synthesizes the reference RTL and testbench in a correct-by-construction manner. All 1,000 problems are verified using LLM-based and SAT-solver-based checks, with human review on a subset. Our experiments show that even the strongest LLMs exhibit sharply declining accuracy as FSM complexity increases. We further demonstrate that training-time scaling via supervised fine-tuning (SFT) generalizes effectively to out-of-distribution (OOD) tasks, while increasing test-time compute improves reasoning reliability. Finally, LLM-FSM remains extensible by allowing its FSM complexity to scale with future model capabilities.
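
As a rough illustration of the first pipeline step described in the abstract (constructing an FSM with a configurable state count and a constrained transition structure), here is a minimal Python sketch. It is not the authors' generator: the function names `make_fsm`, `reachable`, and `run`, the binary input alphabet, and the specific constraint (every state must be reachable from the reset state) are all assumptions chosen for illustration.

```python
import random
from collections import deque


def make_fsm(num_states, inputs=("0", "1"), seed=0):
    """Build a random, fully specified FSM (hypothetical helper, not from the paper)."""
    rng = random.Random(seed)
    states = [f"S{i}" for i in range(num_states)]
    # Start from a random transition table: (state, input) -> next state.
    table = {(s, x): rng.choice(states) for s in states for x in inputs}
    # Assumed constraint: force a chain S0 -> S1 -> ... on the first input
    # symbol so that every state is reachable from the reset state S0.
    for i in range(num_states - 1):
        table[(states[i], inputs[0])] = states[i + 1]
    return {"reset": states[0], "states": states,
            "inputs": list(inputs), "table": table}


def reachable(fsm):
    """Return the set of states reachable from reset (breadth-first search)."""
    seen, queue = {fsm["reset"]}, deque([fsm["reset"]])
    while queue:
        state = queue.popleft()
        for x in fsm["inputs"]:
            nxt = fsm["table"][(state, x)]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def run(fsm, sequence):
    """Simulate the FSM on an input sequence and return the visited states."""
    state = fsm["reset"]
    trace = [state]
    for x in sequence:
        state = fsm["table"][(state, x)]
        trace.append(state)
    return trace
```

A dictionary like the one returned by `make_fsm(5)` could then be serialized to a structured format and simulated with `run` to produce reference traces; how LLM-FSM actually encodes its YAML and derives its RTL and testbenches is not specified in this listing.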

Related papers

AutoFSM: A Multi-agent Framework for FSM Code Generation with IR and SystemC-Based Testing [2.5793366206387827]
This paper proposes AutoFSM, a collaborative framework designed for finite state machine (FSM) code generation tasks. AutoFSM introduces a structurally clear intermediate representation (IR) to reduce syntax error rate during code generation. It is the first to integrate SystemC-based modeling with automatic testbench generation.
arXiv Detail & Related papers (2025-12-12T09:15:46Z)
Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning [95.44766931218896]
Multi-modal large language models (MLLMs) still lag behind text-based reasoning. We introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable. We propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO) to align the MLLM's perceptual output with the final reasoning task.
arXiv Detail & Related papers (2025-06-05T02:28:07Z)
An Empirical Study of Conformal Prediction in LLM with ASP Scaffolds for Robust Reasoning [52.29223403698673]
This paper examines the use of Conformal Language Modelling (CLM) alongside Answer Set Programming (ASP). We apply CLM to generate sets of ASP programs from an LLM, providing statistical guarantees on the correctness of the outputs. Experimental results show that CLM significantly outperforms baseline models that use standard sampling methods.
arXiv Detail & Related papers (2025-03-07T14:10:10Z)
Stacking Small Language Models for Generalizability [0.0]
Large language models (LLMs) generalize strong performance across different natural language benchmarks. This paper introduces a new approach called fine-tuning stacks of language models (FSLM). By fine-tuning each SLM to perform a specific task, this approach breaks down high-level reasoning into multiple lower-level steps that specific SLMs are responsible for. As a result, FSLM allows for lower training and inference costs, and also improves model interpretability as each SLM communicates with the subsequent one through natural language.
arXiv Detail & Related papers (2024-10-21T01:27:29Z)
Logic Distillation: Learning from Code Function by Function for Decision-making Tasks [71.08128339865428]
Large language models (LLMs) have garnered increasing attention owing to their powerful logical reasoning capabilities. Knowledge distillation (KD) aims to empower S-LLMs with the capabilities of L-LLMs, while S-LLMs merely mimic the outputs of L-LLMs. We propose a novel framework called Logic Distillation (LD) to tackle the identified challenges.
arXiv Detail & Related papers (2024-07-28T05:34:42Z)
$\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks [21.12437562185667]
This paper presents a new approach for scaling LLM assessment in translating formal syntax to natural language. We use context-free grammars (CFGs) to generate out-of-distribution datasets on the fly. We also conduct an assessment of several SOTA closed and open-source LLMs to showcase the feasibility and scalability of this paradigm.
arXiv Detail & Related papers (2024-03-27T08:08:00Z)
PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels. We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings. We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z)
Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks. How do we evaluate the capabilities of LLMs to consistently produce factually correct answers? We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning [70.38817963253034]
This paper first discusses the challenges of federated fine-tuning of LLMs, and introduces our package FS-LLM as a main contribution. We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios. We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z)
Translating Regulatory Clauses into Executable Codes for Building Design Checking via Large Language Model Driven Function Matching and Composing [8.293447126161475]
We propose a large language model (LLM)-based approach with rule-based adaptive prompts that match clauses to atomic functions. Experiments show LLM-FuncMapper outperforms fine-tuning methods by 19% in function matching. A case study demonstrates that LLM-FuncMapper can automatically compose multiple atomic functions to generate executable code.
arXiv Detail & Related papers (2023-08-17T01:58:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.