Related papers: DUET: Agentic Design Understanding via Experimentation and Testing

URL: http://arxiv.org/abs/2512.06247v1
Date: Sat, 06 Dec 2025 02:16:28 GMT
Title: DUET: Agentic Design Understanding via Experimentation and Testing
Authors: Gus Henry Smith, Sandesh Adhikary, Vineet Thumuluri, Karthik Suresh, Vivek Pandit, Kartik Hegde, Hamid Shojaei, Chandra Bhagavatula
Abstract summary: DUET is a general methodology for developing Design Understanding via Experimentation and Testing. It iteratively generates hypotheses, tests them with EDA tools, and integrates the results to build a bottom-up understanding of the design. We show that DUET improves AI agent performance on formal verification, when compared to a baseline flow without experimentation.
Score: 6.787641711048685
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI agents powered by large language models (LLMs) are being used to solve increasingly complex software engineering challenges, but struggle with hardware design tasks. Register Transfer Level (RTL) code presents a unique challenge for LLMs, as it encodes complex, dynamic, time-evolving behaviors using the low-level language features of SystemVerilog. LLMs struggle to infer these complex behaviors from the syntax of RTL alone, which limits their ability to complete all downstream tasks like code completion, documentation, or verification. In response to this issue, we present DUET: a general methodology for developing Design Understanding via Experimentation and Testing. DUET mimics how hardware design experts develop an understanding of complex designs: not just via a one-off readthrough of the RTL, but via iterative experimentation using a number of tools. DUET iteratively generates hypotheses, tests them with EDA tools (e.g., simulation, waveform inspection, and formal verification), and integrates the results to build a bottom-up understanding of the design. In our evaluations, we show that DUET improves AI agent performance on formal verification, when compared to a baseline flow without experimentation.
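The hypothesize, test, integrate loop described in the abstract can be illustrated with a minimal sketch. This is not DUET's implementation: the names `Hypothesis`, `Understanding`, `simulate_counter`, and `run_loop` are hypothetical, a toy Python counter stands in for an RTL design and its simulator, and a fixed list of hypotheses stands in for the ones an LLM agent would generate iteratively.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Hypothesis:
    claim: str
    # Hypothetical check: runs an experiment and returns True if the claim held.
    experiment: Callable[[], bool]


@dataclass
class Understanding:
    # Accumulated notes about the design, built up one experiment at a time.
    confirmed: List[str] = field(default_factory=list)
    refuted: List[str] = field(default_factory=list)


def simulate_counter(cycles: int, width: int = 4) -> List[int]:
    """Toy stand-in for an RTL simulator: a wrapping up-counter reset to 0."""
    value, trace = 0, []
    for _ in range(cycles):
        trace.append(value)
        value = (value + 1) % (1 << width)
    return trace


def run_loop(hypotheses: List[Hypothesis]) -> Understanding:
    """Test each hypothesis and integrate the outcome into the notes."""
    notes = Understanding()
    for h in hypotheses:
        ok = h.experiment()
        (notes.confirmed if ok else notes.refuted).append(h.claim)
    return notes


# Assumed hypotheses about the toy counter; a real agent would propose these.
hypotheses = [
    Hypothesis("counter starts at 0 after reset",
               lambda: simulate_counter(1)[0] == 0),
    Hypothesis("counter wraps to 0 after 16 cycles",
               lambda: simulate_counter(17)[16] == 0),
    Hypothesis("counter saturates at 15",
               lambda: simulate_counter(17)[16] == 15),
]

notes = run_loop(hypotheses)
print("confirmed:", notes.confirmed)
print("refuted:", notes.refuted)
```

In this sketch a refuted hypothesis is recorded rather than discarded, since knowing that the counter does not saturate is itself part of the understanding of the design.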

Related papers

Step-Level Sparse Autoencoder for Reasoning Process Interpretation [48.99201531966593]
Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. We propose step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features.
arXiv Detail & Related papers (2026-03-03T14:25:02Z)
Large Language Model Agent for User-friendly Chemical Process Simulations [0.0]
A large language model (LLM) agent is integrated with AVEVA Process Model Protocol (MCP), allowing natural language simulations. Two case studies assess the framework across different task complexities and interaction modes. The framework benefits both educational purposes, by translating technical concepts and demonstrating, and experienced practitioners by automating data extraction, speeding routine tasks, and supporting. While current limitations such as oversimplification, calculation errors, and technical hiccups mean expert oversight is still needed, the framework suggests LLM-based agents can become valuable collaborators.
arXiv Detail & Related papers (2026-01-15T12:18:45Z)
Understanding Specification-Driven Code Generation with LLMs: An Empirical Study Design [2.687678248171195]
Large Language Models (LLMs) are increasingly integrated into software development, yet their behavior in structured, specification-driven processes remains poorly understood. This paper presents an empirical study design using CURRANTE, a Visual Studio Code extension that enables a human-in-the-loop workflow for LLM-assisted code generation. The study aims to analyze how human intervention in specification and test refinement influences the quality and dynamics of LLM-generated code.
arXiv Detail & Related papers (2026-01-07T12:46:57Z)
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence [150.3696990310269]
Large language models (LLMs) have transformed automated software development by enabling direct translation of natural language descriptions into functional code. We provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs. We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder).
arXiv Detail & Related papers (2025-11-23T17:09:34Z)
Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: System Shell Layer, Prompt Orchestration Layer, and LLM Inference Core.
arXiv Detail & Related papers (2025-08-28T13:00:28Z)
Analyzing Prominent LLMs: An Empirical Study of Performance and Complexity in Solving LeetCode Problems [0.0]
Large Language Models (LLMs) like ChatGPT, Copilot, Gemini, and DeepSeek are transforming software engineering by automating key tasks. This study benchmarks these four prominent LLMs on one hundred and fifty LeetCode problems across easy, medium, and hard difficulties. We evaluate each model based on execution time, memory usage, and algorithmic complexity, revealing significant performance differences.
arXiv Detail & Related papers (2025-08-05T21:50:52Z)
VeriMind: Agentic LLM for Automated Verilog Generation with a Novel Evaluation Metric [4.590930025882158]
We propose VeriMind, an agentic LLM framework for Verilog code generation. We introduce a novel evaluation metric, pass@ARC, which combines the conventional pass@k measure with Average Refinement Cycles (ARC) to capture both success rate and the efficiency of iterative refinement. Experimental results on diverse hardware design tasks demonstrated that our approach achieved up to 8.3% improvement on the pass@k metric and 8.1% on the pass@ARC metric.
arXiv Detail & Related papers (2025-03-15T23:43:06Z)
VerilogReader: LLM-Aided Hardware Test Generation [5.012023213660125]
Large Language Models (LLMs), with their advanced understanding and inference capabilities, have introduced a novel approach. In this work, we investigate the integration of LLMs into the Coverage Directed Test Generation (CDG) process. We compare our framework with random testing, using our self-designed Verilog benchmark suite.
arXiv Detail & Related papers (2024-06-03T07:20:51Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM. For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs. We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z)
Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions. We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
When Do Program-of-Thoughts Work for Reasoning? [51.2699797837818]
We propose complexity-impacted reasoning score (CIRS) to measure correlation between code and reasoning abilities. Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity. Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.
arXiv Detail & Related papers (2023-08-29T17:22:39Z)
