Exploring and Characterizing Large Language Models For Embedded System
Development and Debugging
- URL: http://arxiv.org/abs/2307.03817v2
- Date: Wed, 22 Nov 2023 01:26:07 GMT
- Title: Exploring and Characterizing Large Language Models For Embedded System
Development and Debugging
- Authors: Zachary Englhardt, Richard Li, Dilini Nissanka, Zhihan Zhang, Girish
Narayanswamy, Joseph Breda, Xin Liu, Shwetak Patel, Vikram Iyer
- Abstract summary: Large language models (LLMs) have shown remarkable abilities to generate code; however, their ability to develop software for embedded systems has not been studied.
We develop an open source framework to evaluate leading LLMs to assess their capabilities and limitations for embedded system development.
We leverage this finding to study how human programmers interact with these tools, and develop a human-AI-based software engineering workflow for building embedded systems.
- Score: 10.967443876391611
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) have shown remarkable abilities to generate
code; however, their ability to develop software for embedded systems, which
requires cross-domain knowledge of hardware and software, has not been studied.
In this paper, we develop an extensible, open-source hardware-in-the-loop
framework to systematically evaluate leading LLMs (GPT-3.5, GPT-4, PaLM 2) to
assess their capabilities and limitations for embedded system development. We
observe through our study that even when these tools fail to produce working
code, they consistently generate helpful reasoning about embedded design tasks.
We leverage this finding to study how human programmers interact with these
tools, and develop a human-AI-based software engineering workflow for building
embedded systems.
Our evaluation platform for verifying LLM-generated programs uses
sensor-actuator pairs for physical evaluation. We compare all three models with
N=450 experiments and find, surprisingly, that GPT-4 in particular shows an exceptional
level of cross-domain understanding and reasoning, in some cases generating
fully correct programs from a single prompt. In N=50 trials, GPT-4 produces
functional I2C interfaces 66% of the time. GPT-4 also produces register-level
drivers, code for LoRa communication, and context-specific power optimizations
for an nRF52 program, resulting in an over 740x reduction in current to 12.2 uA. We
also characterize the models' limitations in order to develop a generalizable human-AI
workflow for using LLMs in embedded system development. We evaluate our
workflow with 15 users including novice and expert programmers. We find that
our workflow improves productivity for all users and increases the success rate
for building a LoRa environmental sensor from 25% to 100%, including for users
with zero hardware or C/C++ experience.
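To make the scale of these tasks concrete, below is a minimal sketch, not taken from the paper or its framework, of the kind of single-prompt I2C program the study asks the models to generate. It uses the Arduino Wire API; the sensor address and register are hypothetical placeholders for whatever part a hardware-in-the-loop harness would wire up.

```cpp
// Hypothetical example (not from the paper): read one register from an
// I2C sensor and print the raw value for a test harness to check.
#include <Wire.h>

const uint8_t SENSOR_ADDR = 0x48;  // hypothetical 7-bit I2C address
const uint8_t TEMP_REG    = 0x00;  // hypothetical register to read

void setup() {
  Serial.begin(115200);
  Wire.begin();  // join the I2C bus as controller
}

void loop() {
  // Point the sensor at the target register, holding the bus with a
  // repeated start so the read follows immediately.
  Wire.beginTransmission(SENSOR_ADDR);
  Wire.write(TEMP_REG);
  if (Wire.endTransmission(false) == 0) {
    Wire.requestFrom(SENSOR_ADDR, (uint8_t)2);  // read two bytes back
    if (Wire.available() >= 2) {
      uint8_t msb = Wire.read();
      uint8_t lsb = Wire.read();
      int16_t raw = ((int16_t)msb << 8) | lsb;
      Serial.println(raw);  // a sensor-actuator harness would verify this
    }
  }
  delay(1000);
}
```

A sensor-actuator pair of the kind described above could then apply a known physical stimulus and check the printed readings, which is the sort of physical verification the paper's evaluation platform performs.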
Related papers
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z) - LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation [13.800675921118348]
We propose a novel interactive workflow TiCoder for guided intent clarification.
We present an empirical evaluation of the effectiveness of the workflow to improve code generation accuracy.
We observe an average absolute improvement of 45.97% in the pass@1 code generation accuracy for both datasets and across all LLMs within 5 user interactions.
arXiv Detail & Related papers (2024-04-15T19:16:32Z) - SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents [50.82665351100067]
FlowGen is a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents.
We evaluate FlowGenScrum on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET.
arXiv Detail & Related papers (2024-03-23T14:04:48Z) - Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming [12.355284125578342]
Large Language Models (LLMs) have become a focal point in modern software development.
LLMs offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants.
However, each system requires the LLM to be honed to its set of workspaces to ensure the best performance.
arXiv Detail & Related papers (2024-02-22T03:51:34Z) - LLM4PLC: Harnessing Large Language Models for Verifiable Programming of
PLCs in Industrial Control Systems [9.946058168276744]
Large Language Models (LLMs) fail to produce valid programs for Industrial Control Systems (ICS) operated by Programmable Logic Controllers (PLCs)
We propose a user-guided iterative pipeline leveraging user feedback and external verification tools including grammar checkers, compilers and SMV verifiers.
We run a complete test suite on GPT-3.5, GPT-4, Code Llama-7B, a fine-tuned Code Llama-7B model, Code Llama-34B, and a fine-tuned Code Llama-34B model.
arXiv Detail & Related papers (2024-01-08T23:52:42Z) - Experimenting a New Programming Practice with LLMs [6.8035637735756715]
We develop a prototype named AISD (AI-aided Software Development)
It is capable of taking high-level (potentially vague) user requirements as inputs.
It generates detailed use cases, prototype system designs, and subsequently system implementation.
arXiv Detail & Related papers (2024-01-02T06:50:20Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs)
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - Comparing Software Developers with ChatGPT: An Empirical Investigation [0.0]
This paper conducts an empirical investigation, contrasting the performance of software engineers and AI systems, like ChatGPT, across different evaluation metrics.
The paper posits that a comprehensive comparison of software engineers and AI-based solutions, considering various evaluation criteria, is pivotal in fostering human-machine collaboration.
arXiv Detail & Related papers (2023-05-19T17:25:54Z) - Technology Readiness Levels for Machine Learning Systems [107.56979560568232]
Development and deployment of machine learning systems can be executed easily with modern tools, but the process is typically rushed and treated as a means to an end.
We have developed a proven systems engineering approach for machine learning development and deployment.
Our "Machine Learning Technology Readiness Levels" framework defines a principled process to ensure robust, reliable, and responsible systems.
arXiv Detail & Related papers (2021-01-11T15:54:48Z) - Technology Readiness Levels for AI & ML [79.22051549519989]
Development of machine learning systems can be executed easily with modern tools, but the process is typically rushed and treated as a means to an end.
Engineering systems follow well-defined processes and testing standards to streamline development for high-quality, reliable results.
We propose a proven systems engineering approach for machine learning development and deployment.
arXiv Detail & Related papers (2020-06-21T17:14:34Z)