Benchmarking Large Language Models on Homework Assessment in Circuit Analysis
- URL: http://arxiv.org/abs/2506.06390v1
- Date: Thu, 05 Jun 2025 15:16:30 GMT
- Title: Benchmarking Large Language Models on Homework Assessment in Circuit Analysis
- Authors: Liangliang Chen, Zhihao Qin, Yiming Guo, Jacqueline Rohde, Ying Zhang
- Abstract summary: Large language models (LLMs) have the potential to revolutionize various fields, including code development, robotics, finance, and education. This paper investigates how LLMs can be leveraged in engineering education. We benchmark the capabilities of different LLMs, including GPT-3.5 Turbo, GPT-4o, and Llama 3 70B.
- Score: 5.040027178589699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have the potential to revolutionize various fields, including code development, robotics, finance, and education, due to their extensive prior knowledge and rapid advancements. This paper investigates how LLMs can be leveraged in engineering education. Specifically, we benchmark the capabilities of different LLMs, including GPT-3.5 Turbo, GPT-4o, and Llama 3 70B, in assessing homework for an undergraduate-level circuit analysis course. We have developed a novel dataset consisting of official reference solutions and real student solutions to problems from various topics in circuit analysis. To overcome the limitations of image recognition in current state-of-the-art LLMs, the solutions in the dataset are converted to LaTeX format. Using this dataset, a prompt template is designed to test five metrics of student solutions: completeness, method, final answer, arithmetic error, and units. The results show that GPT-4o and Llama 3 70B perform significantly better than GPT-3.5 Turbo across all five metrics, with GPT-4o and Llama 3 70B each having distinct advantages in different evaluation aspects. Additionally, we present insights into the limitations of current LLMs in several aspects of circuit analysis. Given the paramount importance of ensuring reliability in LLM-generated homework assessment to avoid misleading students, our results establish benchmarks and offer valuable insights for the development of a reliable, personalized tutor for circuit analysis -- a focus of our future work. Furthermore, the proposed evaluation methods can be generalized to a broader range of courses for engineering education in the future.
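The listing does not reproduce the paper's actual prompt template, but a minimal sketch of how such a five-metric grading prompt could be assembled and parsed is shown below; `query_llm` is a hypothetical chat-completion wrapper, and the JSON response convention is an assumption for parsing convenience, not the authors' format.

```python
# Illustrative sketch only: the paper's exact prompt template is not reproduced
# in this listing. `query_llm` is a hypothetical chat-completion wrapper, and
# the JSON response convention is an assumption for parsing convenience.
import json

METRICS = ["completeness", "method", "final_answer", "arithmetic_error", "units"]

PROMPT_TEMPLATE = """You are grading an undergraduate circuit analysis homework problem.
Problem statement (LaTeX): {problem}
Reference solution (LaTeX): {reference}
Student solution (LaTeX): {student}

For each of the metrics {metrics}, respond "pass" or "fail" with a one-sentence reason.
Return a JSON object mapping each metric to {{"verdict": "...", "reason": "..."}}."""

def grade_solution(problem: str, reference: str, student: str, query_llm) -> dict:
    """Ask an LLM to assess one student solution on the five metrics."""
    prompt = PROMPT_TEMPLATE.format(
        problem=problem, reference=reference, student=student, metrics=METRICS
    )
    raw = query_llm(prompt)   # e.g. a wrapper around a GPT-4o or Llama 3 chat call
    report = json.loads(raw)  # assumes the model complied with the JSON instruction
    # Keep only the expected metrics so malformed extras do not leak into results.
    return {metric: report[metric] for metric in METRICS}
```

Requesting structured JSON rather than free text makes the five verdicts easy to aggregate across a whole problem set.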
Related papers
- BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models [50.17907898478795]
We introduce BinMetric, a benchmark designed to evaluate the performance of large language models on binary analysis tasks. BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks. Our empirical study on this benchmark investigates the binary analysis capabilities of various state-of-the-art LLMs, revealing their strengths and limitations in this field.
arXiv Detail & Related papers (2025-05-12T08:54:07Z)
- Performance Review on LLM for solving leetcode problems [7.377558533352298]
This paper presents a comprehensive performance evaluation of Large Language Models (LLMs) in solving programming challenges from Leetcode. We generated solutions with multiple LLMs, including GPT-4 and GPT-3.5-turbo. Our results highlight the strengths and limitations of current LLMs in code generation and problem-solving tasks.
arXiv Detail & Related papers (2025-02-16T08:52:45Z)
- CIRCUIT: A Benchmark for Circuit Interpretation and Reasoning Capabilities of LLMs [15.34624510334892]
The role of Large Language Models (LLMs) has not been extensively explored in analog circuit design. We created the CIRCUIT dataset consisting of 510 question-answer pairs spanning various levels of analog-circuit-related subjects. The best-performing model on our dataset, GPT-4o, achieves 48.04% accuracy when evaluated on the final numerical answer; one way to score such answers is sketched after this entry.
arXiv Detail & Related papers (2025-02-11T21:53:48Z)
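The listing does not state how CIRCUIT matches final numerical answers; a common choice, shown here as an assumption rather than the paper's rule, is relative-tolerance comparison.

```python
# Hedged sketch: the listing does not state CIRCUIT's matching rule. Relative-
# tolerance comparison is one common way to score final numerical answers such
# as node voltages or branch currents; the 1% tolerance here is a placeholder.
import math

def numeric_match(predicted: float, expected: float, rel_tol: float = 1e-2) -> bool:
    """Count a predicted value as correct if it is within 1% of the reference."""
    return math.isclose(predicted, expected, rel_tol=rel_tol, abs_tol=1e-9)

def accuracy(predictions: list[float], references: list[float]) -> float:
    """Fraction of question-answer pairs whose final numbers agree."""
    assert len(predictions) == len(references)
    hits = sum(numeric_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```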
- GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation [108.2008975785364]
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input. GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak); this loop is sketched after the entry link below.
arXiv Detail & Related papers (2024-10-11T03:05:06Z)
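A loose sketch of the observe-reflect-speak loop summarized above: `knowledge_graph.retrieve` and the `llm` callable are hypothetical stand-ins, and the actual method couples each step to knowledge-graph structure in ways this listing does not detail.

```python
# Loose sketch of the observe-reflect-speak loop. The `knowledge_graph.retrieve`
# store and the `llm` callable are hypothetical stand-ins; the actual GIVE
# method couples each step to knowledge-graph structure.
def give_answer(query: str, knowledge_graph, llm) -> str:
    # Observe: select the most pertinent expert data for this query.
    facts = knowledge_graph.retrieve(query, top_k=5)
    # Reflect: query-specific divergent thinking over the retrieved facts.
    hypotheses = llm(f"Given facts {facts}, list plausible reasoning paths for: {query}")
    # Speak: synthesize facts and reasoning paths into the final output.
    return llm(f"Facts: {facts}\nReasoning paths: {hypotheses}\nAnswer the query: {query}")
```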
- Evaluating the Performance of Large Language Models in Competitive Programming: A Multi-Year, Multi-Grade Analysis [0.0]
This study explores the performance of large language models (LLMs) in solving competitive programming problems from the Romanian Informatics Olympiad at the county level.
We collected and analyzed a dataset comprising 304 challenges from 2002 to 2023.
The analysis revealed significant variations in LLM performance across different grades and problem types.
arXiv Detail & Related papers (2024-08-31T10:39:54Z)
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback; such a repair loop is sketched after the entry link below.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
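A minimal sketch of such a training-free self-critique loop, under stated assumptions: `llm` and `compile_and_test` are hypothetical helpers, and the paper's actual bug taxonomy and prompts are not reproduced here.

```python
# Sketch of a training-free self-critique loop: generate code, compile/test it,
# feed the diagnostics back, and ask the model to repair. `llm` and
# `compile_and_test` are assumed helpers, not the paper's actual interface.
def self_critique_repair(task: str, llm, compile_and_test, max_rounds: int = 3) -> str:
    code = llm(f"Write a solution for: {task}")
    for _ in range(max_rounds):
        ok, diagnostics = compile_and_test(code)  # compiler errors / failing tests
        if ok:
            break
        critique = llm(
            f"Code:\n{code}\nDiagnostics:\n{diagnostics}\n"
            "Classify the bug type and explain how to fix it."
        )
        code = llm(f"Apply this critique and return corrected code only:\n{critique}\n\n{code}")
    return code
```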
- Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation [11.056044348209483]
Unit testing, crucial for identifying bugs in code modules like classes and methods, is often neglected by developers due to time constraints.
Large Language Models (LLMs), like GPT and Mistral, show promise in software engineering, including in test generation.
arXiv Detail & Related papers (2024-06-28T20:38:41Z)
- Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study [72.24266814625685]
We explore the performance of large language models (LLMs) across the entire software development lifecycle with DevEval. DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions; the FLOP-utilization metric is sketched after the entry link below.
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
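Model FLOPs utilization (MFU) is a standard ratio of achieved to peak throughput; the sketch below uses placeholder numbers, not the paper's measurements.

```python
# Hedged sketch of model FLOPs utilization (MFU): achieved training throughput
# as a fraction of the accelerator's peak. All numbers below are placeholders,
# not measurements from the paper.
def mfu(tokens_per_second: float, flops_per_token: float, peak_flops: float) -> float:
    """MFU = achieved FLOP/s divided by the hardware's peak FLOP/s."""
    return (tokens_per_second * flops_per_token) / peak_flops

# A decoder-only transformer trains with roughly 6 * N FLOPs per token
# (forward plus backward pass), where N is the parameter count.
n_params = 7e9
ratio = mfu(tokens_per_second=1_000,
            flops_per_token=6 * n_params,
            peak_flops=312e12)  # A100 peak dense BF16 throughput, 312 TFLOP/s
print(f"MFU ~ {ratio:.1%}")
```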
- GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond [29.778018058541676]
GPT-Fathom is an open-source and reproducible evaluation suite for large language models (LLMs) built on top of OpenAI Evals.
We evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings.
arXiv Detail & Related papers (2023-09-28T16:43:35Z)
- Large Language Models as Data Preprocessors [9.99065004972981]
Large Language Models (LLMs) have marked a significant advancement in artificial intelligence.
This study explores their potential in data preprocessing, a critical stage in data mining and analytics applications.
We propose an LLM-based framework for data preprocessing, which integrates cutting-edge prompt engineering techniques.
arXiv Detail & Related papers (2023-08-30T23:28:43Z)
- Instruction Tuning for Large Language Models: A Survey [52.86322823501338]
We make a systematic review of the literature, including the general methodology of supervised fine-tuning (SFT). We also review the potential pitfalls of SFT and criticism against it, along with efforts pointing out current deficiencies of existing strategies.
arXiv Detail & Related papers (2023-08-21T15:35:16Z)
- Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark for Large Language Models [23.344490944210456]
We present 515Bench, a more challenging benchmark dataset for evaluating the problem-solving abilities of large language models (LLMs).
We curate challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT JEE-Advanced exam.
Our evaluation of various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%; self-consistency voting is sketched after the entry link below.
arXiv Detail & Related papers (2023-05-24T11:55:59Z)
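A minimal sketch of the self-consistency technique mentioned above: sample several chain-of-thought completions and majority-vote on the extracted final answers. `llm_sample` is a hypothetical sampling call (temperature > 0), not the paper's evaluation harness.

```python
# Minimal sketch of self-consistency: sample several chain-of-thought
# completions and majority-vote on the extracted final answers. `llm_sample`
# is a hypothetical sampling call (temperature > 0), not the paper's harness.
from collections import Counter

def self_consistent_answer(question: str, llm_sample, n_samples: int = 10) -> str:
    finals = []
    for _ in range(n_samples):
        chain = llm_sample(f"Solve step by step and end with 'Answer: <value>'.\n{question}")
        # Keep only the final answer; the reasoning paths are allowed to differ.
        finals.append(chain.rsplit("Answer:", 1)[-1].strip())
    return Counter(finals).most_common(1)[0][0]
```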