DeepSeek-V3, GPT-4, Phi-4, and LLaMA-3.3 generate correct code for LoRaWAN-related engineering tasks
- URL: http://arxiv.org/abs/2502.14926v3
- Date: Mon, 07 Apr 2025 18:44:32 GMT
- Title: DeepSeek-V3, GPT-4, Phi-4, and LLaMA-3.3 generate correct code for LoRaWAN-related engineering tasks
- Authors: Daniel Fernandes, João P. Matos-Carvalho, Carlos M. Fernandes, Nuno Fachada
- Abstract summary: This paper investigates the performance of 16 Large Language Models (LLMs) in automating LoRaWAN-related engineering tasks. To assess this, we compared locally run models against state-of-the-art alternatives, such as GPT-4 and DeepSeek-V3. Results show that while DeepSeek-V3 and GPT-4 consistently provided accurate solutions, certain smaller models -- particularly Phi-4 and LLaMA-3.3 -- also demonstrated strong performance.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper investigates the performance of 16 Large Language Models (LLMs) in automating LoRaWAN-related engineering tasks involving optimal placement of drones and received power calculation under progressively complex zero-shot, natural language prompts. The primary research question is whether lightweight, locally executed LLMs can generate correct Python code for these tasks. To assess this, we compared locally run models against state-of-the-art alternatives, such as GPT-4 and DeepSeek-V3, which served as reference points. By extracting and executing the Python functions generated by each model, we evaluated their outputs on a zero-to-five scale. Results show that while DeepSeek-V3 and GPT-4 consistently provided accurate solutions, certain smaller models -- particularly Phi-4 and LLaMA-3.3 -- also demonstrated strong performance, underscoring the viability of lightweight alternatives. Other models exhibited errors stemming from incomplete understanding or syntactic issues. These findings illustrate the potential of LLM-based approaches for specialized engineering applications while highlighting the need for careful model selection, rigorous prompt design, and targeted domain fine-tuning to achieve reliable outcomes.
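As a concrete illustration of the kind of function these prompts target, the sketch below computes received power under a log-distance path-loss model. It is a minimal sketch under stated assumptions -- the model choice, function name, and default exponent are illustrative, not the paper's exact task or reference solution.

```python
import math

def received_power_dbm(p_tx_dbm: float, freq_hz: float, dist_m: float,
                       path_loss_exp: float = 2.7, d0_m: float = 1.0) -> float:
    """Received power (dBm) under a log-distance path-loss model.

    Free-space loss at reference distance d0, then log-distance
    roll-off with exponent path_loss_exp (illustrative default).
    """
    c = 3e8  # speed of light, m/s
    # Free-space path loss at the reference distance d0 (in dB).
    fspl_d0 = 20 * math.log10(4 * math.pi * d0_m * freq_hz / c)
    # Log-distance path loss beyond d0.
    pl = fspl_d0 + 10 * path_loss_exp * math.log10(dist_m / d0_m)
    return p_tx_dbm - pl

# Example: 14 dBm transmit power, EU868 LoRaWAN band, 500 m link.
print(round(received_power_dbm(14.0, 868e6, 500.0), 1))  # ~ -90 dBm
```

A full LoRaWAN link budget would additionally account for antenna gains, cable losses, and fading margins; the paper's prompts grow progressively more complex along such lines.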
Related papers
- Code Generation with Small Language Models: A Deep Evaluation on Codeforces [2.314213846671956]
Small Language Models offer faster inference, lower deployment overhead, and better adaptability to domain-specific tasks.
We benchmark five open SLMs across 280 Codeforces problems spanning Elo ratings from 800 to 2100.
PHI-4 14B achieved the best performance among SLMs, with a pass@3 of 63.6%.
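For reference, pass@k is the probability that at least one of k sampled solutions passes all tests; the unbiased estimator popularized by the HumanEval/Codex evaluation can be computed as below (a sketch, not necessarily this paper's exact harness):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n generations of which c are correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 4 correct -> pass@3
print(round(pass_at_k(10, 4, 3), 3))  # 0.833
```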
arXiv Detail & Related papers (2025-04-09T23:57:44Z)
- GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks [0.0]
Sonnet 3.5 and GPT-4o achieve the best overall performance, with Claude models excelling on solvable tasks.
Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation.
arXiv Detail & Related papers (2025-03-23T16:20:14Z)
- See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses [51.975495361024606]
We propose a Self-Challenge evaluation framework with human-in-the-loop.
Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances.
We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses.
arXiv Detail & Related papers (2024-08-16T19:01:52Z)
- MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents [62.02920842630234]
We show how to build small fact-checking models that have GPT-4-level performance but for 400x lower cost.
We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors.
For evaluation, we unify datasets from recent work on fact-checking and grounding LLM generations into a new benchmark, LLM-AggreFact.
arXiv Detail & Related papers (2024-04-16T17:59:10Z)
- LLMs Still Can't Avoid Instanceof: An Investigation Into GPT-3.5, GPT-4 and Bard's Capacity to Handle Object-Oriented Programming Assignments [0.0]
Large Language Models (LLMs) have emerged as promising tools to assist students while solving programming assignments.
In this study, we experimented with three prominent LLMs to solve real-world OOP exercises used in educational settings.
The findings revealed that while the models frequently achieved mostly working solutions to the exercises, they often overlooked the best practices of OOP.
arXiv Detail & Related papers (2024-03-10T16:40:05Z)
- Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies [47.129504708849446]
Large Language Models (LLMs) have revolutionized the field of Natural Language Processing.
LLMs lack systematic generalization, the ability to extrapolate learned statistical regularities outside the training distribution.
In this work, we offer a systematic benchmarking of GPT-4, one of the most advanced LLMs available.
arXiv Detail & Related papers (2024-02-27T10:44:52Z)
- TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the power of large language models (LLMs) to solve this task.
We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- Large Language Models for Test-Free Fault Localization [11.080712737595174]
We propose a language model based fault localization approach that locates buggy lines of code without any test coverage information.
We fine-tune language models with 350 million, 6 billion, and 16 billion parameters on small, manually curated corpora of buggy programs.
Our empirical evaluation shows that LLMAO improves the Top-1 results over the state-of-the-art machine learning fault localization (MLFL) baselines by 2.3%-54.4%, and Top-5 results by 14.4%-35.6%.
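Top-N in fault localization conventionally counts a bug as localized if at least one ground-truth buggy line appears among the N most suspicious lines a model ranks; a minimal sketch of that metric, with illustrative data:

```python
def top_n_accuracy(ranked_lines: list[list[int]],
                   buggy_lines: list[set[int]], n: int) -> float:
    """Fraction of bugs where at least one ground-truth buggy line
    appears among the n highest-ranked suspicious lines."""
    hits = sum(
        1 for ranked, truth in zip(ranked_lines, buggy_lines)
        if truth & set(ranked[:n])
    )
    return hits / len(ranked_lines)

# Example: two programs, line ranks from most to least suspicious.
ranks = [[12, 7, 3, 40], [5, 9, 22, 1]]
truth = [{3}, {18}]
print(top_n_accuracy(ranks, truth, 3))  # 0.5: first bug in top-3, second not
```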
arXiv Detail & Related papers (2023-10-03T01:26:39Z)
- Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle? [0.0]
Large Language Models (LLMs) are advancing at a rapid pace, with significant improvements in natural language processing and coding tasks.
To evaluate the proficiency of various LLMs, we created a set of five tasks that probe their ability to parse, understand, analyze, and create knowledge graphs serialized in Turtle syntax.
The evaluation encompassed four commercially available LLMs - GPT-3.5, GPT-4, Claude 1.3, and Claude 2.0, as well as two freely accessible offline models, GPT4All Vicuna and GPT4All Falcon 13B.
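Turtle is a compact text serialization for RDF triples, so the parsing side of such a task might look like the sketch below, using the rdflib Python library (the graph content is illustrative, not from the paper's task set):

```python
from rdflib import Graph

# A tiny knowledge graph in Turtle syntax (illustrative content).
turtle_doc = """
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob .
ex:bob   ex:knows ex:carol .
"""

g = Graph()
g.parse(data=turtle_doc, format="turtle")

# Enumerate the parsed subject-predicate-object triples.
for s, p, o in g:
    print(s, p, o)
print(len(g), "triples")  # 2 triples
```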
arXiv Detail & Related papers (2023-09-29T10:36:04Z)
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents [53.78782375511531]
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks.
This paper investigates generative LLMs for relevance ranking in Information Retrieval (IR).
To address concerns about data contamination of LLMs, we collect a new test set called NovelEval.
To improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models.
arXiv Detail & Related papers (2023-04-19T10:16:03Z)
- Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)