Related papers: Comparative Analysis of the Code Generated by Popular Large Language Models (LLMs) for MISRA C++ Compliance

Related papers

CodeSimpleQA: Scaling Factuality in Code Large Language Models [55.705748501461294]
We present CodeSimpleQA, a comprehensive benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions.<n>We also create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-12-22T14:27:17Z)
LLM-CSEC: Empirical Evaluation of Security in C/C++ Code Generated by Large Language Models [3.82562358840301]
This work focuses on examining and evaluating the security of large language models (LLMs)<n>We used ten different LLMs for code generation and analyzed the outputs through static analysis.<n>The amount of Common Weaknession (CWE)s present in AI-generated code is concerning.
arXiv Detail & Related papers (2025-11-24T10:31:53Z)
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence [150.3696990310269]
Large language models (LLMs) have transformed automated software development by enabling direct translation of natural language descriptions into functional code.<n>We provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs.<n>We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder)
arXiv Detail & Related papers (2025-11-23T17:09:34Z)
InfCode-C++: Intent-Guided Semantic Retrieval and AST-Structured Search for C++ Issue Resolution [31.437457217953835]
We introduce INFCODE-C++, the first C++-aware autonomous system for end-to-end issue resolution.<n>The system combines two complementary retrieval mechanisms -- semantic code-intent retrieval and deterministic AST-structured querying.<n>It achieves a resolution rate of 25.58%, outperforming the strongest prior agent by 10.85 percentage points and more than doubling the performance of MSWE-agent.
arXiv Detail & Related papers (2025-11-20T03:05:26Z)
SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models [59.90381306452982]
evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer.<n>We introduce SWE-1, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework.<n>SWE- spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests.
arXiv Detail & Related papers (2025-11-07T18:01:32Z)
Distilling Lightweight Language Models for C/C++ Vulnerabilities [7.45549460594508]
FineSec is a novel framework that harnesses Large Language Models through knowledge distillation to enable efficient and precise vulnerability identification in C/C++s.<n>By integrating data preparation, training, evaluation, and continuous learning into a unified, single-task workflow, FineSec offers a streamlined approach.
arXiv Detail & Related papers (2025-10-08T04:58:51Z)
Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity [4.478789600295493]
We present a large-scale comparison of code authored by human developers and three state-of-the-art LLMs, i.e., ChatGPT, DeepSeek-Coder, and Qwen-Coder.<n>Our evaluation spans over 500k code samples in two widely used languages, Python and Java, classifying defects via Orthogonal Defect Classification and security vulnerabilities using the Common Weaknession.<n>We find that AI-generated code is generally simpler and more prone to unused constructs and hardcoded, while human-written code exhibits greater structural complexity and a higher concentration of maintainability issues.
arXiv Detail & Related papers (2025-08-29T13:51:28Z)
A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code [49.009041488527544]
A.S.E is a repository-level evaluation benchmark for assessing the security of AI-generated code.<n>Current large language models (LLMs) still struggle with secure coding.<n>A larger reasoning budget does not necessarily lead to better code generation.
arXiv Detail & Related papers (2025-08-25T15:11:11Z)
IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
Decompiling Smart Contracts with a Large Language Model [51.49197239479266]
Despite Etherscan's 78,047,845 smart contracts deployed on (as of May 26, 2025), a mere 767,520 ( 1%) are open source.<n>This opacity necessitates the automated semantic analysis of on-chain smart contract bytecode.<n>We introduce a pioneering decompilation pipeline that transforms bytecode into human-readable and semantically faithful Solidity code.
arXiv Detail & Related papers (2025-06-24T13:42:59Z)
Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.<n>We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
The Code Barrier: What LLMs Actually Understand? [7.407441962359689]
This research uses code obfuscation as a structured testing framework to evaluate semantic understanding capabilities of language models.<n>Findings show a statistically significant performance decline as obfuscation complexity increases.<n>This research introduces a new evaluation approach for assessing code comprehension in language models.
arXiv Detail & Related papers (2025-04-14T14:11:26Z)
Security and Quality in LLM-Generated Code: A Multi-Language, Multi-Model Analysis [10.268191178804168]
This paper analyzes the security of code generated by Large Language Models (LLMs) across different programming languages.<n>Our research shows that while LLMs can automate code creation, their security effectiveness varies by language.
arXiv Detail & Related papers (2025-02-03T22:03:13Z)
INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness [110.6921470281479]
We introduce INDICT: a new framework that empowers large language models with Internal Dialogues of Critiques for both safety and helpfulness guidance. The internal dialogue is a dual cooperative system between a safety-driven critic and a helpfulness-driven critic. We observed that our approach can provide an advanced level of critiques of both safety and helpfulness analysis, significantly improving the quality of output codes.
arXiv Detail & Related papers (2024-06-23T15:55:07Z)
Synthetic Programming Elicitation for Text-to-Code in Very Low-Resource Programming and Formal Languages [21.18996339478024]
We introduce emphsynthetic programming elicitation and compilation (SPEAC) SPEAC produces syntactically correct programs more frequently and without sacrificing semantic correctness. We empirically evaluate the performance of SPEAC in a case study for the UCLID5 formal verification language.
arXiv Detail & Related papers (2024-06-05T22:16:19Z)
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs. Our studies reveal a new and universal safety vulnerability of these models against code input. We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z)
Rust for Embedded Systems: Current State, Challenges and Open Problems (Extended Report) [6.414678578343769]
This paper performs the first systematic study to holistically understand the current state and challenges of using RUST for embedded systems. We collected a dataset of 2,836 RUST embedded software spanning various categories and 5 Static Application Security Testing ( SAST) tools. We found that existing RUST software support is inadequate, SAST tools cannot handle certain features of RUST embedded software, resulting in failures, and the prevalence of advanced types in existing RUST software makes it challenging to engineer interoperable code.
arXiv Detail & Related papers (2023-11-08T23:59:32Z)
ProgSG: Cross-Modality Representation Learning for Programs in Electronic Design Automation [38.023395256208055]
High-level synthesis (HLS) allows a developer to compile a high-level description in the form of software code in C and C++. HLS tools still require microarchitecture decisions, expressed in terms of pragmas. We propose ProgSG allowing the source code sequence modality and the graph modalities to interact with each other in a deep and fine-grained way.
arXiv Detail & Related papers (2023-05-18T09:44:18Z)
Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation. Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.