From Leaderboard to Deployment: Code Quality Challenges in AV Perception Repositories
- URL: http://arxiv.org/abs/2603.02194v1
- Date: Mon, 02 Mar 2026 18:54:28 GMT
- Title: From Leaderboard to Deployment: Code Quality Challenges in AV Perception Repositories
- Authors: Mateus Karvat, Bram Adams, Sidney Givigi
- Abstract summary: This study systematically analyzed 178 unique models from the KITTI and NuScenes 3D Object Detection leaderboards. Our findings revealed that only 7.3% of the studied repositories meet basic production-readiness criteria. The adoption of Continuous Integration/Continuous Deployment pipelines was correlated with better code maintainability.
- Score: 4.603321798937855
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autonomous vehicle (AV) perception models are typically evaluated solely on benchmark performance metrics, with limited attention to code quality, production readiness and long-term maintainability. This creates a significant gap between research excellence and real-world deployment in safety-critical systems subject to international safety standards. To address this gap, we present the first large-scale empirical study of software quality in AV perception repositories, systematically analyzing 178 unique models from the KITTI and NuScenes 3D Object Detection leaderboards. Using static analysis tools (Pylint, Bandit, and Radon), we evaluated code errors, security vulnerabilities, maintainability, and development practices. Our findings revealed that only 7.3% of the studied repositories meet basic production-readiness criteria, defined as having zero critical errors and no high-severity security vulnerabilities. Security issues are highly concentrated, with the top five issues responsible for almost 80% of occurrences, which prompted us to develop a set of actionable guidelines to prevent them. Additionally, the adoption of Continuous Integration/Continuous Deployment pipelines was correlated with better code maintainability. Our findings highlight that leaderboard performance does not reflect production readiness and that targeted interventions could substantially improve the quality and safety of AV perception code.
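The production-readiness criterion described in the abstract (zero critical errors and no high-severity security vulnerabilities) can be sketched as a small filter over the JSON reports that Pylint and Bandit emit. The field names below (`type` in Pylint's JSON output, `issue_severity` in Bandit's) follow those tools' documented report formats, but the `is_production_ready` helper itself is an illustrative reconstruction of the paper's criterion, not the authors' actual analysis code.

```python
import json

def is_production_ready(pylint_json: str, bandit_json: str) -> bool:
    """Apply the paper's basic production-readiness criterion:
    zero critical (error/fatal) Pylint messages and no
    high-severity Bandit findings."""
    pylint_msgs = json.loads(pylint_json)      # Pylint: list of message dicts
    bandit_report = json.loads(bandit_json)    # Bandit: {"results": [...]}

    # Pylint marks critical problems with type "error" or "fatal"
    critical = [m for m in pylint_msgs
                if m.get("type") in ("error", "fatal")]
    # Bandit tags each finding with an issue_severity of LOW/MEDIUM/HIGH
    high_sev = [r for r in bandit_report.get("results", [])
                if r.get("issue_severity") == "HIGH"]
    return not critical and not high_sev

# Minimal synthetic reports: one high-severity Bandit finding fails the check.
pylint_out = json.dumps([{"type": "warning", "symbol": "unused-import"}])
bandit_out = json.dumps({"results": [{"issue_severity": "HIGH",
                                      "test_id": "B602"}]})
print(is_production_ready(pylint_out, bandit_out))  # False
```

In practice the inputs would come from `pylint --output-format=json` and `bandit -r <repo> -f json`; the same pattern extends to Radon's maintainability-index output for the paper's maintainability analysis.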
Related papers
- SeRe: A Security-Related Code Review Dataset Aligned with Real-World Review Activities [8.215547096412346]
Existing datasets and studies primarily focus on general-purpose code review comments. We introduce SeRe, a security-related code review dataset, constructed using an active learning-based ensemble classification approach. We extracted 6,732 security-related reviews from 373,824 raw review instances, ensuring representativeness across multiple programming languages.
arXiv Detail & Related papers (2026-01-03T02:39:53Z)
- Quality Assurance of LLM-generated Code: Addressing Non-Functional Quality Characteristics [3.0540716731676625]
Existing studies focus mainly on whether generated code passes the tests rather than whether it passes with quality. This study conducted three complementary investigations: a systematic review of 108 papers, two industry workshops with practitioners from multiple organizations, and an empirical analysis of patching real-world software issues. We found that security and performance efficiency dominate academic attention, while maintainability and other qualities are understudied.
arXiv Detail & Related papers (2025-11-13T12:56:07Z)
- Is Your Prompt Poisoning Code? Defect Induction Rates and Security Mitigation Strategies [4.435429537888066]
Large language models (LLMs) have become indispensable for automated code generation, yet the quality and security of their outputs remain a critical concern. We propose an evaluation framework for prompt quality encompassing three key dimensions: goal clarity, information completeness, and logical consistency. Our findings highlight that enhancing the quality of user prompts constitutes a critical and effective strategy for strengthening the security of AI-generated code.
arXiv Detail & Related papers (2025-10-27T02:59:17Z)
- CARE: Decoding Time Safety Alignment via Rollback and Introspection Intervention [68.95008546581339]
Existing decoding-time interventions, such as Contrastive Decoding, often force a severe trade-off between safety and response quality. We propose CARE, a novel framework for decoding-time safety alignment that integrates three key components. The framework achieves a superior balance of safety, quality, and efficiency, attaining a low harmful response rate and minimal disruption to the user experience.
arXiv Detail & Related papers (2025-09-01T04:50:02Z)
- A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code [49.009041488527544]
A.S.E is a repository-level evaluation benchmark for assessing the security of AI-generated code. Current large language models (LLMs) still struggle with secure coding. A larger reasoning budget does not necessarily lead to better code generation.
arXiv Detail & Related papers (2025-08-25T15:11:11Z)
- Beyond Easy Wins: A Text Hardness-Aware Benchmark for LLM-generated Text Detection [0.38233569758620056]
We present a novel evaluation paradigm for AI text detectors that prioritizes real-world and equitable assessment. Our benchmark, SHIELD, addresses these limitations by integrating both reliability and stability factors into a unified evaluation metric. We develop a model-agnostic humanification framework that modifies AI text to more closely resemble human authorship, incorporating a controllable hardness parameter.
arXiv Detail & Related papers (2025-07-21T06:37:27Z)
- Domain-Agnostic Scalable AI Safety Ensuring Framework [6.421238475415244]
We propose the first domain-agnostic AI safety framework that achieves strong safety guarantees while preserving high performance. Our framework includes: (1) an optimization component with chance constraints, (2) a safety classification model, (3) internal test data, (4) conservative testing procedures, (5) informative dataset quality measures, and (6) continuous approximate loss functions with gradients.
arXiv Detail & Related papers (2025-04-29T16:38:35Z)
- Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation [52.83870601473094]
Embodied agents exhibit immense potential across a multitude of domains. Existing research predominantly concentrates on the security of general large language models. This paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents.
arXiv Detail & Related papers (2025-04-22T08:34:35Z)
- Towards Trustworthy GUI Agents: A Survey [64.6445117343499]
This survey examines the trustworthiness of GUI agents in five critical dimensions. We identify major challenges such as vulnerability to adversarial attacks and cascading failure modes in sequential decision-making. As GUI agents become more widespread, establishing robust safety standards and responsible development practices is essential.
arXiv Detail & Related papers (2025-03-30T13:26:00Z)
- REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models [59.445672459851274]
REVAL is a comprehensive benchmark designed to evaluate the REliability and VALue of Large Vision-Language Models. REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability and Values. We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro.
arXiv Detail & Related papers (2025-03-20T07:54:35Z)
- SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI [58.29510889419971]
Existing benchmarks for evaluating the security risks and capabilities of code-generating large language models (LLMs) face several key limitations. We introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations. Applying this framework to Python, C/C++, and Java, we build SeCodePLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities.
arXiv Detail & Related papers (2024-10-14T21:17:22Z)
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the listed papers (including all information) and is not responsible for any consequences arising from their use.