Data Annotation Quality Problems in AI-Enabled Perception System Development
- URL: http://arxiv.org/abs/2511.16410v1
- Date: Thu, 20 Nov 2025 14:30:51 GMT
- Title: Data Annotation Quality Problems in AI-Enabled Perception System Development
- Authors: Hina Saeeda, Tommy Johansson, Mazen Mohamad, Eric Knauss,
- Abstract summary: Data annotation is essential but highly error-prone in the development of AI-enabled perception systems. We develop a taxonomy of 18 recurring annotation error types across three data-quality dimensions. This study contributes to SE4AI by offering a shared vocabulary, diagnostic toolset, and actionable guidance for building trustworthy AI-enabled perception systems.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data annotation is essential but highly error-prone in the development of AI-enabled perception systems (AIePS) for automated driving, and its quality directly influences model performance, safety, and reliability. However, the industry lacks empirical insights into how annotation errors emerge and spread across the multi-organisational automotive supply chain. This study addresses this gap through a multi-organisation case study involving six companies and four research institutes across Europe and the UK. Based on 19 semi-structured interviews with 20 experts (50 hours of transcripts) and a six-phase thematic analysis, we develop a taxonomy of 18 recurring annotation error types across three data-quality dimensions: completeness (e.g., attribute omission, missing feedback loops, edge-case omissions, selection bias), accuracy (e.g., mislabelling, bounding-box inaccuracies, granularity mismatches, bias-driven errors), and consistency (e.g., inter-annotator disagreement, ambiguous instructions, misaligned hand-offs, cross-modality inconsistencies). The taxonomy was validated with industry practitioners, who reported its usefulness for root-cause analysis, supplier quality reviews, onboarding, and improving annotation guidelines. They described it as a failure-mode catalogue similar to FMEA. By conceptualising annotation quality as a lifecycle and supply-chain issue, this study contributes to SE4AI by offering a shared vocabulary, diagnostic toolset, and actionable guidance for building trustworthy AI-enabled perception systems.
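The three data-quality dimensions and the example error types quoted in the abstract can be encoded as a simple lookup structure. The sketch below is a minimal illustration: the dimension names and error types come from the abstract, but the dict layout and the `dimension_of` helper are hypothetical conveniences, not part of the authors' toolset.

```python
# Illustrative encoding of the paper's three data-quality dimensions and a
# subset of its 18 annotation error types (names taken from the abstract).
ANNOTATION_ERROR_TAXONOMY = {
    "completeness": [
        "attribute_omission",
        "missing_feedback_loops",
        "edge_case_omission",
        "selection_bias",
    ],
    "accuracy": [
        "mislabelling",
        "bounding_box_inaccuracy",
        "granularity_mismatch",
        "bias_driven_error",
    ],
    "consistency": [
        "inter_annotator_disagreement",
        "ambiguous_instructions",
        "misaligned_handoffs",
        "cross_modality_inconsistency",
    ],
}

def dimension_of(error_type: str):
    """Return the data-quality dimension an error type belongs to, or None."""
    for dimension, errors in ANNOTATION_ERROR_TAXONOMY.items():
        if error_type in errors:
            return dimension
    return None
```

A structure like this could back the root-cause-analysis and supplier-review uses the practitioners describe, e.g. tagging each observed defect with its error type and aggregating by dimension.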
Related papers
- Label Curation Using Agentic AI [3.500372926575144]
We present AURA, an agentic AI framework for large-scale, multi-modal data annotation. AURA coordinates multiple AI agents to generate and validate labels without requiring ground truth. AURA achieves accuracy improvements of up to 5.8% over baseline. In more challenging settings with poor-quality annotators, the improvement is up to 50% over baseline.
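One generic way multiple annotators (human or agent) can cross-validate labels without ground truth is consensus voting. The sketch below is only an illustration of that idea; AURA's actual coordination protocol is not described in the abstract, and `aggregate_labels` and its threshold are hypothetical.

```python
from collections import Counter

def aggregate_labels(agent_labels, min_agreement=0.5):
    """Accept the majority label when enough agents agree; else flag for review.

    Illustrative sketch only: shows consensus among independent annotators as
    a ground-truth-free validation signal, not AURA's actual algorithm.
    """
    label, votes = Counter(agent_labels).most_common(1)[0]
    if votes / len(agent_labels) > min_agreement:
        return label
    return None  # no consensus: route to a human or another validation round
```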
arXiv Detail & Related papers (2026-01-30T18:58:52Z)
- Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification [71.98473277917962]
Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. We propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification.
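The verify-then-refine loop this abstract describes can be sketched generically. The `generate` and `verify` callables below are placeholders standing in for the policy model and a rubric-based verifier; DeepVerifier's actual interfaces are not given in the abstract, so this is an assumption-laden illustration of the pattern, not the system itself.

```python
def refine_with_rubric(draft, rubric, generate, verify, max_iters=3):
    """Iteratively verify an output against a rubric and refine on failure.

    Hypothetical sketch: `verify(draft, rubric)` returns (passed, feedback),
    and `generate(draft, feedback)` produces a revised draft.
    """
    for _ in range(max_iters):
        passed, feedback = verify(draft, rubric)
        if passed:
            return draft  # verifier accepts: stop early
        draft = generate(draft, feedback)
    return draft  # budget exhausted: return the latest draft
```

The loop encodes the "asymmetry of verification" intuition: checking an output against a rubric is assumed to be cheaper and more reliable than producing a correct output in one shot.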
arXiv Detail & Related papers (2026-01-22T09:47:31Z)
- RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension [65.81339691942757]
RPC-Bench is a large-scale question-answering benchmark built from review-rebuttal exchanges of high-quality computer science papers. We design a fine-grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts.
arXiv Detail & Related papers (2026-01-14T11:37:00Z)
- Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation [7.268064183717186]
This study proposes an agentic XAI framework combining SHAP-based explainability with multimodal LLM-driven iterative refinement. We tested this framework as an agricultural recommendation system using rice yield data from 26 fields in Japan.
arXiv Detail & Related papers (2025-12-24T09:19:15Z)
- RE for AI in Practice: Managing Data Annotation Requirements for AI Autonomous Driving Systems [3.9255502531644204]
High-quality data annotation requirements are crucial for the development of safe and reliable AI-enabled systems. Our study investigates how annotation requirements are defined and used in practice. Key challenges include ambiguity, edge-case complexity, evolving requirements, inconsistencies, and resource constraints.
arXiv Detail & Related papers (2025-11-19T20:27:30Z)
- AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research [81.04845910798387]
Generating natural language explanations for threat detections remains an open problem in cybersecurity research. We present AutoMalDesc, an automated static analysis summarization framework that operates independently at scale. We publish our complete dataset of more than 100K script samples, including an annotated seed dataset (0.9K), along with our methodology and evaluation framework.
arXiv Detail & Related papers (2025-11-17T13:05:25Z)
- Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
- A Defect Classification Framework for AI-Based Software Systems (AI-ODC) [0.0]
This paper proposes a framework inspired by the Orthogonal Defect Classification (ODC) paradigm. The framework was adapted to accommodate the Data, Learning, and Thinking aspects of AI systems.
arXiv Detail & Related papers (2025-08-25T11:15:31Z)
- The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z)
- Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG [51.120170062795566]
We propose Divide-Then-Align (DTA) to endow RAG systems with the ability to respond with "I don't know" when a query falls outside the knowledge boundary. DTA balances accuracy with appropriate abstention, enhancing the reliability and trustworthiness of retrieval-augmented systems.
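The abstention behaviour described here can be illustrated with a deliberately simplified gate. DTA's actual knowledge-boundary test is learned via alignment, not a fixed score threshold, so the function and threshold below are purely hypothetical.

```python
def answer_or_abstain(retrieval_score, answer, threshold=0.7):
    """Return the answer only when retrieval confidence clears a threshold.

    Minimal illustration of abstention in a RAG system; DTA itself learns
    the knowledge boundary rather than using a hand-set cutoff like this.
    """
    if retrieval_score >= threshold:
        return answer
    return "I don't know"  # query judged outside the knowledge boundary
```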
arXiv Detail & Related papers (2025-05-27T08:21:21Z)
- Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). However, RAG pipelines often fail to ensure that model reasoning remains consistent with the retrieved evidence, leading to factual inconsistencies or unsupported conclusions. We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA). We also introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z)
- A Comprehensive Study of Bug-Fix Patterns in Autonomous Driving Systems [16.72158049599736]
We present an empirical study that investigates bug-fix patterns in autonomous driving systems (ADSes). We analyze the commit histories and bug reports of two major autonomous driving projects, Apollo and Autoware, drawing on 1,331 bug fixes. Our study reveals several dominant bug-fix patterns, including those related to path planning, data flow, and configuration management.
arXiv Detail & Related papers (2025-02-04T02:13:05Z)
- Adaptive Distraction: Probing LLM Contextual Robustness with Automated Tree Search [76.54475437069395]
Large Language Models (LLMs) often struggle to maintain their original performance when faced with semantically coherent but task-irrelevant contextual information. We propose a dynamic distraction-generation framework based on tree search, where the generation process is guided by model behavior.
arXiv Detail & Related papers (2025-02-03T18:43:36Z)
- Bridging the Communication Gap: Evaluating AI Labeling Practices for Trustworthy AI Development [41.64451715899638]
High-level AI labels, inspired by frameworks like EU energy labels, have been proposed to make the properties of AI models more transparent. This study evaluates AI labeling through qualitative interviews along four key research questions.
arXiv Detail & Related papers (2025-01-21T06:00:14Z)
- Guidance in Radiology Report Summarization: An Empirical Evaluation and Error Analysis [3.0204520109309847]
We propose a domain-agnostic guidance signal for summarizing radiology reports.
We run an expert evaluation of four systems according to a taxonomy of 11 fine-grained errors.
We find that the most pressing differences between automatic summaries and those of radiologists relate to content selection.
arXiv Detail & Related papers (2023-07-24T13:54:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.