AutoMT: A Multi-Agent LLM Framework for Automated Metamorphic Testing of Autonomous Driving Systems
- URL: http://arxiv.org/abs/2510.19438v1
- Date: Wed, 22 Oct 2025 10:11:05 GMT
- Title: AutoMT: A Multi-Agent LLM Framework for Automated Metamorphic Testing of Autonomous Driving Systems
- Authors: Linfeng Liang, Chenkai Tan, Yao Deng, Yingfeng Cai, T. Y Chen, Xi Zheng,
- Abstract summary: AutoMT is a framework that automates the extraction of Metamorphic Relations (MRs) from local traffic rules.<n>A vision-language agent analyzes scenarios, and a search agent retrieves suitable MRs from a RAG-based database to generate follow-up cases.<n>Experiments show that AutoMT achieves up to 5 x higher test diversity in follow-up case generation compared to the best baseline.
- Score: 14.084623367678084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous Driving Systems (ADS) are safety-critical, where failures can be severe. While Metamorphic Testing (MT) is effective for fault detection in ADS, existing methods rely heavily on manual effort and lack automation. We present AutoMT, a multi-agent MT framework powered by Large Language Models (LLMs) that automates the extraction of Metamorphic Relations (MRs) from local traffic rules and the generation of valid follow-up test cases. AutoMT leverages LLMs to extract MRs from traffic rules in Gherkin syntax using a predefined ontology. A vision-language agent analyzes scenarios, and a search agent retrieves suitable MRs from a RAG-based database to generate follow-up cases via computer vision. Experiments show that AutoMT achieves up to 5 x higher test diversity in follow-up case generation compared to the best baseline (manual expert-defined MRs) in terms of validation rate, and detects up to 20.55% more behavioral violations. While manual MT relies on a fixed set of predefined rules, AutoMT automatically extracts diverse metamorphic relations that augment real-world datasets and help uncover corner cases often missed during in-field testing and data collection. Its modular architecture separating MR extraction, filtering, and test generation supports integration into industrial pipelines and potentially enables simulation-based testing to systematically cover underrepresented or safety-critical scenarios.
Related papers
- Automated structural testing of LLM-based agents: methods, framework, and case studies [0.05254956925594667]
LLM-based agents are rapidly being adopted across diverse domains.<n>Current testing approaches focus on acceptance-level evaluation from the user's perspective.<n>We present methods to enable structural testing of LLM-based agents.
arXiv Detail & Related papers (2026-01-25T11:52:30Z) - AutoML in Cybersecurity: An Empirical Study [0.8703011045028926]
This paper systematically evaluates eight open-source AutoML frameworks across 11 publicly available cybersecurity datasets.<n>Results show substantial performance variability across tools and datasets, with no single solution consistently superior.<n>Key challenges identified include adversarial vulnerability, model drift, and inadequate feature engineering.
arXiv Detail & Related papers (2025-09-28T03:52:46Z) - DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models [60.713908578319256]
We propose Direct Discrepancy Learning (DDL) to optimize the detector with task-oriented knowledge.<n>Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance.<n>MIRAGE samples human-written texts from 10 corpora across 5 text-domains, which are then re-generated or revised using 17 cutting-edge LLMs.
arXiv Detail & Related papers (2025-09-15T10:59:57Z) - On Simulation-Guided LLM-based Code Generation for Safe Autonomous Driving Software [0.577182115743694]
Automated Driving System (ADS) is a safety-critical software system responsible for the interpretation of the vehicle's environment.<n>Development of ADS requires rigorous processes to verify, validate, assess, and qualify the code before it can be deployed in the vehicle.<n>This study developed and evaluated a prototype for automatic code generation and assessment.
arXiv Detail & Related papers (2025-04-02T21:35:11Z) - SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models [63.71984266104757]
We propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge.<n>To explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic.<n>Our Multimodal Retrieval-Augmented Generation model leverages video, control signals, and environmental attributes to learn from past driving experiences.
arXiv Detail & Related papers (2025-02-28T21:53:47Z) - AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z) - AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML [56.565200973244146]
Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline.<n>Recent works have started exploiting large language models (LLM) to lessen such burden.<n>This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML.
arXiv Detail & Related papers (2024-10-03T20:01:09Z) - AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning [54.47116888545878]
AutoAct is an automatic agent learning framework for QA.
It does not rely on large-scale annotated data and synthetic planning trajectories from closed-source models.
arXiv Detail & Related papers (2024-01-10T16:57:24Z) - DARTH: Holistic Test-time Adaptation for Multiple Object Tracking [87.72019733473562]
Multiple object tracking (MOT) is a fundamental component of perception systems for autonomous driving.
Despite the urge of safety in driving systems, no solution to the MOT adaptation problem to domain shift in test-time conditions has ever been proposed.
We introduce DARTH, a holistic test-time adaptation framework for MOT.
arXiv Detail & Related papers (2023-10-03T10:10:42Z) - Towards a Complete Metamorphic Testing Pipeline [56.75969180129005]
Metamorphic Testing (MT) addresses the test oracle problem by examining the relationships between input-output pairs in consecutive executions of the System Under Test (SUT)
These relations, known as Metamorphic Relations (MRs), specify the expected output changes resulting from specific input changes.
Our research aims to develop methods and tools that assist testers in generating MRs, defining constraints, and providing explainability for MR outcomes.
arXiv Detail & Related papers (2023-09-30T10:49:22Z) - TARGET: Automated Scenario Generation from Traffic Rules for Testing Autonomous Vehicles via Validated LLM-Guided Knowledge Extraction [8.029974249105443]
TARGET is an end-to-end framework that automatically generates test scenarios from traffic rules.<n>We leverage a Large Language Model (LLM) to extract knowledge from traffic rules.<n>TARGET synthesizes executable scripts to render scenarios in simulation.
arXiv Detail & Related papers (2023-05-10T10:04:08Z) - AutoEn: An AutoML method based on ensembles of predefined Machine
Learning pipelines for supervised Traffic Forecasting [1.6242924916178283]
Traffic Forecasting (TF) is gaining relevance due to its ability to mitigate traffic congestion by forecasting future traffic states.
TF poses one big challenge to the Machine Learning paradigm, known as the Model Selection Problem (MSP)
We introduce AutoEn, which is a simple and efficient method for automatically generating multi-classifier ensembles from a predefined set of ML pipelines.
arXiv Detail & Related papers (2023-03-19T18:37:18Z) - Self-Supervised Representation Learning from Temporal Ordering of
Automated Driving Sequences [49.91741677556553]
We propose TempO, a temporal ordering pretext task for pre-training region-level feature representations for perception tasks.
We embed each frame by an unordered set of proposal feature vectors, a representation that is natural for object detection or tracking systems.
Extensive evaluations on the BDD100K, nuImages, and MOT17 datasets show that our TempO pre-training approach outperforms single-frame self-supervised learning methods.
arXiv Detail & Related papers (2023-02-17T18:18:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.