MLPerf Automotive
- URL: http://arxiv.org/abs/2510.27065v1
- Date: Fri, 31 Oct 2025 00:28:14 GMT
- Title: MLPerf Automotive
- Authors: Radoyeh Shojaei, Predrag Djurdjevic, Mostafa El-Khamy, James Goel, Kasper Mecklenburg, John Owens, Pınar Muyan-Özçelik, Tom St. John, Jinho Suh, Arjun Suresh
- Abstract summary: This benchmark addresses the need for standardized performance evaluation methodologies in automotive machine learning systems. Our benchmarking framework provides latency and accuracy metrics along with evaluation protocols. We describe the methodology behind the benchmark design, including the task selection, reference models, and submission rules.
- Score: 3.096098336940615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MLPerf Automotive, the first standardized public benchmark for evaluating Machine Learning systems that are deployed for AI acceleration in automotive systems. Developed through a collaborative partnership between MLCommons and the Autonomous Vehicle Computing Consortium, this benchmark addresses the need for standardized performance evaluation methodologies in automotive machine learning systems. Existing benchmark suites cannot be used for these systems since automotive workloads have unique constraints, including safety and real-time processing, that distinguish them from the domains targeted by existing benchmarks. Our benchmarking framework provides latency and accuracy metrics along with evaluation protocols that enable consistent and reproducible performance comparisons across different hardware platforms and software implementations. The first iteration of the benchmark consists of automotive perception tasks in 2D object detection, 2D semantic segmentation, and 3D object detection. We describe the methodology behind the benchmark design, including the task selection, reference models, and submission rules. We also discuss the first round of benchmark submissions, the challenges involved in acquiring the datasets, and the engineering efforts to develop the reference implementations. Our benchmark code is available at https://github.com/mlcommons/mlperf_automotive.
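To make the latency-plus-accuracy protocol described in the abstract concrete, here is a minimal, hypothetical Python sketch of a single-stream measurement loop. This is an illustration only, not the MLPerf Automotive harness: the `run` callable, the warm-up count, and the percentile summary are assumptions introduced for this example.

```python
import time
import statistics

def benchmark(run, samples, warmup=10):
    """Minimal single-stream latency/accuracy loop (illustrative only;
    not the MLPerf Automotive harness)."""
    # Warm up so one-time costs (JIT compilation, cache fills) do not
    # skew the measured latencies.
    for x, _ in samples[:warmup]:
        run(x)

    latencies = []
    correct = 0
    for x, label in samples:
        start = time.perf_counter()
        pred = run(x)
        latencies.append(time.perf_counter() - start)
        correct += int(pred == label)

    latencies.sort()
    return {
        "p50_ms": 1000 * statistics.median(latencies),
        # Tail latency is what real-time automotive constraints stress.
        "p99_ms": 1000 * latencies[int(0.99 * (len(latencies) - 1))],
        "accuracy": correct / len(samples),
    }

# Example usage with a trivial stand-in model:
if __name__ == "__main__":
    data = [(i, i % 2) for i in range(1000)]
    print(benchmark(lambda x: x % 2, data))
```

The actual benchmark additionally fixes scenarios, datasets, and submission rules; see the repository linked in the abstract for the reference implementations.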
Related papers
- Easy Data Unlearning Bench [53.1304932656586]
We introduce a unified benchmarking suite that simplifies the evaluation of unlearning algorithms. By standardizing setup and metrics, it enables reproducible, scalable, and fair comparison across unlearning methods.
arXiv Detail & Related papers (2026-02-18T12:20:32Z) - Uncovering Competency Gaps in Large Language Models and Their Benchmarks [11.572508874955659]
We propose a new method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. We found that models consistently underperformed on concepts that stand in contrast to sycophantic behaviors. Our method offers a representation-grounded approach to evaluation, enabling concept-level decomposition of benchmark scores.
arXiv Detail & Related papers (2025-12-06T17:39:47Z) - Detect Anything via Next Point Prediction [51.55967987350882]
Rex-Omni is a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models.
arXiv Detail & Related papers (2025-10-14T17:59:54Z) - TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents [17.296425855109426]
We introduce TimeSeriesGym, a scalable benchmarking framework for evaluating Artificial Intelligence (AI) agents. TimeSeriesGym incorporates challenges from diverse sources spanning multiple domains and tasks. We implement evaluation mechanisms for multiple research artifacts, including submission files, code, and models.
arXiv Detail & Related papers (2025-05-19T16:11:23Z) - AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results [55.33807002543901]
We present AIvaluateXR, a comprehensive evaluation framework for benchmarking large language models (LLMs) running on XR devices. We deploy 17 selected LLMs across four XR platforms: Magic Leap 2, Meta Quest 3, Vivo X100s Pro, and Apple Vision Pro, and conduct an extensive evaluation. We propose a unified evaluation method based on the 3D Optimality theory to select the optimal device-model pairs from quality and speed objectives.
arXiv Detail & Related papers (2025-02-13T20:55:48Z) - Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation [160.07938471250048]
Interpretability and efficiency are two important considerations for the adoption of neural automatic metrics.
We develop strong-performing automatic metrics for reference-based summarization evaluation.
arXiv Detail & Related papers (2023-03-07T02:49:50Z) - SynMotor: A Benchmark Suite for Object Attribute Regression and Multi-task Learning [0.0]
This benchmark can be used for computer vision tasks including 2D/3D detection, classification, segmentation, and multi-attribute learning.
Most attributes of the motors are quantified as continuously variable rather than binary, which makes our benchmark well-suited for the less explored regression tasks.
arXiv Detail & Related papers (2023-01-11T18:27:29Z) - PDEBENCH: An Extensive Benchmark for Scientific Machine Learning [20.036987098901644]
We introduce PDEBench, a benchmark suite of time-dependent simulation tasks based on Partial Differential Equations (PDEs).
PDEBench comprises both code and data to benchmark the performance of novel machine learning models against both classical numerical simulations and machine learning baselines.
arXiv Detail & Related papers (2022-10-13T17:03:36Z) - BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video [58.71785546245467]
Multiple existing benchmarks involve tracking and segmenting objects in video.
There is little interaction between them due to the use of disparate benchmark datasets and metrics.
We propose BURST, a dataset which contains thousands of diverse videos with high-quality object masks.
All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison.
arXiv Detail & Related papers (2022-09-25T01:27:35Z) - SegmentMeIfYouCan: A Benchmark for Anomaly Segmentation [111.61261419566908]
Deep neural networks (DNNs) are usually trained on a closed set of semantic classes.
They are ill-equipped to handle previously-unseen objects.
Detecting and localizing such objects is crucial for safety-critical applications such as perception for automated driving.
arXiv Detail & Related papers (2021-04-30T07:58:19Z) - Exploring and Analyzing Machine Commonsense Benchmarks [0.13999481573773073]
We argue that the lack of a common vocabulary for aligning these approaches' metadata limits researchers in their efforts to understand systems' deficiencies.
We describe our initial MCS Benchmark Ontology, a common vocabulary that formalizes benchmark metadata.
arXiv Detail & Related papers (2020-12-21T19:01:55Z) - MOTChallenge: A Benchmark for Single-Camera Multiple Target Tracking [72.76685780516371]
We present MOTChallenge, a benchmark for single-camera Multiple Object Tracking (MOT).
The benchmark is focused on multiple people tracking, since pedestrians are by far the most studied object in the tracking community.
We provide a categorization of state-of-the-art trackers and a broad error analysis.
arXiv Detail & Related papers (2020-10-15T06:52:16Z) - AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite [26.820244556465333]
This paper proposes an agile domain-specific benchmarking methodology.
We identify ten important end-to-end application scenarios, from which sixteen representative AI tasks are distilled as the AI component benchmarks.
We present the first end-to-end Internet service AI benchmark.
arXiv Detail & Related papers (2020-02-17T07:29:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.