ColliderML: The First Release of an OpenDataDetector High-Luminosity Physics Benchmark Dataset
- URL: http://arxiv.org/abs/2512.15230v1
- Date: Wed, 17 Dec 2025 09:30:44 GMT
- Title: ColliderML: The First Release of an OpenDataDetector High-Luminosity Physics Benchmark Dataset
- Authors: Doğa Elitez, Paul Gessinger, Daniel Murnane, Marcus Selchou Raaholt, Andreas Salzburger, Stine Kofoed Skov, Andreas Stefl, Anna Zaborowska
- Abstract summary: ColliderML is a dataset of fully simulated and digitised proton-proton collisions in High-Luminosity Large Hadron Collider conditions. It provides one million events across ten Standard Model and Beyond Standard Model processes, plus extensive single-particle samples. The release fills a major gap for machine learning (ML) research on detector-level data, provided on the ML-friendly Hugging Face platform.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce ColliderML - a large, open, experiment-agnostic dataset of fully simulated and digitised proton-proton collisions in High-Luminosity Large Hadron Collider conditions ($\sqrt{s}=14$ TeV, mean pile-up $μ= 200$). ColliderML provides one million events across ten Standard Model and Beyond Standard Model processes, plus extensive single-particle samples, all produced with modern next-to-leading order matrix element calculation and showering, realistic per-event pile-up overlay, a validated OpenDataDetector geometry, and standard reconstructions. The release fills a major gap for machine learning (ML) research on detector-level data, provided on the ML-friendly Hugging Face platform. We present physics coverage and the generation, simulation, digitisation and reconstruction pipeline, describe format and access, and report initial collider physics benchmarks.
Related papers
- DexVLG: Dexterous Vision-Language-Grasp Model at Scale [59.5613919093295]
There is little research on functional grasping with large models for human-like dexterous hands. We introduce DexVLG, a large Vision-Language-Grasp model for Dexterous grasp pose prediction aligned with language instructions. We generate a dataset of 170 million dexterous grasp poses mapped to semantic parts across 174,000 objects in simulation, paired with detailed part-level captions.
arXiv Detail & Related papers (2025-07-03T16:05:25Z)
- Mic-hackathon 2024: Hackathon on Machine Learning for Electron and Scanning Probe Microscopy [54.24356756795849]
Microscopy is a primary source of information on materials structure and functionality at nanometer and atomic scales. The adoption of Data Management Plans (DMPs) by major funding agencies promotes preservation and access. Deriving insights remains difficult due to the lack of standardized code ecosystems, benchmarks, and integration strategies.
arXiv Detail & Related papers (2025-06-10T03:54:36Z)
- Fine-tuning machine-learned particle-flow reconstruction for new detector geometries in future colliders [1.988691274281547]
We demonstrate transfer learning capabilities in a machine-learned algorithm trained for particle-flow reconstruction in high energy particle colliders. To our knowledge, this is the first full-simulation cross-detector transfer learning study for particle-flow reconstruction.
arXiv Detail & Related papers (2025-02-28T19:16:01Z)
- DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data [61.62554324594797]
We propose DreamMask, which explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data. In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods. For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
arXiv Detail & Related papers (2025-01-03T19:00:00Z)
- Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation [51.20656279478878]
MATRIX is a multi-agent simulator that automatically generates diverse text-based scenarios. We introduce MATRIX-Gen for controllable and highly realistic data synthesis. On AlpacaEval 2 and Arena-Hard benchmarks, Llama-3-8B-Base, post-trained on datasets synthesized by MATRIX-Gen with just 20K instruction-response pairs, outperforms Meta's Llama-3-8B-Instruct model.
arXiv Detail & Related papers (2024-10-18T08:01:39Z)
- Diffusion posterior sampling for simulation-based inference in tall data settings [53.17563688225137]
Simulation-based inference (SBI) is capable of approximating the posterior distribution that relates input parameters to a given observation.
In this work, we consider a tall data extension in which multiple observations are available to better infer the parameters of the model.
We compare our method to recently proposed competing approaches on various numerical experiments and demonstrate its superiority in terms of numerical stability and computational cost.
arXiv Detail & Related papers (2024-04-11T09:23:36Z)
- The LHCb ultra-fast simulation option, Lamarr: design and validation [0.46369270610100627]
Detailed detector simulation is the major consumer of CPU resources at LHCb.
Lamarr is a Gaudi-based framework designed to offer the fastest solution for the simulation of the LHCb detector.
arXiv Detail & Related papers (2023-09-22T23:21:27Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weakness of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- Lamarr: LHCb ultra-fast simulation based on machine learning models deployed within Gauss [0.0]
We discuss Lamarr, a framework to speed up simulation production by parameterizing both the detector response and the reconstruction algorithms of the LHCb experiment.
Deep Generative Models powered by several algorithms and strategies are employed to effectively parameterize the high-level response of the single components of the LHCb detector.
arXiv Detail & Related papers (2023-03-20T20:18:04Z)
- pmuBAGE: The Benchmarking Assortment of Generated PMU Data for Power System Events -- Part I: Overview and Results [2.4775353203585797]
We present pmuGE (phasor measurement unit Generator of Events), one of the first data-driven generative models for power system event data.
We have trained this model on thousands of actual events and created a dataset denoted pmuBAGE.
The dataset consists of almost 1000 instances of labeled event data to encourage benchmark evaluations on phasor measurement unit (PMU) data analytics.
arXiv Detail & Related papers (2022-04-03T15:30:08Z)
- MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z)
- The Dark Machines Anomaly Score Challenge: Benchmark Data and Model Independent Event Classification for the Large Hadron Collider [0.0]
We describe the outcome of a data challenge conducted as part of the Dark Machines Initiative and the Les Houches 2019 workshop on Physics at TeV colliders.
The challenge aims at detecting signals of new physics at the LHC using unsupervised machine learning algorithms.
arXiv Detail & Related papers (2021-05-28T18:00:02Z)