Aspen Open Jets: Unlocking LHC Data for Foundation Models in Particle Physics
- URL: http://arxiv.org/abs/2412.10504v1
- Date: Fri, 13 Dec 2024 19:00:03 GMT
- Title: Aspen Open Jets: Unlocking LHC Data for Foundation Models in Particle Physics
- Authors: Oz Amram, Luca Anzalone, Joschka Birk, Darius A. Faroughy, Anna Hallin, Gregor Kasieczka, Michael Krämer, Ian Pang, Humberto Reyes-Gonzalez, David Shih,
- Abstract summary: We introduce the AspenOpenJets dataset, consisting of approximately 180M high $p_T$ jets derived from CMS 2016 Open Data.
We show how pre-training the OmniJet-$\alpha$ foundation model on AspenOpenJets improves performance on generative tasks with significant domain shift.
In addition to demonstrating the power of pre-training of a jet-based foundation model on actual proton-proton collision data, we provide the ML-ready derived AspenOpenJets dataset for further public use.
- Abstract: Foundation models are deep learning models pre-trained on large amounts of data which are capable of generalizing to multiple datasets and/or downstream tasks. This work demonstrates how data collected by the CMS experiment at the Large Hadron Collider can be useful in pre-training foundation models for HEP. Specifically, we introduce the AspenOpenJets dataset, consisting of approximately 180M high $p_T$ jets derived from CMS 2016 Open Data. We show how pre-training the OmniJet-$\alpha$ foundation model on AspenOpenJets improves performance on generative tasks with significant domain shift: generating boosted top and QCD jets from the simulated JetClass dataset. In addition to demonstrating the power of pre-training of a jet-based foundation model on actual proton-proton collision data, we provide the ML-ready derived AspenOpenJets dataset for further public use.
Related papers
- Using Federated Machine Learning in Predictive Maintenance of Jet Engines [0.0]
This paper aims to predict the Remaining Useful Life (RUL) of turbine jet engines using a federated machine learning framework.
The system aims to capture complex computation and patterns in the engine data to enhance the accuracy of RUL predictions.
arXiv Detail & Related papers (2025-02-07T20:41:36Z) - HEP-JEPA: A foundation model for collider physics using joint embedding predictive architecture [0.0]
We present a transformer architecture-based foundation model for tasks at high-energy particle colliders.
We train the model to classify jets using a self-supervised strategy inspired by the Joint Embedding Predictive Architecture.
Our model fares well with other datasets for standard classification benchmark tasks.
arXiv Detail & Related papers (2025-02-06T10:16:27Z) - DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data [61.62554324594797]
We propose DreamMask, which explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data.
In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods.
For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
arXiv Detail & Related papers (2025-01-03T19:00:00Z) - Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z) - Flow Matching Beyond Kinematics: Generating Jets with Particle-ID and Trajectory Displacement Information [0.0]
We introduce the first generative model trained on the JetClass dataset.
Our model generates jets at the constituent level, and it is a permutation-equivariant continuous normalizing flow (CNF) trained with the flow matching technique.
For the first time, we also introduce a generative model that goes beyond the kinematic features of jet constituents.
arXiv Detail & Related papers (2023-11-30T19:00:02Z) - Pre-training on Synthetic Driving Data for Trajectory Prediction [61.520225216107306]
We propose a pipeline-level solution to mitigate the issue of data scarcity in trajectory forecasting.
We adopt HD map augmentation and trajectory synthesis for generating driving data, and then we learn representations by pre-training on them.
We conduct extensive experiments to demonstrate the effectiveness of our data expansion and pre-training strategies.
arXiv Detail & Related papers (2023-09-18T19:49:22Z) - Machine Learning Force Fields with Data Cost Aware Training [94.78998399180519]
Machine learning force fields (MLFF) have been proposed to accelerate molecular dynamics (MD) simulation.
Even for the most data-efficient MLFFs, reaching chemical accuracy can require hundreds of frames of force and energy labels.
We propose a multi-stage computational framework -- ASTEROID, which lowers the data cost of MLFFs by leveraging a combination of cheap inaccurate data and expensive accurate data.
arXiv Detail & Related papers (2023-06-05T04:34:54Z) - Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We make empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z) - Towards Efficient Task-Driven Model Reprogramming with Foundation Models [52.411508216448716]
Vision foundation models exhibit impressive power, benefiting from the extremely large model capacity and broad training data.
However, in practice, downstream scenarios may only support a small model due to the limited computational resources or efficiency considerations.
This brings a critical challenge for the real-world application of foundation models: one has to transfer the knowledge of a foundation model to the downstream task.
arXiv Detail & Related papers (2023-04-05T07:28:33Z) - pmuBAGE: The Benchmarking Assortment of Generated PMU Data for Power System Events -- Part I: Overview and Results [2.4775353203585797]
We present pmuGE (phasor measurement unit Generator of Events), one of the first data-driven generative models for power system event data.
We have trained this model on thousands of actual events and created a dataset denoted pmuBAGE.
The dataset consists of almost 1000 instances of labeled event data to encourage benchmark evaluations on phasor measurement unit (PMU) data analytics.
arXiv Detail & Related papers (2022-04-03T15:30:08Z) - Bridge Data Center AI Systems with Edge Computing for Actionable Information Retrieval [0.5652468989804973]
High data rates at modern synchrotron and X-ray free-electron lasers motivate the use of machine learning methods for data reduction, feature detection, and other purposes.
We describe here how specialized data center AI systems can be used for this purpose.
arXiv Detail & Related papers (2021-05-28T16:47:01Z)