Vision Transformers for End-to-End Quark-Gluon Jet Classification from Calorimeter Images
- URL: http://arxiv.org/abs/2506.14934v1
- Date: Tue, 17 Jun 2025 19:32:04 GMT
- Title: Vision Transformers for End-to-End Quark-Gluon Jet Classification from Calorimeter Images
- Authors: Md Abrar Jahin, Shahriar Soudeep, Arian Rahman Aditta, M. F. Mridha, Nafiz Fahad, Md. Jakir Hossen
- Abstract summary: Vision Transformer (ViT) architectures are renowned for modeling global contextual information. ViT-based models consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy. This work establishes the first systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distinguishing between quark- and gluon-initiated jets is a critical and challenging task in high-energy physics, pivotal for improving new physics searches and precision measurements at the Large Hadron Collider. While deep learning, particularly Convolutional Neural Networks (CNNs), has advanced jet tagging using image-based representations, the potential of Vision Transformer (ViT) architectures, renowned for modeling global contextual information, remains largely underexplored for direct calorimeter image analysis, especially under realistic detector and pileup conditions. This paper presents a systematic evaluation of ViTs and ViT-CNN hybrid models for quark-gluon jet classification using simulated 2012 CMS Open Data. We construct multi-channel jet-view images from detector-level energy deposits (ECAL, HCAL) and reconstructed tracks, enabling an end-to-end learning approach. Our comprehensive benchmarking demonstrates that ViT-based models, notably ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy, highlighting the advantage of capturing long-range spatial correlations within jet substructure. This work establishes the first systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification using public collider data, alongside a structured dataset suitable for further deep learning research in this domain.
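The following is a minimal sketch of how such multi-channel jet-view images can be rasterized from detector-level deposits; the grid size, window, and per-image normalization are illustrative assumptions, not the authors' exact preprocessing.

```python
# Hypothetical sketch: build a 3-channel (ECAL, HCAL, tracks) jet-view image
# on an eta-phi grid centred on the jet axis. All names and the 125x125 grid
# are assumptions for illustration.
import numpy as np

def jet_image(eta, phi, energy, jet_eta, jet_phi, size=125, half_width=1.25):
    """Histogram one channel's deposits into a (size, size) grid around the jet axis."""
    edges = np.linspace(-half_width, half_width, size + 1)
    img, _, _ = np.histogram2d(eta - jet_eta, phi - jet_phi,
                               bins=(edges, edges), weights=energy)
    return img

def stack_channels(ecal, hcal, tracks, jet_axis):
    """Stack the three detector views into a (3, H, W) array for an image backbone."""
    channels = [jet_image(*chan, *jet_axis) for chan in (ecal, hcal, tracks)]
    x = np.stack(channels)                       # shape (3, 125, 125)
    return x / (x.max() + 1e-8)                  # simple per-image normalization (assumed)
```

An array of this shape can then be fed to any image classifier, e.g. the patch-embedding layer of a ViT or a ViT-CNN hybrid.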
Related papers
- Pan-Arctic Permafrost Landform and Human-built Infrastructure Feature Detection with Vision Transformers and Location Embeddings [1.2895931807247418]
Vision Transformers (ViTs) offer advantages in capturing long-range dependencies and global context via attention mechanisms. ViTs support pretraining via self-supervised learning, addressing the common limitation of labeled data in Arctic feature detection. This work investigates: (1) the suitability of pre-trained ViTs as feature extractors for high-resolution Arctic remote sensing tasks, and (2) the benefit of combining image and location embeddings.
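As a rough illustration of point (2), image and location embeddings can be fused by concatenation before the classification head; the sinusoidal encoding and dimensions below are assumptions for illustration, not this paper's implementation.

```python
# Hypothetical sketch: combine a ViT image feature with a lat/lon embedding.
import math
import torch
import torch.nn as nn

def location_embedding(lat_lon: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Encode (batch, 2) lat/lon in degrees with multi-frequency sine/cosine features."""
    freqs = 2.0 ** torch.arange(dim // 4, dtype=torch.float32)   # (dim/4,)
    angles = lat_lon.unsqueeze(-1) * freqs * math.pi / 180.0     # (batch, 2, dim/4)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)        # (batch, 2, dim/2)
    return emb.flatten(1)                                        # (batch, dim)

class ImageLocationClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, n_classes: int, loc_dim: int = 64):
        super().__init__()
        self.backbone = backbone                 # e.g., a frozen pre-trained ViT encoder
        self.head = nn.Linear(feat_dim + loc_dim, n_classes)

    def forward(self, images, lat_lon):
        img_feat = self.backbone(images)         # (batch, feat_dim) pooled feature
        return self.head(torch.cat([img_feat, location_embedding(lat_lon)], dim=1))
```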
arXiv Detail & Related papers (2025-06-03T13:34:01Z)
- Your ViT is Secretly an Image Segmentation Model [50.71238842539735]
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. We show that inductive biases introduced by task-specific components can instead be learned by the ViT itself. We introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation.
arXiv Detail & Related papers (2025-03-24T19:56:02Z)
- Spectral-Enhanced Transformers: Leveraging Large-Scale Pretrained Models for Hyperspectral Object Tracking [35.34526230299484]
This paper proposes an effective methodology that adapts transformer-based foundation models for hyperspectral object tracking. We propose an adaptive, learnable spatial-spectral token fusion module that can be extended to any transformer-based backbone.
arXiv Detail & Related papers (2025-02-26T01:46:21Z)
- On Vision Transformers for Classification Tasks in Side-Scan Sonar Imagery [0.0]
Side-scan sonar (SSS) imagery presents unique challenges in the classification of man-made objects on the seafloor.
This paper rigorously compares the performance of ViT models alongside commonly used CNN architectures for binary classification tasks in SSS imagery.
ViT-based models exhibit superior classification performance across F1-score, precision, recall, and accuracy metrics.
arXiv Detail & Related papers (2024-09-18T14:36:50Z)
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer architecture.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
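For intuition, a CWT maps a 1D signal to exactly such a 2D scale-by-time tensor; the self-contained sketch below uses a Ricker wavelet, while the wavelet and scales used in TCCT-Net may differ.

```python
# Hypothetical sketch: turn a 1D behavioral signal into a 2D scale-time tensor.
import numpy as np

def ricker(points: int, width: float) -> np.ndarray:
    """Ricker (Mexican-hat) wavelet sampled at `points` positions."""
    t = np.arange(points) - (points - 1) / 2.0
    amp = 2.0 / (np.sqrt(3.0 * width) * np.pi ** 0.25)
    return amp * (1.0 - (t / width) ** 2) * np.exp(-(t ** 2) / (2.0 * width ** 2))

def cwt_image(signal: np.ndarray, widths) -> np.ndarray:
    """One row per scale: convolve the signal with a wavelet of that width."""
    rows = [np.convolve(signal, ricker(int(min(10 * w, len(signal))), w), mode="same")
            for w in widths]
    return np.stack(rows)                        # shape (n_scales, len(signal))

# Example: a chirp-like signal becomes a (32, 512) tensor an image model can consume.
t = np.linspace(0.0, 1.0, 512)
x = np.sin(2 * np.pi * (5 + 20 * t) * t)
tensor_2d = cwt_image(x, widths=np.arange(1, 33))
```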
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- HEAL-ViT: Vision Transformers on a spherical mesh for medium-range weather forecasting [0.14504054468850663]
We present HEAL-ViT, a novel architecture that uses ViT models on a spherical mesh.
HEAL-ViT produces weather forecasts that outperform the ECMWF IFS on key metrics.
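The name suggests a HEALPix-style equal-area mesh; as an illustration (an assumption, since the summary does not spell out the mesh construction), a lat/lon weather field could be regridded onto HEALPix pixels before tokenization:

```python
# Hypothetical sketch: average a scattered lat/lon field onto HEALPix pixels.
import numpy as np
import healpy as hp  # assumes the healpy package is installed

def to_healpix(lat_deg, lon_deg, values, nside=32):
    """Bin lat/lon samples into equal-area HEALPix cells and average per cell."""
    theta = np.deg2rad(90.0 - lat_deg)           # colatitude in radians
    phi = np.deg2rad(lon_deg % 360.0)
    pix = hp.ang2pix(nside, theta, phi)
    npix = hp.nside2npix(nside)
    sums = np.bincount(pix, weights=values, minlength=npix)
    counts = np.maximum(np.bincount(pix, minlength=npix), 1)
    return sums / counts                         # one value per mesh cell, ready to tokenize
```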
arXiv Detail & Related papers (2024-02-14T22:10:52Z)
- Hypergraph Transformer for Semi-Supervised Classification [50.92027313775934]
We propose a novel hypergraph learning framework, HyperGraph Transformer (HyperGT).
HyperGT uses a Transformer-based neural network architecture to effectively consider global correlations among all nodes and hyperedges.
It achieves comprehensive hypergraph representation learning by effectively incorporating global interactions while preserving local connectivity patterns.
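One minimal way to realize global correlations among all nodes and hyperedges is to treat both as tokens of a single attention layer; the sketch below makes that assumption and is not necessarily HyperGT's exact design.

```python
# Hypothetical sketch: joint self-attention over node and hyperedge tokens.
import torch
import torch.nn as nn

class NodeHyperedgeAttention(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, node_feats, hyperedge_feats):
        # Concatenate both token sets so every node attends to every node
        # and every hyperedge in one global pass.
        tokens = torch.cat([node_feats, hyperedge_feats], dim=1)  # (batch, n+m, dim)
        out, _ = self.attn(tokens, tokens, tokens)
        n = node_feats.shape[1]
        return out[:, :n], out[:, n:]             # updated node / hyperedge features
```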
arXiv Detail & Related papers (2023-12-18T17:50:52Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
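A common way to realize such distance-based weighting is to bias the attention logits by pairwise spatial distance, as in the sketch below; the penalty form and its strength are assumptions, not necessarily the paper's exact DWT block.

```python
# Hypothetical sketch: self-attention whose logits are penalized by distance.
import torch

def distance_weighted_attention(q, k, v, coords, lam: float = 0.1):
    """
    q, k, v : (batch, n, d) token projections
    coords  : (n, 2) patch coordinates of the tokens
    lam     : strength of the distance penalty (assumed hyperparameter)
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5          # (batch, n, n)
    dist = torch.cdist(coords, coords)                   # (n, n) Euclidean distances
    weights = torch.softmax(logits - lam * dist, dim=-1) # nearer tokens weigh more
    return weights @ v
```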
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- Improved particle-flow event reconstruction with scalable neural networks for current and future particle detectors [1.4609888393206634]
We study scalable machine learning models for event reconstruction in electron-positron collisions based on a full detector simulation.
We compare a graph neural network and kernel-based transformer and demonstrate that we can avoid quadratically scaling operations while achieving realistic reconstruction.
The best graph neural network model shows improvement in the jet transverse momentum resolution by up to 50% compared to the rule-based algorithm.
arXiv Detail & Related papers (2023-09-13T08:16:15Z)
- Equivariant Graph Neural Networks for Charged Particle Tracking [1.6626046865692057]
EuclidNet is a novel symmetry-equivariant GNN for charged particle tracking.
We benchmark it against the state-of-the-art Interaction Network on the TrackML dataset.
Our results show that EuclidNet achieves near-state-of-the-art performance at small model scales.
arXiv Detail & Related papers (2023-04-11T15:43:32Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video.
TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy.
The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
arXiv Detail & Related papers (2021-04-01T01:49:05Z)