Fine-Tuning Vision-Language Models for Neutrino Event Analysis in High-Energy Physics Experiments
- URL: http://arxiv.org/abs/2508.19376v1
- Date: Tue, 26 Aug 2025 19:12:28 GMT
- Title: Fine-Tuning Vision-Language Models for Neutrino Event Analysis in High-Energy Physics Experiments
- Authors: Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi
- Abstract summary: We fine-tune a Vision-Language Model (VLM) to classify neutrino interactions from pixelated detector images in high-energy physics experiments. We benchmark its performance against an established CNN baseline used in experiments like NOvA and DUNE, evaluating metrics such as classification accuracy, precision, recall, and AUC-ROC. Our results show that the VLM not only matches or exceeds CNN performance but also enables richer reasoning and better integration of auxiliary textual or semantic context.
- Score: 41.33501105382656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in large language models (LLMs) has shown strong potential for multimodal reasoning beyond natural language. In this work, we explore the use of a fine-tuned Vision-Language Model (VLM), based on LLaMA 3.2, for classifying neutrino interactions from pixelated detector images in high-energy physics (HEP) experiments. We benchmark its performance against an established CNN baseline used in experiments like NOvA and DUNE, evaluating metrics such as classification accuracy, precision, recall, and AUC-ROC. Our results show that the VLM not only matches or exceeds CNN performance but also enables richer reasoning and better integration of auxiliary textual or semantic context. These findings suggest that VLMs offer a promising general-purpose backbone for event classification in HEP, paving the way for multimodal approaches in experimental neutrino physics.
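As a concrete illustration of the evaluation protocol the abstract describes, here is a minimal sketch of computing the four named metrics for a VLM and a CNN classifier with scikit-learn. The label and score arrays are random placeholders, not data from the paper.

```python
# Minimal sketch of the benchmark metrics named in the abstract
# (accuracy, precision, recall, AUC-ROC). The labels and scores
# below are hypothetical placeholders, not data from the paper.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)      # ground-truth interaction class
vlm_scores = rng.random(1000)               # stand-in VLM class probabilities
cnn_scores = rng.random(1000)               # stand-in CNN class probabilities

for name, scores in [("VLM", vlm_scores), ("CNN", cnn_scores)]:
    y_pred = (scores >= 0.5).astype(int)    # threshold probabilities at 0.5
    print(f"{name}: acc={accuracy_score(y_true, y_pred):.3f} "
          f"prec={precision_score(y_true, y_pred):.3f} "
          f"rec={recall_score(y_true, y_pred):.3f} "
          f"auc={roc_auc_score(y_true, scores):.3f}")
```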
Related papers
- Beyond Language Modeling: An Exploration of Multimodal Pretraining [125.34714978184638]
We provide empirical clarity through controlled, from-scratch pretraining experiments. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision. We demonstrate that the MoE architecture harmonizes the scaling asymmetry between the two modalities by providing the high model capacity required by language.
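A minimal sketch of the mixed objective the summary describes: a next-token cross-entropy term on text plus a denoising MSE term on image latents, combined as a weighted sum. All tensors, shapes, and the loss weight are illustrative assumptions, not the Transfusion implementation.

```python
# Sketch of a Transfusion-style mixed objective: next-token
# cross-entropy on text plus a denoising (diffusion) MSE on images.
# All tensors here are random placeholders with illustrative shapes.
import torch
import torch.nn.functional as F

vocab, seq_len, latent_dim = 32000, 128, 64

text_logits = torch.randn(seq_len, vocab)    # model outputs at text positions
text_targets = torch.randint(0, vocab, (seq_len,))
lm_loss = F.cross_entropy(text_logits, text_targets)

clean_latents = torch.randn(16, latent_dim)  # image patch latents
noise = torch.randn_like(clean_latents)
noisy_latents = clean_latents + noise        # a real scheduler would scale by timestep
pred_noise = torch.randn(16, latent_dim)     # stand-in for the model's noise prediction
diff_loss = F.mse_loss(pred_noise, noise)

loss = lm_loss + 0.5 * diff_loss             # weighted sum; the weight is a free choice
print(float(loss))
```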
arXiv Detail & Related papers (2026-03-03T18:58:00Z)
- Rethinking Visual Intelligence: Insights from Video Pretraining [75.32388528274224]
Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems. We investigate Video Diffusion Models (VDMs) as a promising direction for closing this gap on visual tasks.
arXiv Detail & Related papers (2025-10-28T14:12:11Z)
- Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs [0.27998963147546135]
We compare zero-shot and federated fine-tuning of vision-language models (VLMs) with personalized training of a compact 3D convolutional neural network (CNN3D). We evaluate accuracy, calibration, and energy usage under realistic non-IID settings. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios.
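A hedged sketch of the hybrid routing this suggests: run the compact CNN first and escalate to the VLM only when the CNN's confidence falls below a threshold. Both predictors here are hypothetical stand-ins, not the paper's models.

```python
# Sketch of the hybrid routing idea from the summary above: run a
# cheap CNN first and fall back to a VLM only when the CNN is unsure.
# `cnn_predict` and `vlm_predict` are hypothetical stand-ins.
import numpy as np

def cnn_predict(frame: np.ndarray) -> tuple[int, float]:
    """Placeholder compact CNN: returns (label, confidence)."""
    score = float(frame.mean())              # toy confidence proxy
    return int(score > 0.5), abs(score - 0.5) * 2

def vlm_predict(frame: np.ndarray) -> int:
    """Placeholder VLM call, assumed slow but more capable."""
    return int(frame.std() > 0.25)

def classify(frame: np.ndarray, threshold: float = 0.6) -> int:
    label, conf = cnn_predict(frame)         # cheap path first
    if conf >= threshold:
        return label                         # confident: stop here
    return vlm_predict(frame)                # escalate hard cases

print(classify(np.random.default_rng(1).random((16, 16))))
```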
arXiv Detail & Related papers (2025-10-20T15:26:43Z)
- Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics [41.33501105382656]
This work explores the application of Vision-Language Models (VLMs) to the task of identifying neutrino interactions in pixelated detector data from high-energy physics experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in the NOvA and DUNE experiments. We find that VLMs can outperform CNNs, while also providing greater flexibility in integrating auxiliary textual or semantic information and offering more interpretable, reasoning-based predictions.
arXiv Detail & Related papers (2025-09-10T10:07:27Z)
- Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models [5.134872455507186]
This paper presents a systematic evaluation of state-of-the-art multimodal large language models (MLLMs) on intuitive physics tasks. We assess the open-source models InternVL 2.5, Qwen 2.5 VL, LLaVA-OneVision, and the proprietary Gemini 2.0 Flash Thinking. We find that even the latest models struggle to reliably distinguish physically plausible from implausible scenarios.
arXiv Detail & Related papers (2025-07-22T13:24:42Z)
- Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation [54.3628937181904]
Internal world models (WMs) enable agents to understand the world's state and predict transitions. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o and Gemini, exhibit potential as general-purpose WMs.
arXiv Detail & Related papers (2025-06-27T03:24:29Z)
- KFD-NeRF: Rethinking Dynamic NeRF with Kalman Filter [49.85369344101118]
We introduce KFD-NeRF, a novel dynamic neural radiance field integrated with an efficient and high-quality motion reconstruction framework based on Kalman filtering.
Our key idea is to model the dynamic radiance field as a dynamic system whose temporally varying states are estimated based on two sources of knowledge: observations and predictions.
Our KFD-NeRF demonstrates similar or even superior reconstruction performance at comparable computational cost, and achieves state-of-the-art view synthesis quality with thorough training.
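The "observations and predictions" phrasing is the classic Kalman predict/update cycle; a minimal 1-D sketch of that cycle follows. It is purely illustrative and unrelated to the paper's radiance-field state.

```python
# Minimal 1-D Kalman filter illustrating the predict/update cycle the
# summary alludes to (prediction from a motion model, correction from
# observations). Purely illustrative; not the paper's formulation.
import numpy as np

def kalman_1d(zs, q=1e-3, r=1e-1):
    x, p = 0.0, 1.0                  # state estimate and its variance
    out = []
    for z in zs:
        p = p + q                    # predict: propagate uncertainty
        k = p / (p + r)              # gain weighs prediction vs observation
        x = x + k * (z - x)          # update: blend in the measurement
        p = (1 - k) * p
        out.append(x)
    return np.array(out)

zs = np.sin(np.linspace(0, 6, 60)) \
     + 0.1 * np.random.default_rng(2).standard_normal(60)
print(kalman_1d(zs)[-5:])            # smoothed estimates of the last steps
```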
arXiv Detail & Related papers (2024-07-18T05:48:24Z)
- Large Visual-Language Models Are Also Good Classifiers: A Study of In-Context Multimodal Fake News Detection [0.18416014644193068]
This paper first assesses the fake news detection (FND) capabilities of two notable LVLMs, CogVLM and GPT4V, in comparison to a smaller yet adeptly trained CLIP model in a zero-shot context. Next, we integrate standard in-context learning (ICL) with LVLMs, noting improvements in FND performance, though limited in scope and consistency. We introduce the In-context Multimodal Fake News Detection framework.
arXiv Detail & Related papers (2024-07-16T09:28:23Z)
- Less for More: Enhanced Feedback-aligned Mixed LLMs for Molecule Caption Generation and Fine-Grained NLI Evaluation [11.778576032848482]
This work enhances scientific language models by improving their inference and evaluation capabilities with minimal or no additional training. We reveal intriguing insights into the behaviour and suitability of such methods while significantly surpassing state-of-the-art models. We propose a novel atomic-level evaluation method leveraging off-the-shelf Natural Language Inference (NLI) models for use in the unseen chemical domain.
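A minimal sketch of scoring a generated statement against a reference with an off-the-shelf NLI model, in the spirit of the proposed evaluation. The checkpoint name is a common public MNLI model assumed here, not taken from the paper, and the atomic-level decomposition step is omitted.

```python
# Sketch of entailment scoring with an off-the-shelf NLI model. The
# checkpoint is a widely available public MNLI model, not necessarily
# the one used in the paper; atomic decomposition is omitted.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

premise = "The molecule contains a carboxylic acid group."
hypothesis = "This compound has a -COOH functional group."

# text/text_pair input feeds the sentences as a premise-hypothesis pair
result = nli({"text": premise, "text_pair": hypothesis})
print(result)  # e.g. {'label': 'ENTAILMENT', 'score': ...}
```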
arXiv Detail & Related papers (2024-05-22T20:40:53Z)
- Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter- and computation-efficient tuning method for Multi-modal Large Language Models (MLLMs).
The Efficient Attention Skipping (EAS) method evaluates attention redundancy and skips the less important multi-head attention (MHA) layers to speed up inference.
Experiments show that EAS not only retains high performance and parameter efficiency but also greatly accelerates inference.
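A toy sketch of the general idea of skipping low-importance attention layers at inference follows. The per-layer importance scores are invented placeholders, not the paper's redundancy measure.

```python
# Toy sketch of skipping "less important" attention layers at
# inference, in the spirit of EAS. The per-layer importance scores
# are invented placeholders, not the paper's redundancy criterion.
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    def __init__(self, dim: int, importance: float):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.importance = importance         # assumed precomputed offline

    def forward(self, x: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        if self.importance >= threshold:     # run MHA only if it matters
            attn_out, _ = self.attn(x, x, x)
            x = x + attn_out
        return x + self.mlp(x)               # MLP always runs

blocks = [SkippableBlock(64, imp) for imp in (0.9, 0.2, 0.7)]
x = torch.randn(1, 16, 64)
for blk in blocks:
    x = blk(x)                               # the 0.2 block skips its attention
print(x.shape)
```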
arXiv Detail & Related papers (2024-03-22T14:20:34Z)
- Information Bottleneck Analysis of Deep Neural Networks via Lossy Compression [37.69303106863453]
The Information Bottleneck (IB) principle offers an information-theoretic framework for analyzing the training process of deep neural networks (DNNs).
In this paper, we introduce a framework for conducting IB analysis of general NNs.
We also perform IB analysis on a close-to-real-scale network, which reveals new features of the mutual information (MI) dynamics.
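IB analysis hinges on estimating mutual information between inputs and layer activations; a toy binning estimator follows as an assumption-laden stand-in for the paper's lossy-compression framework.

```python
# Minimal sketch of the kind of mutual-information estimate an IB
# analysis rests on: discretize two variables, then compute I(X; T)
# from the joint histogram. A toy estimator, not the paper's method.
import numpy as np

def mutual_information(x, t, bins=16):
    joint, _, _ = np.histogram2d(x, t, bins=bins)
    pxy = joint / joint.sum()                # joint distribution
    px = pxy.sum(axis=1, keepdims=True)      # marginal of x
    pt = pxy.sum(axis=0, keepdims=True)      # marginal of t
    mask = pxy > 0                           # avoid log(0) terms
    return float((pxy[mask] * np.log(pxy[mask] / (px @ pt)[mask])).sum())

rng = np.random.default_rng(3)
x = rng.standard_normal(5000)
t = np.tanh(x) + 0.1 * rng.standard_normal(5000)  # a noisy "activation"
print(mutual_information(x, t))              # estimate in nats
```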
arXiv Detail & Related papers (2023-05-13T21:44:32Z)