Mechanistic Interpretability of LoRA-Adapted Language Models for Nuclear Reactor Safety Applications
- URL: http://arxiv.org/abs/2507.09931v1
- Date: Mon, 14 Jul 2025 05:17:41 GMT
- Title: Mechanistic Interpretability of LoRA-Adapted Language Models for Nuclear Reactor Safety Applications
- Authors: Yoon Pyo Lee
- Abstract summary: This paper presents a novel methodology for interpreting how Large Language Models encode and utilize domain-specific knowledge. We adapted a general-purpose LLM to the nuclear domain using a parameter-efficient fine-tuning technique known as Low-Rank Adaptation. By comparing the neuron activation patterns of the base model to those of the fine-tuned model, we identified a sparse set of neurons whose behavior was significantly altered.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The integration of Large Language Models (LLMs) into safety-critical domains, such as nuclear engineering, necessitates a deep understanding of their internal reasoning processes. This paper presents a novel methodology for interpreting how an LLM encodes and utilizes domain-specific knowledge, using a Boiling Water Reactor system as a case study. We adapted a general-purpose LLM (Gemma-3-1b-it) to the nuclear domain using a parameter-efficient fine-tuning technique known as Low-Rank Adaptation. By comparing the neuron activation patterns of the base model to those of the fine-tuned model, we identified a sparse set of neurons whose behavior was significantly altered during the adaptation process. To probe the causal role of these specialized neurons, we employed a neuron silencing technique. Our results demonstrate that while silencing most of these specialized neurons individually did not produce a statistically significant effect, deactivating the entire group collectively led to a statistically significant degradation in task performance. Qualitative analysis further revealed that silencing these neurons impaired the model's ability to generate detailed, contextually accurate technical information. This paper provides a concrete methodology for enhancing the transparency of an opaque black-box model, allowing domain expertise to be traced to verifiable neural circuits. This offers a pathway towards achieving nuclear-grade artificial intelligence (AI) assurance, addressing the verification and validation challenges mandated by nuclear regulatory frameworks (e.g., 10 CFR 50 Appendix B), which have limited AI deployment in safety-critical nuclear operations.
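The activation-comparison and neuron-silencing methodology described in the abstract can be sketched in a few lines of PyTorch. The tiny two-layer MLP, the weight perturbation used to emulate fine-tuning drift, and the top-k cutoff below are all illustrative assumptions, not the paper's Gemma-3-1b-it/LoRA setup:

```python
# Sketch of the silencing probe: find neurons whose mean activation shifts
# most between a base and an adapted model, then zero that group with a
# forward hook to measure the collective effect.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_mlp():
    return nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

base, adapted = make_mlp(), make_mlp()
adapted.load_state_dict(base.state_dict())
with torch.no_grad():
    adapted[0].weight[:3] += 0.5  # stand-in for fine-tuning drift

x = torch.randn(64, 8)

def hidden_acts(model, x):
    # Post-ReLU activations of the single hidden layer.
    with torch.no_grad():
        return torch.relu(model[0](x))

# 1. Identify "specialized" neurons: largest mean-activation shift.
shift = (hidden_acts(adapted, x).mean(0) - hidden_acts(base, x).mean(0)).abs()
specialized = torch.topk(shift, k=3).indices

# 2. Silence the whole group via a forward hook on the ReLU output.
def silence(_module, _inputs, output):
    output[:, specialized] = 0.0
    return output

handle = adapted[1].register_forward_hook(silence)
silenced_out = adapted(x)
handle.remove()

print("specialized neurons:", specialized.tolist())
```

In the paper's framing, the downstream comparison is between task performance with and without the hook attached; here the hook simply demonstrates the mechanism.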
Related papers
- Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models [68.57424628540907]
Large language models (LLMs) often develop learned mechanisms specialized to specific datasets. We introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms. Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance.
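The neuron-attribution step this summary describes can be approximated with a Riemann-sum version of integrated gradients, attributed to hidden-layer neurons rather than inputs. The toy model, zero baseline, and 20-step approximation below are assumptions for illustration, not the paper's actual pipeline:

```python
# Minimal integrated-gradients-style attribution per hidden neuron:
# accumulate the output gradient w.r.t. each neuron along a straight-line
# path from a baseline input, then scale by the activation difference.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
x = torch.randn(8)
baseline = torch.zeros(8)

def neuron_ig(model, x, baseline, target=0, steps=20):
    total = torch.zeros(model[0].out_features)
    for alpha in torch.linspace(1.0 / steps, 1.0, steps):
        xi = baseline + alpha * (x - baseline)
        h = torch.relu(model[0](xi)).detach().requires_grad_(True)
        model[2](h)[target].backward()  # gradient of the target logit w.r.t. h
        total += h.grad
    h_full = torch.relu(model[0](x))
    h_base = torch.relu(model[0](baseline))
    return (h_full - h_base) * total / steps

scores = neuron_ig(model, x, baseline)
top = torch.topk(scores.abs(), k=4).indices
print("highest-attribution neurons:", top.tolist())
```

Because the final layer here is linear, the scores satisfy the completeness property exactly: they sum to the change in the target logit between baseline and input.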
arXiv Detail & Related papers (2025-07-12T08:10:10Z) - NOBLE -- Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models [68.89389652724378]
NOBLE is a neural operator framework that learns a mapping from a continuous frequency-modulated embedding of interpretable neuron features to the somatic voltage response induced by current injection. It predicts distributions of neural dynamics that account for intrinsic experimental variability. NOBLE is the first scaled-up deep learning framework validated on real experimental data.
arXiv Detail & Related papers (2025-06-05T01:01:18Z) - NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models [14.630626774362606]
Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content. We propose a novel approach to induce disalignment by identifying and modifying the neurons responsible for safety constraints.
arXiv Detail & Related papers (2025-04-29T05:49:35Z) - Neural DNF-MT: A Neuro-symbolic Approach for Learning Interpretable and Editable Policies [51.03989561425833]
We propose a neuro-symbolic approach called neural DNF-MT for end-to-end policy learning. The differentiable nature of the neural DNF-MT model enables the use of deep actor-critic algorithms for training. We show how the bivalent representations of deterministic policies can be edited and incorporated back into a neural model.
arXiv Detail & Related papers (2025-01-07T15:51:49Z) - Axiomatic Causal Interventions for Reverse Engineering Relevance Computation in Neural Retrieval Models [20.29451537633895]
We propose the use of causal interventions to reverse engineer neural rankers. We demonstrate how mechanistic interpretability methods can be used to isolate components satisfying term-frequency axioms.
arXiv Detail & Related papers (2024-05-03T22:30:15Z) - Assessing biomedical knowledge robustness in large language models by query-efficient sampling attacks [0.6282171844772422]
An increasing depth of parametric domain knowledge in large language models (LLMs) is fueling their rapid deployment in real-world applications. The recent discovery of named entities as adversarial examples in natural language processing tasks raises questions about their potential impact on the knowledge robustness of pre-trained and fine-tuned LLMs. We developed an embedding-space attack based on power-scaled distance-weighted sampling to assess the robustness of their biomedical knowledge.
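The core sampling idea named in this summary can be sketched directly: candidate substitutes are drawn with probability proportional to an inverse power of their distance to the original entity's embedding. The random embeddings, vocabulary size, and power parameter below are illustrative assumptions, not the paper's configuration:

```python
# Power-scaled distance-weighted sampling sketch: nearer neighbors in the
# embedding space dominate the sampling distribution as `power` grows.
import torch

torch.manual_seed(0)
vocab_emb = torch.randn(100, 32)  # candidate entity embeddings (toy data)
target = torch.randn(32)          # embedding of the entity to perturb
power = 4.0

dist = torch.norm(vocab_emb - target, dim=1)
weights = dist.pow(-power)        # inverse-power distance weighting
probs = weights / weights.sum()
samples = torch.multinomial(probs, num_samples=10, replacement=True)
print("sampled candidate indices:", samples.tolist())
```

Raising `power` sharpens the distribution toward nearest neighbors, making the sampled substitutes semantically closer to the original entity.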
arXiv Detail & Related papers (2024-02-16T09:29:38Z) - The Role of Foundation Models in Neuro-Symbolic Learning and Reasoning [54.56905063752427]
Neuro-Symbolic AI (NeSy) holds promise to ensure the safe deployment of AI systems.
Existing pipelines that train the neural and symbolic components sequentially require extensive labelling.
A new architecture, NeSyGPT, fine-tunes a vision-language foundation model to extract symbolic features from raw data.
arXiv Detail & Related papers (2024-02-02T20:33:14Z) - Neuro-symbolic model for cantilever beams damage detection [0.0]
We propose a neuro-symbolic model for the detection of damages in cantilever beams based on a novel cognitive architecture.
The hybrid discriminative model is introduced under the name Logic Convolutional Neural Regressor.
arXiv Detail & Related papers (2023-05-04T13:12:39Z) - Explaining the Deep Natural Language Processing by Mining Textual Interpretable Features [3.819533618886143]
T-EBAnO provides prediction-local and class-based model-global explanation strategies tailored to deep natural-language models.
It provides an objective, human-readable, domain-specific assessment of the reasons behind the automatic decision-making process.
arXiv Detail & Related papers (2021-06-12T06:25:09Z) - Leveraging Global Parameters for Flow-based Neural Posterior Estimation [90.21090932619695]
Inferring the parameters of a model based on experimental observations is central to the scientific method.
A particularly challenging setting is when the model is strongly indeterminate, i.e., when distinct sets of parameters yield identical observations.
We present a method for cracking such indeterminacy by exploiting additional information conveyed by an auxiliary set of observations sharing global parameters.
arXiv Detail & Related papers (2021-02-12T12:23:13Z) - Rectified Linear Postsynaptic Potential Function for Backpropagation in Deep Spiking Neural Networks [55.0627904986664]
Spiking Neural Networks (SNNs) use temporal spike patterns to represent and transmit information, which is not only biologically realistic but also suitable for ultra-low-power event-driven neuromorphic implementation.
This paper investigates the contribution of spike timing dynamics to information encoding, synaptic plasticity and decision making, providing a new perspective on the design of future deep SNNs and neuromorphic hardware systems.
arXiv Detail & Related papers (2020-03-26T11:13:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.