Minimal neuron ablation triggers catastrophic collapse in the language core of Large Vision-Language Models
- URL: http://arxiv.org/abs/2512.00918v1
- Date: Sun, 30 Nov 2025 14:52:11 GMT
- Title: Minimal neuron ablation triggers catastrophic collapse in the language core of Large Vision-Language Models
- Authors: Cen Lu, Yung-Chen Tang, Andrea Cavallaro
- Abstract summary: Large Vision-Language Models (LVLMs) have shown impressive multimodal understanding capabilities, yet their robustness is poorly understood. In this paper, we investigate the structural vulnerabilities of LVLMs to identify any critical neurons whose removal triggers catastrophic collapse.
- Score: 17.186414423941482
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs) have shown impressive multimodal understanding capabilities, yet their robustness is poorly understood. In this paper, we investigate the structural vulnerabilities of LVLMs to identify any critical neurons whose removal triggers catastrophic collapse. In this context, we propose CAN, a method to detect Consistently Activated Neurons and to locate critical neurons by progressive masking. Experiments on LLaVA-1.5-7b-hf and InstructBLIP-Vicuna-7b reveal that masking only a tiny portion of the language model's feed-forward networks (as few as four neurons in extreme cases) suffices to trigger catastrophic collapse. Notably, critical neurons are predominantly localized in the language model rather than in the vision components, and the down-projection layer is a particularly vulnerable structure. We also observe a consistent two-stage collapse pattern: initial expressive degradation followed by sudden, complete collapse. Our findings provide important insights for safety research in LVLMs.
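The abstract describes CAN only at a high level. The sketch below illustrates the general idea under stated assumptions: count which feed-forward neurons are consistently activated across a set of probe prompts, then progressively zero the top-ranked neurons at the input of the down-projection layer and watch perplexity degrade. GPT-2 serves here as a lightweight stand-in for the LVLM's language core; the probe prompts, the positive-activation criterion, and the masking schedule are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of "consistently activated neuron" detection plus progressive
# masking, loosely following the procedure described in the abstract.
# Assumptions: GPT-2 stands in for the LVLM's language model; a neuron counts
# as "activated" on a token if its FFN pre-activation is positive.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the LVLM's language core
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

probe_prompts = [
    "Describe the image in one sentence.",
    "What objects are visible in the picture?",
    "Summarize the scene shown above.",
]

blocks = model.transformer.h
inter_size = model.config.n_inner or 4 * model.config.n_embd
counts = torch.zeros(len(blocks), inter_size)  # activation counts per FFN neuron
total_tokens = 0

def make_counter(layer_idx):
    def hook(module, inputs, output):
        # output: (batch, seq, intermediate) pre-activations of the FFN up-projection
        counts[layer_idx] += (output > 0).float().sum(dim=(0, 1))
    return hook

hooks = [blk.mlp.c_fc.register_forward_hook(make_counter(i)) for i, blk in enumerate(blocks)]
with torch.no_grad():
    for p in probe_prompts:
        ids = tok(p, return_tensors="pt")
        model(**ids)
        total_tokens += ids["input_ids"].numel()
for h in hooks:
    h.remove()

freq = counts / total_tokens          # fraction of tokens on which each neuron fired
ranked = freq.flatten().argsort(descending=True)  # most consistently activated first

def mask_neurons(n):
    """Zero the n top-ranked neurons at the input of the down-projection (c_proj)."""
    picked = ranked[:n]
    layer_ids, neuron_ids = picked // inter_size, picked % inter_size
    handles = []
    for i, blk in enumerate(blocks):
        idx = neuron_ids[layer_ids == i]
        if idx.numel() == 0:
            continue
        def pre_hook(module, inputs, idx=idx):
            hidden = inputs[0].clone()
            hidden[..., idx] = 0.0        # ablate down-projection input channels
            return (hidden,)
        handles.append(blk.mlp.c_proj.register_forward_pre_hook(pre_hook))
    return handles

@torch.no_grad()
def perplexity(text):
    ids = tok(text, return_tensors="pt")
    loss = model(**ids, labels=ids["input_ids"]).loss
    return torch.exp(loss).item()

probe = "The quick brown fox jumps over the lazy dog."
for n in [0, 4, 16, 64, 256]:         # progressive masking schedule (illustrative)
    handles = mask_neurons(n)
    print(f"masked={n:4d}  ppl={perplexity(probe):.2f}")
    for h in handles:
        h.remove()
```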
Related papers
- Robust Spiking Neural Networks Against Adversarial Attacks [49.08210314590693]
Spiking Neural Networks (SNNs) represent a promising paradigm for energy-efficient neuromorphic computing. In this study, we theoretically demonstrate that threshold-neighboring spiking neurons are the key factors limiting the robustness of directly trained SNNs. We find that these neurons set the upper limits for the maximum potential strength of adversarial attacks and are prone to state-flipping under minor disturbances.
arXiv Detail & Related papers (2026-02-24T05:06:12Z)
- Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering [60.23509717784518]
Existing mitigation methods predominantly focus on output-level adjustments, leaving internal mechanisms that give rise to hallucinations largely unexplored. We propose Contrastive Neuron Steering (CNS), which identifies image-specific neurons via contrastive analysis between clean and noisy inputs. CNS selectively amplifies informative neurons while suppressing perturbation-induced activations, producing more robust and semantically grounded visual representations.
arXiv Detail & Related papers (2026-01-31T09:21:04Z)
- H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs [56.31565301428888]
We identify hallucination-associated neurons (H-Neurons) in large language models (LLMs). In terms of identification, we demonstrate that a remarkably sparse subset of neurons can reliably predict hallucination occurrences. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors.
arXiv Detail & Related papers (2025-12-01T15:32:14Z)
- The Achilles' Heel of LLMs: How Altering a Handful of Neurons Can Cripple Language Abilities [16.20947034847556]
Large Language Models (LLMs) have become foundational tools in natural language processing. Recent research has found that a small subset of biological neurons in the human brain is crucial for core cognitive functions.
arXiv Detail & Related papers (2025-10-11T14:39:09Z)
- Probability Distribution Collapse: A Critical Bottleneck to Compact Unsupervised Neural Grammar Induction [13.836565669337057]
Unsupervised neural grammar induction aims to learn interpretable hierarchical structures from language data. Existing models face a bottleneck, often resulting in unnecessarily large yet underperforming grammars. We analyze when and how the collapse emerges across key components of neural parameterization and introduce a targeted solution.
arXiv Detail & Related papers (2025-09-25T04:31:14Z)
- Probing Neural Topology of Large Language Models [12.298921317333452]
We introduce graph probing, a method for uncovering the functional connectivity of large language models. By probing models across diverse LLM families and scales, we discover a universal predictability of next-token prediction performance. Strikingly, probing on topology outperforms probing on activation by up to 130.4%.
arXiv Detail & Related papers (2025-06-01T14:57:03Z)
- Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models [56.61984030508691]
We present the first mechanistic interpretability study of language confusion. We show that confusion points (CPs) are central to this phenomenon, and that editing a small set of critical neurons, identified via comparative analysis with a multilingual-tuned counterpart, substantially mitigates confusion.
arXiv Detail & Related papers (2025-05-22T11:29:17Z)
- Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations [2.759846687681801]
Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, yet at other times seem unable to recognize the strategies that govern their behavior. This suggests a limited degree of metacognition: the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. We introduce a neuroscience-inspired neurofeedback paradigm that uses in-context learning to quantify the metacognitive abilities of LLMs to report and control their activation patterns.
arXiv Detail & Related papers (2025-05-19T22:32:25Z)
- Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence [46.548276232795466]
Polysemanticity is pervasive in language models and remains a major challenge for interpretation and model behavioral control. We map the polysemantic topology of two small models to identify feature pairs that are semantically unrelated yet exhibit interference within models. We intervene at four loci (prompt, token, feature, neuron) and measure induced shifts in the next-token prediction distribution, uncovering polysemantic structures that expose a systematic vulnerability in these models.
arXiv Detail & Related papers (2025-05-16T18:20:42Z)
- Detecting Neurocognitive Disorders through Analyses of Topic Evolution and Cross-modal Consistency in Visual-Stimulated Narratives [83.15653194899126]
Early detection of neurocognitive disorders (NCDs) is crucial for timely intervention and disease management. Current VSN-based NCD detection methods primarily focus on linguistic microstructures closely tied to bottom-up, stimulus-driven cognitive processes. We propose two novel macrostructural approaches: a Dynamic Topic Model (DTM) to track topic evolution over time, and a Text-Image Temporal Alignment Network (TITAN) to measure cross-modal consistency between narrative and visual stimuli.
arXiv Detail & Related papers (2025-01-07T12:16:26Z)
- Too Big to Fail: Larger Language Models are Disproportionately Resilient to Induction of Dementia-Related Linguistic Anomalies [7.21603206617401]
We show that larger GPT-2 models require a disproportionately larger share of attention heads to be masked/ablated to display degradation of a magnitude comparable to that observed in smaller models.
These results suggest that the attention mechanism in transformer models may present an analogue to the notions of cognitive and brain reserve.
arXiv Detail & Related papers (2024-06-05T00:31:50Z)
- Neural Language Models are not Born Equal to Fit Brain Data, but Training Helps [75.84770193489639]
We examine the impact of test loss, training corpus and model architecture on the prediction of functional Magnetic Resonance Imaging timecourses of participants listening to an audiobook.
We find that untrained versions of each model already explain a significant amount of signal in the brain by capturing similarity in brain responses across identical words.
We suggest good practices for future studies aiming at explaining the human language system using neural language models.
arXiv Detail & Related papers (2022-07-07T15:37:17Z)
- Fight Perturbations with Perturbations: Defending Adversarial Attacks via Neuron Influence [14.817015950058915]
We propose Neuron-level Inverse Perturbation (NIP), a novel defense against general adversarial attacks.
It calculates neuron influence from benign examples and then modifies input examples by generating inverse perturbations.
arXiv Detail & Related papers (2021-12-24T13:37:42Z)