Towards Worst-Case Guarantees with Scale-Aware Interpretability
- URL: http://arxiv.org/abs/2602.05184v1
- Date: Thu, 05 Feb 2026 01:22:31 GMT
- Title: Towards Worst-Case Guarantees with Scale-Aware Interpretability
- Authors: Lauren Greenspan, David Berman, Aryeh Brill, Ro Jefferson, Artemy Kolchinsky, Jennifer Lin, Andrew Mack, Anindita Maiti, Fernando E. Rosas, Alexander Stapleton, Lucas Teixeira, Dmitry Vaintrob
- Abstract summary: Neural networks organize information according to the hierarchical, multi-scale structure of natural data. We propose a unifying research agenda -- scale-aware interpretability -- to develop formal machinery and interpretability tools.
- Score: 58.519943565092724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural networks organize information according to the hierarchical, multi-scale structure of natural data. Methods to interpret model internals should be similarly scale-aware, explicitly tracking how features compose across resolutions and guaranteeing bounds on the influence of fine-grained structure that is discarded as irrelevant noise. We posit that the renormalisation framework from physics can meet this need by offering technical tools that can overcome limitations of current methods. Moreover, relevant work from adjacent fields has now matured to a point where scattered research threads can be synthesized into practical, theory-informed tools. To combine these threads in an AI safety context, we propose a unifying research agenda -- \emph{scale-aware interpretability} -- to develop formal machinery and interpretability tools that have robustness and faithfulness properties supported by statistical physics.
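The abstract stays at the level of a research agenda, but the coarse-graining picture it invokes can be made concrete with a toy computation: block-average a layer's activations (a crude stand-in for one renormalisation step) and track how much variance lives in the fine-grained residual that would be discarded as irrelevant at each scale. Everything below (the block-averaging scheme, shapes, and names) is an illustrative assumption, not machinery from the paper.

```python
# Toy sketch of a coarse-graining ("renormalisation") pass over activations.
# Illustrative only: the block-averaging scheme and the variance diagnostic are
# assumptions, not the method proposed in the paper.
import numpy as np

rng = np.random.default_rng(0)

def coarse_grain(acts: np.ndarray, block: int) -> np.ndarray:
    """Average activations over non-overlapping blocks of `block` channels."""
    n, d = acts.shape
    d_coarse = d // block
    return acts[:, : d_coarse * block].reshape(n, d_coarse, block).mean(axis=2)

def expand(coarse: np.ndarray, block: int, d: int) -> np.ndarray:
    """Broadcast each coarse channel back over its block (the retained, 'relevant' part)."""
    full = np.repeat(coarse, block, axis=1)
    return np.pad(full, ((0, 0), (0, d - full.shape[1])))

# Fake activations standing in for one layer of a real network.
acts = rng.normal(size=(512, 256))
for block in (2, 4, 8, 16):
    relevant = expand(coarse_grain(acts, block), block, acts.shape[1])
    residual = acts - relevant            # fine-grained structure discarded at this scale
    print(f"block={block:2d}  fraction of variance in discarded fine detail: {residual.var() / acts.var():.3f}")
```

A scale-aware interpretability tool in the paper's sense would additionally need a guaranteed bound on how much that discarded residual can change the model's behaviour, rather than just an empirical variance estimate like the one printed here.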
Related papers
- RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis [53.90240071275054]
The transition toward localized intelligence through Small Language Models (SLMs) has intensified the need for rigorous performance characterization on resource-constrained edge hardware. We propose a systematic framework that unifies architectural primitives and hardware constraints through the lens of operational intensity (OI). By defining an inference-potential region, we introduce the Relative Inference Potential as a novel metric to compare efficiency differences between Large Language Models (LLMs) on the same hardware substrate.
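The roofline model this benchmark builds on is standard and easy to state: attainable throughput is min(peak compute, OI × memory bandwidth). The sketch below applies it to a made-up decode-time matrix multiply on a hypothetical edge accelerator; the hardware numbers and workload are assumptions, and the paper's Relative Inference Potential metric is not reproduced here.

```python
# Minimal roofline calculation. The formula (attainable = min(peak, OI * bandwidth))
# is the standard roofline model; the hardware numbers and example workload are
# made up for illustration and are not taken from RooflineBench.
from dataclasses import dataclass

@dataclass
class Hardware:
    peak_flops: float       # FLOP/s the device can sustain at best
    mem_bandwidth: float    # bytes/s of usable memory bandwidth

def operational_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to/from memory."""
    return flops / bytes_moved

def attainable_flops(hw: Hardware, oi: float) -> float:
    """Roofline: performance is capped by either compute or memory traffic."""
    return min(hw.peak_flops, oi * hw.mem_bandwidth)

# Hypothetical edge accelerator and a single (m, k) x (k, n) GEMM in fp16.
edge = Hardware(peak_flops=4e12, mem_bandwidth=60e9)
m, k, n, bytes_per_el = 1, 4096, 4096, 2
flops = 2 * m * k * n
bytes_moved = bytes_per_el * (m * k + k * n + m * n)
oi = operational_intensity(flops, bytes_moved)
print(f"OI = {oi:.2f} FLOP/byte, attainable = {attainable_flops(edge, oi)/1e9:.1f} GFLOP/s")
```

With these assumed numbers the workload lands far below the compute roof, illustrating why small-batch decoding on edge devices is typically memory-bandwidth bound.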
arXiv Detail & Related papers (2026-02-12T03:02:22Z)
- Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions [16.821238326410324]
Large language models (LLMs) have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque. Mechanistic interpretability has emerged as a critical research direction for understanding and aligning these models. We analyze how interpretability insights have informed alignment strategies including reinforcement learning from human feedback, constitutional AI, and scalable oversight.
arXiv Detail & Related papers (2026-01-21T11:43:57Z)
- Explainable Neural Inverse Kinematics for Obstacle-Aware Robotic Manipulation: A Comparative Analysis of IKNet Variants [0.28544513613730205]
Deep neural networks have accelerated inverse-kinematics (IK) inference to the point where low-cost manipulators can execute complex trajectories in real time. This study proposes an explainability-centered workflow that integrates Shapley-value attribution with physics-based obstacle-avoidance evaluation.
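Shapley-value attribution itself is a standard technique; a minimal Monte Carlo estimator over the inputs of a stand-in IK model might look like the sketch below. The toy model, feature names, and baseline are assumptions, and the paper's obstacle-avoidance evaluation is not reproduced.

```python
# Minimal Monte Carlo Shapley attribution over input features.
# The 'ik_model' and the baseline are placeholders; only the sampling scheme
# (average marginal contribution over random feature orderings) is standard.
import numpy as np

rng = np.random.default_rng(0)

def ik_model(pose: np.ndarray) -> float:
    """Stand-in for a trained IK network; returns a scalar (e.g., one joint angle)."""
    w = np.array([0.8, -1.2, 0.5, 0.1, 0.0, 0.3])
    return float(np.tanh(pose @ w))

def shapley_values(f, x, baseline, n_perms=2000):
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_perms):
        order = rng.permutation(d)
        z = baseline.copy()
        prev = f(z)
        for i in order:
            z[i] = x[i]                 # reveal feature i in this ordering
            cur = f(z)
            phi[i] += cur - prev        # its marginal contribution here
            prev = cur
    return phi / n_perms

x = np.array([0.4, -0.2, 0.7, 0.1, 0.05, -0.3])   # hypothetical target pose (x, y, z, roll, pitch, yaw)
baseline = np.zeros_like(x)
print(dict(zip(["x", "y", "z", "roll", "pitch", "yaw"],
               shapley_values(ik_model, x, baseline).round(3))))
```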
arXiv Detail & Related papers (2025-12-29T09:02:02Z)
- Circuit Insights: Towards Interpretability Beyond Activations [20.178085579725472]
We propose WeightLens and CircuitLens, two complementary methods for mechanistic interpretability. WeightLens interprets features directly from their learned weights, removing the need for explainer models or datasets. CircuitLens captures how feature activations arise from interactions between components, revealing circuit-level dynamics.
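The abstract does not spell out how WeightLens reads a feature from its weights; as a generic illustration of weight-based (activation-free) feature interpretation, the sketch below projects a toy feature's decoder direction through an unembedding matrix and lists the tokens it most promotes, a common logit-lens-style readout. All matrices and the vocabulary are fabricated, and this is not claimed to be the paper's method.

```python
# Generic weight-based feature readout: project a feature's decoder direction
# through the unembedding and list the tokens it most promotes. This is a
# common logit-lens-style heuristic, not necessarily what WeightLens does.
import numpy as np

rng = np.random.default_rng(1)
d_model, n_features = 64, 128
tokens = ["cat", "dog", "run", "jump", "red", "blue", "one", "two", "and", "the"]

W_U = rng.normal(size=(d_model, len(tokens)))    # toy unembedding matrix
W_dec = rng.normal(size=(n_features, d_model))   # toy feature decoder

feature_id = 7
logit_effect = W_dec[feature_id] @ W_U           # how this feature shifts each token's logit
top = np.argsort(logit_effect)[::-1][:3]
print(f"feature {feature_id} most promotes:", [tokens[i] for i in top])
```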
arXiv Detail & Related papers (2025-10-16T17:49:41Z)
- Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments [70.42705564227548]
We propose an automated environment construction pipeline for large language models (LLMs). This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. We also introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution.
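The abstract only says the reward checks tool-use precision and task completeness; one plausible, fully assumed shape of such a verifiable check (schema-validated tool calls plus a checkpoint list) is sketched below, purely as a guess at the general pattern rather than the paper's mechanism.

```python
# Assumed sketch of a verifiable reward for tool use: score how many emitted
# tool calls match an expected schema, and how many required checkpoints were hit.
# This is a guess at the general shape, not the paper's reward function.
from typing import Any

def tool_call_precision(calls: list[dict[str, Any]], schema: dict[str, set[str]]) -> float:
    """Fraction of calls that name a known tool and pass only allowed arguments."""
    if not calls:
        return 0.0
    ok = sum(1 for c in calls
             if c.get("tool") in schema and set(c.get("args", {})) <= schema[c["tool"]])
    return ok / len(calls)

def task_completeness(trace: str, required: list[str]) -> float:
    """Fraction of required checkpoints that appear in the execution trace."""
    return sum(1 for r in required if r in trace) / len(required)

calls = [{"tool": "search", "args": {"query": "weather Berlin"}},
         {"tool": "calculator", "args": {"expr": "3*7", "mode": "fast"}}]
schema = {"search": {"query"}, "calculator": {"expr"}}
reward = 0.5 * tool_call_precision(calls, schema) + 0.5 * task_completeness(
    "searched weather; reported 21", ["searched weather", "reported"])
print(f"reward = {reward:.2f}")
```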
arXiv Detail & Related papers (2025-08-12T09:45:19Z)
- Learning Causal Structure Distributions for Robust Planning [53.753366558072806]
We find that learning the functional relationships while accounting for the uncertainty about the structural information leads to more robust dynamics models. This is in contrast with common model-learning methods that ignore the causal structure and fail to leverage the sparsity of interactions in robotic systems. We show that our model can be used to learn the dynamics of a robot, which, together with a sampling-based planner, can be used to perform new tasks in novel environments.
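One minimal way to picture "accounting for uncertainty about the structural information" is to hold a distribution over sparse causal adjacency masks and average dynamics predictions over sampled structures; the sketch below does this with a linear toy transition model. The edge probabilities, dimensions, and update rule are assumptions, not the paper's algorithm.

```python
# Toy illustration of predicting dynamics under a *distribution* over causal
# structures: sample sparse adjacency masks, apply each to a learned transition
# matrix, and average the next-state predictions. Assumed sketch, not the paper's method.
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # state dimension
A_hat = rng.normal(size=(d, d)) * 0.5   # stand-in for learned functional relationships
edge_prob = np.full((d, d), 0.3)        # believed probability that each causal edge exists
np.fill_diagonal(edge_prob, 1.0)        # each variable always influences itself

def predict_next(state, n_samples=200):
    preds = []
    for _ in range(n_samples):
        mask = rng.random((d, d)) < edge_prob     # sample one causal structure
        preds.append((A_hat * mask) @ state)      # dynamics under that structure
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)  # mean prediction + structural uncertainty

mean, std = predict_next(np.array([1.0, 0.0, -0.5, 0.2]))
print("next state:", mean.round(3), "uncertainty:", std.round(3))
```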
arXiv Detail & Related papers (2025-08-08T22:43:17Z)
- Neural Network Reprogrammability: A Unified Theme on Model Reprogramming, Prompt Tuning, and Prompt Instruction [57.19302613163439]
We introduce neural network reprogrammability as a unifying framework for model adaptation. We present a taxonomy that categorizes such information manipulation approaches across four key dimensions. We also analyze remaining technical challenges and ethical considerations.
arXiv Detail & Related papers (2025-06-05T05:42:27Z)
- Information Science Principles of Machine Learning: A Causal Chain Meta-Framework Based on Formalized Information Mapping [7.299890614172539]
This study addresses key challenges in machine learning, namely the absence of a unified formal theoretical framework and the lack of foundational theories for model interpretability and ethical safety. We first construct a formal information model, explicitly defining the ontological states and carrier mappings of typical machine learning stages. By introducing learnable and processable predicates, as well as learning and processing functions, we analyze the causal chain logic and constraint laws governing machine learning processes.
arXiv Detail & Related papers (2025-05-19T14:39:41Z)
- Counterfactual Explanations as Interventions in Latent Space [62.997667081978825]
Counterfactual explanations aim to provide end users with a set of features that need to be changed in order to achieve a desired outcome.
Current approaches rarely take into account the feasibility of actions needed to achieve the proposed explanations.
We present Counterfactual Explanations as Interventions in Latent Space (CEILS), a methodology to generate counterfactual explanations.
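CEILS defines its interventions through a causal graph over latent variables; the sketch below shows only the generic encode-perturb-decode-check loop behind latent-space counterfactual search, with toy linear models standing in for the encoder, decoder, and classifier, and without the causal constraints that distinguish CEILS.

```python
# Generic latent-space counterfactual search: move the latent code until the
# decoded point flips a classifier, stopping at the first flip found. Toy models
# throughout; CEILS additionally constrains moves via a causal graph, which is
# not reproduced here.
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(5, 2)) * 0.5          # toy encoder (features -> latent)
W_dec = np.linalg.pinv(W_enc)                  # toy decoder (latent -> features)
w_clf = rng.normal(size=5)                     # toy linear classifier on features

def classify(x):
    return float(x @ w_clf > 0)

def latent_counterfactual(x, steps=500, lr=0.05):
    z = x @ W_enc
    target = 1.0 - classify(z @ W_dec)         # flip the label of the reconstructed input
    for _ in range(steps):
        x_hat = z @ W_dec
        if classify(x_hat) == target:
            return x_hat
        # move the latent along the classifier direction pulled back through the decoder
        z += lr * (1.0 if target == 1.0 else -1.0) * (W_dec @ w_clf)
    return None

x = rng.normal(size=5)
cf = latent_counterfactual(x)
print("factual label:", classify(x @ W_enc @ W_dec), "counterfactual found:", cf is not None)
if cf is not None:
    print("feature changes:", (cf - x).round(3))
```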
arXiv Detail & Related papers (2021-06-14T20:48:48Z)