Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks
- URL: http://arxiv.org/abs/2509.06701v1
- Date: Mon, 08 Sep 2025 13:55:01 GMT
- Title: Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks
- Authors: Su Hyeong Lee, Risi Kondor, Richard Ngo
- Abstract summary: We develop a theory of intelligent agency grounded in probabilistic modeling for neural models. We prove that strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes.
- Score: 7.4145864319417285
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We develop a theory of intelligent agency grounded in probabilistic modeling for neural models. Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare. We prove that strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes. Our framework admits recursive structure via cloning invariance, continuity, and openness, while tilt-based analysis rules out trivial duplication. Finally, we formalize an agentic alignment phenomenon in LLMs using our theory: eliciting a benevolent persona ("Luigi") induces an antagonistic counterpart ("Waluigi"), while a manifest-then-suppress Waluigi strategy yields strictly larger first-order misalignment reduction than pure Luigi reinforcement alone. These results show how a principled mathematical framework for subagents coalescing into coherent higher-level entities yields novel implications for alignment in agentic AI systems.
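To make the composition operation concrete, here is a minimal sketch (assuming the standard definitions of linear and logarithmic opinion pooling and of the log score; the helper names are illustrative and the paper's exact welfare baseline is not reproduced) that pools two agents over a three-outcome space, the regime in which the abstract's unanimity result becomes possible:

```python
import numpy as np

def log_score(belief, dist):
    """Expected log score of `dist` under `belief`; higher is better."""
    return float(np.sum(belief * np.log(dist)))

def log_pool(dists, weights):
    """Weighted logarithmic pool: q(x) proportional to prod_i p_i(x)**w_i."""
    q = np.prod(np.stack(dists) ** np.asarray(weights)[:, None], axis=0)
    return q / q.sum()

def linear_pool(dists, weights):
    """Weighted linear pool: q(x) = sum_i w_i * p_i(x), shown for contrast."""
    return np.average(np.stack(dists), axis=0, weights=weights)

# Two agents over three outcomes -- the regime where, per the paper,
# strict unanimity becomes achievable (it is impossible for two outcomes).
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.2, 0.7])
w = [0.5, 0.5]

for name, q in [("log", log_pool([p1, p2], w)),
                ("linear", linear_pool([p1, p2], w))]:
    print(name, np.round(q, 3),
          [round(log_score(p, q), 3) for p in (p1, p2)])
```

With weights summing to one, the logarithmic pool is the normalized geometric mean of the members' distributions. Note that, by Gibbs' inequality, no pool can beat an agent's own distribution on its own expected log score, so the paper's strict-improvement claim is relative to its own welfare baseline rather than this naive comparison.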
Related papers
- On Multi-Step Theorem Prediction via Non-Parametric Structural Priors [50.16583672681106]
In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models.
arXiv Detail & Related papers (2026-03-05T06:08:50Z) - Conjugate Learning Theory: Uncovering the Mechanisms of Trainability and Generalization in Deep Neural Networks [0.0]
- Conjugate Learning Theory: Uncovering the Mechanisms of Trainability and Generalization in Deep Neural Networks [0.0]
We develop a conjugate learning theoretical framework based on convex conjugate duality to characterize this learnability property. We demonstrate that training deep neural networks (DNNs) with mini-batch stochastic gradient descent (SGD) achieves global optima of empirical risk. We derive deterministic and probabilistic bounds on generalization error based on conditional generalized entropy measures.
arXiv Detail & Related papers (2026-02-18T04:26:55Z) - Learning a Generative Meta-Model of LLM Activations [75.30161960337892]
We create "meta-models" that learn the distribution of a network's internal states.<n>Applying the meta-model's learned prior to steering interventions improves fluency, with larger gains as loss decreases.<n>These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions.
arXiv Detail & Related papers (2026-02-06T18:59:56Z) - Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality [52.57416398859353]
We show that causal minimality can endow latent representations of diffusion vision and autoregressive language models with clear causal interpretation and robust, component-wise identifiable control. We introduce a novel theoretical framework for hierarchical selection models, where higher-level concepts emerge from the constrained composition of lower-level variables. These causally grounded concepts serve as levers for fine-grained model steering, paving the way for transparent, reliable systems.
arXiv Detail & Related papers (2025-12-11T14:59:14Z) - Exploring Syntropic Frameworks in AI Alignment: A Philosophical Investigation [0.0]
I argue that AI alignment should be reconceived as architecting syntropic, reasons-responsive agents through process-based, multi-agent, developmental mechanisms. I articulate the "specification trap" argument demonstrating why content-based value specification appears structurally unstable. I propose syntropy as an information-theoretic framework for understanding multi-agent alignment dynamics.
arXiv Detail & Related papers (2025-11-19T23:31:29Z) - Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails [103.05296856071931]
We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving Large Language Model (LLM) agents. ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states.
arXiv Detail & Related papers (2025-10-06T14:48:39Z) - Modeling GRNs with a Probabilistic Categorical Framework [6.929340252997961]
This work introduces the Probabilistic Categorical GRN (PC-GRN) framework. It is a novel theoretical approach founded on the synergistic integration of three core methodologies. The framework provides a mathematically rigorous, biologically interpretable, and uncertainty-aware representation of GRNs.
arXiv Detail & Related papers (2025-08-16T14:06:53Z) - I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z) - Parallelly Tempered Generative Adversarial Nets: Toward Stabilized Gradients [7.94957965474334]
A generative adversarial network (GAN) has been a representative backbone model in generative artificial intelligence (AI). This work analyzes the training instability and inefficiency in the presence of mode collapse by linking it to multimodality in the target distribution. With our newly developed GAN objective function, the generator can learn all the tempered distributions simultaneously.
arXiv Detail & Related papers (2024-11-18T18:01:13Z) - Interpretable Imitation Learning with Dynamic Causal Relations [65.18456572421702]
- Interpretable Imitation Learning with Dynamic Causal Relations [65.18456572421702]
We propose to expose captured knowledge in the form of a directed acyclic causal graph.
We also design this causal discovery process to be state-dependent, enabling it to model the dynamics in latent causal graphs.
The proposed framework is composed of three parts: a dynamic causal discovery module, a causality encoding module, and a prediction module, and is trained in an end-to-end manner.
arXiv Detail & Related papers (2023-09-30T20:59:42Z) - Finding Alignments Between Interpretable Causal Variables and
Distributed Neural Representations [62.65877150123775]
Causal abstraction is a promising theoretical framework for explainable artificial intelligence.
Existing causal abstraction methods require a brute-force search over alignments between the high-level model and the low-level one.
We present distributed alignment search (DAS), which overcomes these limitations.
arXiv Detail & Related papers (2023-03-05T00:57:49Z) - Generalization Properties of Optimal Transport GANs with Latent
Distribution Learning [52.25145141639159]
We study how the interplay between the latent distribution and the complexity of the pushforward map affects performance.
Motivated by our analysis, we advocate learning the latent distribution as well as the pushforward map within the GAN paradigm.
arXiv Detail & Related papers (2020-07-29T07:31:33Z)