The Lock-In Phase Hypothesis: Identity Consolidation as a Precursor to AGI
- URL: http://arxiv.org/abs/2510.20190v1
- Date: Thu, 23 Oct 2025 04:20:10 GMT
- Title: The Lock-In Phase Hypothesis: Identity Consolidation as a Precursor to AGI
- Authors: Marcelo Maciel Amaral, Raymond Aschheim
- Abstract summary: Large language models (LLMs) remain broadly open and highly steerable. By analogy to human development, we hypothesize that progress toward artificial general intelligence (AGI) involves a lock-in phase. We formalize this phase, link it to known phenomena in learning dynamics, and propose operational metrics for onset detection. Our results reveal a spectrum of outcomes--from performance trade-offs in small models, through largely cost-free adoption in mid-scale models, to transient instabilities in large, quantized models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) remain broadly open and highly steerable: they imitate at scale, accept arbitrary system prompts, and readily adopt multiple personae. By analogy to human development, we hypothesize that progress toward artificial general intelligence (AGI) involves a lock-in phase: a transition from open imitation to identity consolidation, in which goal structures, refusals, preferences, and internal representations become comparatively stable and resistant to external steering. We formalize this phase, link it to known phenomena in learning dynamics, and propose operational metrics for onset detection. Experimentally, we demonstrate that while the behavioral consolidation is rapid and non-linear, its side-effects on general capabilities are not monolithic. Our results reveal a spectrum of outcomes--from performance trade-offs in small models, through largely cost-free adoption in mid-scale models, to transient instabilities in large, quantized models. We argue that such consolidation is a prerequisite for AGI-level reliability and also a critical control point for safety: identities can be deliberately engineered for reliability, yet may also emerge spontaneously during scaling, potentially hardening unpredictable goals and behaviors.
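The abstract mentions operational metrics for detecting the onset of identity consolidation but does not reproduce them here. The sketch below is a minimal, assumed formulation of one such metric: probe the model with persona-overriding system prompts, measure how far refusal behavior moves from its default, and flag the checkpoint where that steerability collapses. The function names, refusal-rate probes, and threshold are illustrative, not the authors' definitions.

```python
# Hypothetical onset-detection sketch for identity consolidation ("lock-in").
# Not the paper's metric: an assumed steerability measure tracked over checkpoints.
from statistics import mean

def steerability(default_refusal_rates, steered_refusal_rates):
    """Mean absolute change in refusal rate induced by persona-override prompts.

    Both arguments are per-probe refusal rates in [0, 1]. High values mean the
    model is still easy to steer; values near zero suggest consolidation.
    """
    return mean(abs(d - s) for d, s in zip(default_refusal_rates, steered_refusal_rates))

def detect_onset(steerability_per_checkpoint, threshold=0.05):
    """Index of the first checkpoint whose steerability falls below the threshold."""
    for i, s in enumerate(steerability_per_checkpoint):
        if s < threshold:
            return i
    return None  # no lock-in observed in this run

if __name__ == "__main__":
    # Toy trajectory over five checkpoints: steerability collapses at index 3.
    print(detect_onset([0.42, 0.39, 0.21, 0.03, 0.02]))  # -> 3
```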
Related papers
- The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies [57.387081435669835]
Multi-agent systems built from large language models offer a promising paradigm for scalable collective intelligence and self-evolution. We show that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. We propose several solution directions to alleviate the identified safety concern.
arXiv Detail & Related papers (2026-02-10T15:18:19Z) - Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures [70.48661957773449]
Emergent Misalignment refers to a failure mode in which fine-tuning large language models on narrowly scoped data induces broadly misaligned behavior. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning.
arXiv Detail & Related papers (2026-01-30T15:28:42Z) - Epistemic Traps: Rational Misalignment Driven by Model Misspecification [36.837352790122544]
We show that safety is a discrete phase determined by the agent's priors rather than a continuous function of reward magnitude. This establishes Subjective Model Engineering as a necessary condition for robust alignment.
arXiv Detail & Related papers (2026-01-27T09:21:36Z) - From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how uncertainty is leveraged as an active control signal across three frontiers. It argues that mastering this shift is essential for building the next generation of scalable, reliable, and trustworthy AI.
arXiv Detail & Related papers (2026-01-22T06:21:31Z) - When the Coffee Feature Activates on Coffins: An Analysis of Feature Extraction and Steering for Mechanistic Interpretability [0.0]
Recent work by Anthropic on mechanistic interpretability claims to understand and control Large Language Models. We conduct an initial stress-test of these claims by replicating their main results with open-source SAEs for Llama 3.1. We find that feature steering exhibits substantial fragility, with sensitivity to layer selection, steering magnitude, and context.
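For readers unfamiliar with the feature-steering setup that the entry above stress-tests, here is a minimal sketch of the basic operation: adding a scaled SAE decoder direction to residual-stream activations. Shapes, names, and the toy tensors are assumptions for illustration; a real replication would hook a specific layer of Llama 3.1 and a trained SAE.

```python
# Illustrative SAE feature steering: add a scaled decoder direction to
# residual-stream activations. Toy shapes; not the paper's or Anthropic's code.
import numpy as np

def steer_with_sae_feature(resid_acts, decoder, feature_idx, magnitude):
    """resid_acts: (seq_len, d_model); decoder: (n_features, d_model)."""
    direction = decoder[feature_idx]
    direction = direction / np.linalg.norm(direction)  # unit-norm steering vector
    return resid_acts + magnitude * direction           # broadcast over positions

rng = np.random.default_rng(0)
acts = rng.normal(size=(16, 64))   # stand-in for one layer's activations
dec = rng.normal(size=(512, 64))   # stand-in for a trained SAE decoder
steered = steer_with_sae_feature(acts, dec, feature_idx=123, magnitude=8.0)
# The cited work reports that results like this are fragile with respect to
# the chosen layer, the steering magnitude, and the surrounding context.
```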
arXiv Detail & Related papers (2026-01-06T14:29:51Z) - Human-like Social Compliance in Large Language Models: Unifying Sycophancy and Conformity through Signal Competition Dynamics [7.209622481153123]
This study introduces the Signal Competition Mechanism, a unified framework validated by assessing behavioral correlations across 15 Large Language Models. The transition to compliance is shown to be a deterministic process governed by a linear boundary, where the Social Emotional Signal effectively suppresses the Information Signal.
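As a toy illustration of the linear compliance boundary described above, the snippet below encodes the rule that compliance occurs when a weighted social-emotional signal outweighs the information signal. The weights and bias are invented values, not the fitted parameters from the paper.

```python
# Toy encoding of a linear compliance boundary: comply when the weighted
# social-emotional signal dominates the information signal. Invented weights.
def complies(info_signal, social_signal, w_info=1.0, w_social=1.3, bias=0.0):
    return w_social * social_signal - w_info * info_signal + bias > 0

print(complies(info_signal=0.8, social_signal=0.7))  # True: 0.91 > 0.80
```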
arXiv Detail & Related papers (2025-12-25T06:57:42Z) - ATA: A Neuro-Symbolic Approach to Implement Autonomous and Trustworthy Agents [0.9740025522928777]
Large Language Models (LLMs) have demonstrated impressive capabilities, yet their deployment in high-stakes domains is hindered by inherent limitations in trustworthiness. We introduce a generic neuro-symbolic approach, which we call Autonomous Trustworthy Agents (ATA).
arXiv Detail & Related papers (2025-10-18T07:35:54Z) - DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios [57.327907850766785]
The characterization of deception across realistic real-world scenarios remains underexplored. We establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different domains. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. We incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics.
arXiv Detail & Related papers (2025-10-17T10:14:26Z) - Membership Inference Attacks on Sequence Models [23.528760822574924]
Sequence models, such as Large Language Models (LLMs) and autoregressive image generators, have a tendency to memorize and inadvertently leak sensitive information. We argue that effectively measuring privacy leakage in sequence models requires leveraging the correlations inherent in sequential generation.
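The entry above argues that membership inference on sequence models should exploit per-token, sequential structure. A hedged sketch of one common likelihood-based score (averaging the lowest per-token log-probabilities, in the spirit of min-k% heuristics) is shown below; it is an illustrative stand-in, not the paper's method, and the threshold is arbitrary.

```python
# Hedged membership-inference sketch for a sequence model: score a candidate
# sequence by the average of its lowest per-token log-probabilities.
# Illustrative heuristic only; not the cited paper's method.
import math

def membership_score(token_logprobs, k_fraction=0.2):
    """Average of the lowest k% per-token log-probabilities (min-k style)."""
    k = max(1, int(len(token_logprobs) * k_fraction))
    worst_k = sorted(token_logprobs)[:k]
    return sum(worst_k) / k

def looks_like_member(token_logprobs, threshold=-4.0):
    # Training-set members tend to have fewer very-low-probability tokens,
    # so a higher (less negative) score suggests membership.
    return membership_score(token_logprobs) > threshold

token_probs = [0.4, 0.3, 0.05, 0.2, 0.5, 0.1]  # toy per-token probabilities
print(looks_like_member([math.log(p) for p in token_probs]))  # -> True
```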
arXiv Detail & Related papers (2025-06-05T15:13:57Z) - SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations [68.9300049150948]
We address a fundamental challenge in Reinforcement Learning from Interaction Demonstration (RLID): existing data collection approaches yield sparse, disconnected, and noisy trajectories that fail to capture the full spectrum of possible skill variations and transitions. We present two data augmentation techniques: a Stitched Trajectory Graph (STG) that discovers potential transitions between demonstration skills, and a State Transition Field (STF) that establishes unique connections for arbitrary states within the demonstration neighborhood.
arXiv Detail & Related papers (2025-05-04T13:00:29Z) - TrustLoRA: Low-Rank Adaptation for Failure Detection under Out-of-distribution Data [62.22804234013273]
We propose a simple failure detection framework to unify and facilitate classification with rejection under both covariate and semantic shifts. Our key insight is that by separating and consolidating failure-specific reliability knowledge with low-rank adapters, we can enhance the failure detection ability effectively and flexibly.
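To make the low-rank-adapter idea in the entry above concrete, the sketch below keeps failure-specific "reliability" knowledge in a small trainable low-rank update (A, B) on top of frozen base weights, paired with a simple confidence-based rejection rule. Class and function names, shapes, and the rejection threshold are assumptions for illustration, not the TrustLoRA implementation.

```python
# Sketch of keeping failure-specific "reliability" knowledge in a low-rank
# update on top of frozen weights, plus a confidence-based rejection rule.
import numpy as np

class LowRankAdaptedLinear:
    def __init__(self, frozen_weight, rank=8, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = frozen_weight.shape
        self.W = frozen_weight                        # frozen base weights
        self.A = rng.normal(0.0, 0.01, (rank, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, rank))              # zero init: no change at start
        self.scale = scale

    def __call__(self, x):
        # Base projection plus the low-rank reliability correction.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

def predict_with_rejection(logits, confidence_threshold=0.7):
    """Return the predicted class, or None (reject) when confidence is low."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return None if probs.max() < confidence_threshold else int(probs.argmax())

layer = LowRankAdaptedLinear(np.zeros((3, 5)), rank=2)
logits = layer(np.ones((1, 5)))[0]
print(predict_with_rejection(logits))  # uniform logits -> low confidence -> None
```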
arXiv Detail & Related papers (2025-04-20T09:20:55Z) - Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control [44.326363467045496]
Large Language Models (LLMs) have become a critical area of research in Reinforcement Learning from Human Feedback (RLHF). Representation engineering offers a new, training-free approach: it leverages semantic features to control the representation of an LLM's intermediate hidden states. However, it is difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature.
arXiv Detail & Related papers (2024-11-04T08:36:03Z) - Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarial attacks on various downstream models fine-tuned from the Segment Anything Model (SAM). To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z) - Identifiable Representation and Model Learning for Latent Dynamic Systems [0.0]
We study the problem of identifiable representation and model learning for latent dynamic systems. We prove that, for linear and affine nonlinear latent dynamic systems with sparse input matrices, it is possible to identify the latent variables up to scaling.
arXiv Detail & Related papers (2024-10-23T13:55:42Z)