The Algorithmic Unconscious: Structural Mechanisms and Implicit Biases in Large Language Models
- URL: http://arxiv.org/abs/2602.18468v1
- Date: Sun, 08 Feb 2026 16:03:43 GMT
- Title: The Algorithmic Unconscious: Structural Mechanisms and Implicit Biases in Large Language Models
- Authors: Philippe Boisnard
- Abstract summary: This article introduces the concept of the algorithmic unconscious to designate the set of structural determinations that operate within large language models (LLMs). We argue that a significant class of biases emerges directly from the technical mechanisms of the models themselves: tokenization, attention, statistical optimization, and alignment procedures. We propose a framework for a technical clinic of models, grounded in the audit of tokenization regimes, latent space topology, and alignment systems.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This article introduces the concept of the algorithmic unconscious to designate the set of structural determinations that operate within large language models (LLMs) without being accessible either to the model's own reflexivity or to that of its users. In contrast to approaches that reduce AI bias solely to dataset composition or to the projection of human intentionality, we argue that a significant class of biases emerges directly from the technical mechanisms of the models themselves: tokenization, attention, statistical optimization, and alignment procedures. By framing bias as an infrastructural phenomenon, this approach resolves a central theoretical ambiguity surrounding responsibility, neutrality, and correction in contemporary LLMs. Based on a comparative analysis of tokenization across a corpus of parallel sentences, we show that Arabic languages (Modern Standard Arabic and Maghrebi dialects) undergo a systematic inflation in token count relative to English, with ratios ranging from 1.6x to nearly 4x depending on the infrastructure (OpenAI, Anthropic, SentencePiece/Mistral). This over-segmentation constitutes a measurable infrastructural bias that mechanically increases inference costs, constrains access to contextual space, and alters attentional weighting within model representations. We relate these empirical findings to three additional structural mechanisms: causal bias (correlation vs causation), the erasure of minoritized features through dimensional collapse, and normative biases induced by safety alignment. Finally, we propose a framework for a technical clinic of models, grounded in the audit of tokenization regimes, latent space topology, and alignment systems, as a necessary condition for the critical appropriation of AI infrastructures.
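The paper's core empirical claim is a token-inflation ratio: for parallel sentences, Arabic variants receive 1.6x to nearly 4x more tokens than English under the same tokenizer. That audit reduces to a simple per-pair ratio averaged over a corpus. The sketch below illustrates the computation only; the token counts are hypothetical placeholders, not the paper's data, and a real audit would obtain counts from an actual tokenizer (e.g. tiktoken or SentencePiece) run over the parallel corpus.

```python
# Sketch of the tokenization audit described in the abstract: given token
# counts for parallel sentences, compute the inflation ratio of a target
# language relative to English for one tokenizer.

def inflation_ratio(pairs):
    """Mean ratio of target-language to English token counts.

    `pairs` is a list of (english_tokens, target_tokens) counts measured
    on parallel sentences with the same tokenizer.
    """
    ratios = [target / english for english, target in pairs]
    return sum(ratios) / len(ratios)

# Hypothetical (English, Arabic) token counts for three parallel sentences.
# Real values would come from e.g. tiktoken's encode() on each sentence.
arabic_pairs = [(12, 27), (9, 16), (15, 38)]

ratio = inflation_ratio(arabic_pairs)
print(f"Arabic/English token inflation: {ratio:.2f}x")
```

A per-tokenizer comparison (OpenAI vs. Anthropic vs. SentencePiece/Mistral, as in the paper) would simply repeat this computation with counts from each infrastructure and compare the resulting ratios.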
Related papers
- Emergent Structured Representations Support Flexible In-Context Inference in Large Language Models [77.98801218316505]
Large language models (LLMs) exhibit emergent behaviors suggestive of human-like reasoning. We investigate the internal processing of LLMs during in-context concept inference.
arXiv Detail & Related papers (2026-02-08T03:14:39Z)
- Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering (RSS) is a probabilistic framework that disentangles physical affordance from semantic execution. RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z)
- Step-Aware Policy Optimization for Reasoning in Diffusion Large Language Models [57.42778606399764]
Diffusion language models (dLLMs) offer a promising, non-autoregressive paradigm for text generation. Current reinforcement learning approaches often rely on sparse, outcome-based rewards. We argue that this stems from a fundamental mismatch with the natural structure of reasoning.
arXiv Detail & Related papers (2025-10-02T00:34:15Z)
- Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models [93.1043186636177]
We explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea, a "Model Synthesis Architecture" (MSA). We evaluate our MSA as a model of human judgments on a novel reasoning dataset.
arXiv Detail & Related papers (2025-07-16T18:01:03Z)
- Modeling and Visualization Reasoning for Stakeholders in Education and Industry Integration Systems: Research on Structured Synthetic Dialogue Data Generation Based on NIST Standards [3.5516803380598074]
This study addresses the structural complexity and semantic ambiguity in stakeholder interactions within the Education-Industry Integration (EII) system. We propose a structural modeling paradigm based on the National Institute of Standards and Technology (NIST) synthetic data quality framework.
arXiv Detail & Related papers (2025-06-20T12:37:43Z)
- Information Science Principles of Machine Learning: A Causal Chain Meta-Framework Based on Formalized Information Mapping [7.299890614172539]
This study addresses key challenges in machine learning, namely the absence of a unified formal theoretical framework and the lack of foundational theories for model interpretability and ethical safety. We first construct a formal information model, explicitly defining the ontological states and carrier mappings of typical machine learning stages. By introducing learnable and processable predicates, as well as learning and processing functions, we analyze the causal chain logic and constraint laws governing machine learning processes.
arXiv Detail & Related papers (2025-05-19T14:39:41Z)
- Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures [49.19753720526998]
We derive theoretical scaling laws for neural network performance on synthetic datasets. We validate that convolutional networks, whose structure aligns with that of the generative process through locality and weight sharing, enjoy a faster scaling of performance. This finding clarifies the architectural biases underlying neural scaling laws and highlights how representation learning is shaped by the interaction between model architecture and the statistical properties of data.
arXiv Detail & Related papers (2025-05-11T17:44:14Z)
- Large Language Models as Quasi-crystals: Coherence Without Repetition in Generative Text [0.0]
This essay proposes an analogy between large language models (LLMs) and quasicrystals, systems that exhibit global coherence without periodic repetition, generated through local constraints. Drawing on the history of quasicrystals, it highlights an alternative mode of coherence in generative language: constraint-based organization without repetition or symbolic intent. The essay aims to reframe the current discussion around large language models, not by rejecting existing methods, but by suggesting an additional axis of interpretation grounded in structure rather than semantics.
arXiv Detail & Related papers (2025-04-16T11:27:47Z)
- Fine-Grained Bias Detection in LLM: Enhancing detection mechanisms for nuanced biases [0.0]
This study presents a detection framework to identify nuanced biases in Large Language Models (LLMs). The approach integrates contextual analysis, interpretability via attention mechanisms, and counterfactual data augmentation to capture hidden biases. Results show improvements in detecting subtle biases compared to conventional methods.
arXiv Detail & Related papers (2025-03-08T04:43:01Z)
- Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning [29.745218855471787]
Tokenization is a necessary component within the current architecture of many language models. We argue that tokenization is necessary for reasonably human-like language performance. We discuss implications for architectural choices, meaning construction, and the primacy of language for thought.
arXiv Detail & Related papers (2024-12-14T18:18:52Z)
- Interpreting token compositionality in LLMs: A robustness analysis [10.777646083061395]
Constituent-Aware Pooling (CAP) is a methodology designed to analyse how large language models process linguistic structures. CAP intervenes in model activations through constituent-based pooling at various model levels. Our findings highlight fundamental limitations in current transformer architectures regarding compositional semantics processing and model interpretability.
arXiv Detail & Related papers (2024-10-16T18:10:50Z)
- The Foundations of Tokenization: Statistical and Computational Concerns [51.370165245628975]
Tokenization is a critical step in the NLP pipeline. Despite its recognized importance as a standard representation method in NLP, the theoretical underpinnings of tokenization are not yet fully understood. The present paper contributes to addressing this theoretical gap by proposing a unified formal framework for representing and analyzing tokenizer models.
arXiv Detail & Related papers (2024-07-16T11:12:28Z)
- What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization [111.55277952086155]
We study In-Context Learning (ICL) by addressing several open questions.
We show that, without updating the neural network parameters, ICL implicitly implements the Bayesian model averaging algorithm.
We prove that the error of the pretrained model is bounded by a sum of an approximation error and a generalization error.
arXiv Detail & Related papers (2023-05-30T21:23:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.