The Necessity of Imperfection: Reversing Model Collapse via Simulating Cognitive Boundedness
- URL: http://arxiv.org/abs/2512.01354v3
- Date: Mon, 08 Dec 2025 22:57:17 GMT
- Title: The Necessity of Imperfection: Reversing Model Collapse via Simulating Cognitive Boundedness
- Authors: Zhongjie Jiang
- Abstract summary: This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF) that reverse-engineers unstructured text into structured cognitive vectors. Our findings demonstrate that modelling human cognitive limitations -- not copying surface data -- enables synthetic data with genuine functional gain.
- Score: 0.284279467589473
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although synthetic data is widely promoted as a remedy, its prevailing production paradigm -- one optimizing for statistical smoothness -- systematically removes the long-tail, cognitively grounded irregularities that characterize human text. Prolonged training on such statistically optimal but cognitively impoverished data accelerates model collapse. This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF), whose core consists of a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators. The framework is validated through a two-stage objective evaluation pipeline. First, in cognitive codec verification, CTE text yields a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves an intraclass correlation coefficient ICC > 0.9 for cognitive profile alignment across heterogeneous models. Second, in functional gain evaluation, isomorphic stress tests in the A-share market show that strategies incorporating CTE-generated data reduce maximum drawdown by 47.4% during the 2015 crash and deliver 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33. Our findings demonstrate that modelling human cognitive limitations -- not copying surface data -- enables synthetic data with genuine functional gain, offering a viable technical pathway toward resolving the AI data-collapse crisis.
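The abstract's headline comparison is a Jensen-Shannon divergence of 0.0614 between CTE text and human text, versus 0.4431 for standard LLM output. As an illustrative sketch only (the paper does not specify its feature space; unigram token distributions and base-2 logarithms are assumptions here), such a divergence could be computed as:

```python
import math
from collections import Counter

def distribution(tokens):
    """Normalize token counts into a probability distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def kl(p, q):
    """KL divergence D(p || q); q must cover p's support."""
    return sum(pv * math.log2(pv / q[t]) for t, pv in p.items() if pv > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, symmetric, bounded in [0, 1])."""
    support = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in support}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

human = distribution("the cat sat on the mat".split())
same = distribution("the cat sat on the mat".split())
print(js_divergence(human, same))  # 0.0 for identical distributions

model = distribution("the cat sat on a mat".split())
print(round(js_divergence(human, model), 4))
```

A lower score means the synthetic distribution sits closer to the human one, which is how the 0.0614 vs. 0.4431 gap should be read.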
Related papers
- Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset [0.764671395172401]
We present a diagnostic evaluation of nine machine learning algorithms on the LIAR benchmark. We uncover a hard "Performance Ceiling": fine-grained classification does not exceed a weighted F1-score of 0.32 across models. We also find a massive "Generalization Gap" in tree-based ensembles, which achieve more than 99% training accuracy but collapse to approximately 25% on test data.
arXiv Detail & Related papers (2025-12-20T23:08:18Z) - A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis [82.01597026329158]
We introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS) for pathology-specific text-to-image synthesis. CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations.
arXiv Detail & Related papers (2025-12-15T10:22:43Z) - Uncertainty-Aware Data-Efficient AI: An Information-Theoretic Perspective [48.073471560778984]
In context-specific applications such as robotics, telecommunications, and healthcare, artificial intelligence systems often face the challenge of limited training data. This review paper examines formal methodologies that address data-limited regimes through two complementary approaches.
arXiv Detail & Related papers (2025-12-04T21:44:22Z) - The Catastrophic Paradox of Human Cognitive Frameworks in Large Language Model Evaluation: A Comprehensive Empirical Analysis of the CHC-LLM Incompatibility [0.0]
Models achieving above-average human IQ scores simultaneously exhibit binary accuracy rates approaching zero on crystallized knowledge tasks. This disconnect appears most strongly in the crystallized intelligence domain. We propose a framework for developing native machine cognition assessments that recognize the non-human nature of artificial intelligence.
arXiv Detail & Related papers (2025-11-23T05:49:57Z) - Knowledge-based anomaly detection for identifying network-induced shape artifacts [3.29352273631268]
This work introduces a novel knowledge-based anomaly detection method for detecting network-induced shape artifacts in synthetic images. We demonstrate the effectiveness of the method for identifying network-induced shape artifacts in two synthetic mammography datasets.
arXiv Detail & Related papers (2025-11-06T18:19:49Z) - AI Agentic Vulnerability Injection And Transformation with Optimized Reasoning [2.918225266151982]
We present AVIATOR, the first AI-agentic vulnerability injection workflow. It automatically injects realistic, category-specific vulnerabilities for high-fidelity, diverse, large-scale vulnerability dataset generation. It combines semantic analysis, injection synthesis enhanced with LoRA-based fine-tuning and Retrieval-Augmented Generation, and post-injection validation via static analysis and LLM-based discriminators.
arXiv Detail & Related papers (2025-08-28T14:59:39Z) - Valid Inference with Imperfect Synthetic Data [39.10587411316875]
We introduce a new estimator based on the generalized method of moments. We find that interactions between the moment residuals of synthetic data and those of real data can greatly improve estimates of the target parameter.
arXiv Detail & Related papers (2025-08-08T18:32:52Z) - How Good Are Synthetic Requirements ? Evaluating LLM-Generated Datasets for AI4RE [0.5156484100374059]
This paper presents an enhanced Product Line approach for generating synthetic requirements data. We investigate four research questions assessing how prompting strategies, automated prompt optimization, and post-generation curation affect data quality. Our results show that synthetic requirements can match or outperform human-authored ones for specific tasks.
arXiv Detail & Related papers (2025-06-26T10:52:07Z) - Dynamic Programming Techniques for Enhancing Cognitive Representation in Knowledge Tracing [125.75923987618977]
We propose the Cognitive Representation Dynamic Programming based Knowledge Tracing (CRDP-KT) model. It uses a dynamic programming algorithm to optimize cognitive representations based on question difficulty and the performance intervals between questions. This provides more accurate and systematic input features for subsequent model training, thereby minimizing distortion in the simulation of cognitive states.
arXiv Detail & Related papers (2025-06-03T14:44:48Z) - Shifting AI Efficiency From Model-Centric to Data-Centric Compression [67.45087283924732]
We argue that the focus of AI research is shifting from model-centric compression to data-centric compression. Data-centric compression improves AI efficiency by directly compressing the volume of data processed during model training or inference. Our work aims to provide a novel perspective on AI efficiency, synthesize existing efforts, and catalyze innovation to address the challenges posed by ever-increasing context lengths.
arXiv Detail & Related papers (2025-05-25T13:51:17Z) - Learning to Generate and Evaluate Fact-checking Explanations with Transformers [10.970249299147866]
This research contributes to the field of Explainable Artificial Intelligence (XAI).
We develop transformer-based fact-checking models that contextualise and justify their decisions by generating human-accessible explanations.
We emphasise the need for aligning Artificial Intelligence (AI)-generated explanations with human judgements.
arXiv Detail & Related papers (2024-10-21T06:22:51Z) - SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data [78.70620682374624]
We introduce SynFER, a novel framework for synthesizing facial expression image data based on high-level textual descriptions. To ensure the quality and reliability of the synthetic data, we propose a semantic guidance technique and a pseudo-label generator. Results validate the efficacy of our approach and the synthetic data.
arXiv Detail & Related papers (2024-10-13T14:58:21Z) - Generative Model-Driven Synthetic Training Image Generation: An Approach
to Cognition in Rail Defect Detection [12.584718477246382]
This study proposes a VAE-based synthetic image generation technique for rail defects.
It is applied to create a synthetic dataset for the Canadian Pacific Railway.
500 synthetic samples are generated with a minimal reconstruction loss of 0.021.
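The summary quotes a minimal reconstruction loss of 0.021 without naming the loss function. As a hedged sketch, a per-pixel mean-squared-error term (one common choice for the reconstruction part of a VAE objective; the paper's exact formulation is an open assumption here) can be computed as:

```python
def mse_reconstruction_loss(original, reconstructed):
    """Mean squared error between an image and its reconstruction,
    both given as flat lists of pixel intensities in [0, 1]."""
    assert len(original) == len(reconstructed)
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

original = [0.0, 0.5, 1.0, 0.25]
reconstructed = [0.1, 0.45, 0.9, 0.30]
print(round(mse_reconstruction_loss(original, reconstructed), 4))
```

A loss near zero indicates the decoder reproduces its inputs faithfully, which is the sense in which 0.021 is reported as "minimal".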
arXiv Detail & Related papers (2023-12-31T04:34:58Z) - A Discrepancy Aware Framework for Robust Anomaly Detection [51.710249807397695]
We present a Discrepancy Aware Framework (DAF), which demonstrates robust performance consistently with simple and cheap strategies.
Our method leverages an appearance-agnostic cue to guide the decoder in identifying defects, thereby alleviating its reliance on synthetic appearance.
Under simple synthesis strategies, it outperforms existing methods by a large margin and achieves state-of-the-art localization performance.
arXiv Detail & Related papers (2023-10-11T15:21:40Z) - CausalDialogue: Modeling Utterance-level Causality in Conversations [83.03604651485327]
We compile and expand a new dataset called CausalDialogue through crowd-sourcing.
This dataset includes multiple cause-effect pairs within a directed acyclic graph (DAG) structure.
We propose a causality-enhanced method called Exponential Average Treatment Effect (ExMATE) to enhance the impact of causality at the utterance level in training neural conversation models.
arXiv Detail & Related papers (2022-12-20T18:31:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents (including all listed content) and is not responsible for any consequences arising from its use.