Recursive Knowledge Synthesis for Multi-LLM Systems: Stability Analysis and Tri-Agent Audit Framework
- URL: http://arxiv.org/abs/2601.08839v1
- Date: Wed, 17 Dec 2025 16:42:45 GMT
- Title: Recursive Knowledge Synthesis for Multi-LLM Systems: Stability Analysis and Tri-Agent Audit Framework
- Authors: Toshiyuki Shigemura
- Abstract summary: This paper presents a tri-agent cross-validation framework for analyzing stability and explainability in multi-model large language systems. The architecture integrates three heterogeneous LLMs used for semantic generation, analytical consistency checking, and transparency auditing.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a tri-agent cross-validation framework for analyzing stability and explainability in multi-model large language systems. The architecture integrates three heterogeneous LLMs, used for semantic generation, analytical consistency checking, and transparency auditing, into a recursive interaction cycle. This design induces Recursive Knowledge Synthesis (RKS), where intermediate representations are continuously refined through mutually constraining transformations irreducible to single-model behavior. Across 47 controlled trials using public-access LLM deployments (October 2025), we evaluated system stability via four metrics: Reflex Reliability Score (RRS), Transparency Score (TS), Deviation Detection Rate (DDR), and Correction Success Rate (CSR). The system achieved mean RRS = 0.78 ± 0.06 and maintained TS ≥ 0.8 in about 68% of trials. Approximately 89% of trials converged, supporting the theoretical prediction that transparency auditing acts as a contraction operator within the composite validation mapping. The contributions are threefold: (1) a structured tri-agent framework for coordinated reasoning across heterogeneous LLMs, (2) a formal RKS model grounded in fixed-point theory, and (3) empirical evaluation of inter-model stability under realistic, non-API public-access conditions. These results provide initial empirical evidence that a safety-preserving, human-supervised multi-LLM architecture can achieve stable recursive knowledge synthesis in realistic, publicly deployed environments.
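Read as pseudocode, the abstract describes an iterated composite mapping whose approximate fixed points are the stable syntheses. The Python below is a minimal sketch of that loop under stated assumptions: the `Agent` type, the `similarity` stand-in, the `eps` threshold, and the toy agents are all illustrative inventions, not the authors' implementation, and none of the paper's metrics (RRS, TS, DDR, CSR) are reproduced here.

```python
# Minimal sketch of the tri-agent Recursive Knowledge Synthesis (RKS) cycle,
# assuming the abstract's composite mapping auditor(checker(generator(x))).
# Everything here (types, similarity measure, threshold) is illustrative,
# not the authors' implementation.
from difflib import SequenceMatcher
from typing import Callable, Tuple

Agent = Callable[[str], str]  # an LLM call: representation in, refined representation out

def similarity(a: str, b: str) -> float:
    # Crude textual stand-in for whatever metric the composite mapping contracts under.
    return SequenceMatcher(None, a, b).ratio()

def rks_cycle(generator: Agent, checker: Agent, auditor: Agent,
              prompt: str, max_rounds: int = 10, eps: float = 0.02) -> Tuple[str, bool]:
    """Iterate the composite validation mapping until successive representations
    are nearly identical (an approximate fixed point) or the budget runs out."""
    state = prompt
    for _ in range(max_rounds):
        refined = auditor(checker(generator(state)))  # one tri-agent round
        if 1.0 - similarity(state, refined) < eps:    # the contraction has settled
            return refined, True
        state = refined
    return state, False  # trial counted as non-convergent

# Toy agents that merely normalize text; real deployments would call three
# heterogeneous LLMs here.
if __name__ == "__main__":
    gen = lambda x: x.strip()
    chk = lambda x: " ".join(x.split())
    aud = lambda x: x.capitalize()
    final, converged = rks_cycle(gen, chk, aud, "  summarize   the evidence.  ")
    print(converged, repr(final))  # True 'Summarize the evidence.'
```

On this reading, the claim that transparency auditing acts as a contraction operator amounts to saying each pass shrinks the distance between successive iterates, which is what a Banach-style fixed-point argument needs for the loop above to settle.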
Related papers
- Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure [58.89643769707751]
We study latent chain-of-thought as a manipulable causal process in representation space. We find that latent-step budgets behave less like homogeneous extra depth and more like staged functionality with non-local routing. These results motivate mode-conditional and stability-aware analyses as more reliable tools for interpreting and improving latent reasoning systems.
arXiv Detail & Related papers (2026-02-09T15:25:12Z) - EvalQReason: A Framework for Step-Level Reasoning Evaluation in Large Language Models [0.8399688944263844]
We present EvalQReason, a framework that quantifies LLM reasoning quality through step-level probability distribution analysis. The framework introduces two complementary algorithms: Consecutive Step Divergence (CSD), which measures local coherence between adjacent reasoning steps, and Step-to-Final Convergence (SFC), which assesses global alignment with final answers (a minimal illustrative sketch of this kind of step-level scoring appears after this list).
arXiv Detail & Related papers (2026-02-02T16:32:40Z) - Trust in One Round: Confidence Estimation for Large Language Models via Structural Signals [13.89434979851652]
Large language models (LLMs) are increasingly deployed in domains where errors carry high social, scientific, or safety costs. We present Structural Confidence, a single-pass, model-agnostic framework that enhances output correctness prediction.
arXiv Detail & Related papers (2026-02-01T02:35:59Z) - Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking [64.97768177044355]
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems. We present FactArena, a fully automated arena-style evaluation framework. Our analyses reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.
arXiv Detail & Related papers (2026-01-06T02:51:56Z) - BHaRNet: Reliability-Aware Body-Hand Modality Expertized Networks for Fine-grained Skeleton Action Recognition [12.973657570368317]
This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion
arXiv Detail & Related papers (2026-01-01T15:13:11Z) - CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks [96.64597365827046]
We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks. We introduce a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. We show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks.
arXiv Detail & Related papers (2025-11-01T04:37:01Z) - Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning [29.778703252962092]
Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs). We develop a novel test-time reward mechanism that operates without external supervision.
arXiv Detail & Related papers (2025-10-20T07:53:51Z) - TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistency: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z) - LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting. We introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z) - Enhancing ALS Progression Tracking with Semi-Supervised ALSFRS-R Scores Estimated from Ambient Home Health Monitoring [3.210027230758067]
Self-attention achieved the lowest prediction error for subscale-level models. We identified distinct homogeneity-heterogeneity profiles across functional domains.
arXiv Detail & Related papers (2025-07-13T02:56:40Z) - CTRLS: Chain-of-Thought Reasoning via Latent State-Transition [57.51370433303236]
Chain-of-thought (CoT) reasoning enables large language models to break down complex problems into interpretable intermediate steps. We introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions. We show improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.
arXiv Detail & Related papers (2025-07-10T21:32:18Z) - Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions [8.069858557211132]
Large Language Models (LLMs) have shown remarkable capabilities across various tasks. Their deployment in high-stakes domains requires consistent and coherent behavior across multiple rounds of user interaction. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency.
arXiv Detail & Related papers (2025-03-28T11:49:56Z)
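As flagged in the EvalQReason entry above, its two scores lend themselves to a small sketch. The divergence choice (Jensen-Shannon), the per-step distribution format, and the mean aggregation below are assumptions inferred from the one-line summary, not the paper's definitions.

```python
# Illustrative sketch of step-level scoring in the spirit of EvalQReason's
# CSD and SFC, based only on the one-line summary above. The divergence
# measure and aggregation are assumptions, not the paper's method.
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric, bounded divergence between two probability vectors."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def consecutive_step_divergence(step_dists: list) -> float:
    """CSD-style local coherence: mean divergence between adjacent reasoning
    steps (lower = more coherent chain)."""
    return float(np.mean([js_divergence(p, q)
                          for p, q in zip(step_dists, step_dists[1:])]))

def step_to_final_convergence(step_dists: list) -> float:
    """SFC-style global alignment: mean agreement of each intermediate step
    with the final-answer distribution (higher = better aligned)."""
    final = step_dists[-1]
    return float(np.mean([1.0 - js_divergence(p, final) for p in step_dists[:-1]]))

# Toy usage: three reasoning steps drifting toward the final distribution.
steps = [np.array([0.7, 0.2, 0.1]),
         np.array([0.6, 0.3, 0.1]),
         np.array([0.5, 0.4, 0.1])]
print(consecutive_step_divergence(steps), step_to_final_convergence(steps))
```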