FuguReport

Enhancing LLM Metacognition via Cognitive Pairwise Training

Authors: Weitao Li, Hao Zhou, Xuanyu Lei, Fandong Meng, Yuanhang Liu, Jingyi Ren, Ante Wang, Xiaolong Wang, Yuanchi Zhang, Fuwen Luo, Guangwen Yang, Lin Gan, Weizhi Ma, Yang Liu
Affiliations: Tsinghua University / Tencent
Categories: Method / Model Training / Intermediate alignment stage training; Task / Confidence Estimation / Enhancing metacognitive confidence; Evaluation / Model Calibration / Uncertainty and confidence evaluation
License: CC BY 4.0

Abstract Overview

This paper proposes Cognitive Pairwise Training (CPT), a metacognitive mid-training stage that teaches language models to compare pairs of reasoning traces and distinguish more trustworthy reasoning from flawed reasoning. The method constructs difficulty-balanced pairwise data from multi-model rollouts, labels pairs with a strong teacher using a four-way comparison scheme, and then trains the policy model on these comparative judgments before standard math SFT and RL. The central motivation is that outcome-level RL rewards can improve answer accuracy while weakening a model’s ability to recognize uncertainty and abstain appropriately. Across multiple model scales and families, the authors evaluate whether CPT improves the trade-off between mathematical reasoning performance and prompt-independent abstention.
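As a rough illustration of what "difficulty-balanced" pair construction could look like, the sketch below buckets problems by their empirical pass rate across rollouts and samples the same number of trace pairs from each bucket. The record layout (`correct` flag per rollout), the bucket count, and the function names are all assumptions made for this example; the summary does not specify how the paper measures difficulty or balances it.

```python
import random
from collections import defaultdict


def pass_rate(rollouts):
    """Fraction of rollouts whose final answer was marked correct."""
    return sum(r["correct"] for r in rollouts) / len(rollouts)


def difficulty_bucket(rate, n_buckets=4):
    """Map a pass rate in [0, 1] to a bucket index in [0, n_buckets - 1]."""
    return min(int(rate * n_buckets), n_buckets - 1)


def balanced_pairs(problems, per_bucket, seed=0):
    """Sample up to `per_bucket` trace pairs from each difficulty bucket.

    `problems` maps a problem id to a list of rollout dicts (hypothetical
    layout: each dict has at least a boolean "correct" field).
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for pid, rollouts in problems.items():
        if len(rollouts) < 2:
            continue  # a pair needs at least two traces
        buckets[difficulty_bucket(pass_rate(rollouts))].append(pid)

    pairs = []
    for b in sorted(buckets):
        pids = buckets[b]
        for pid in rng.sample(pids, min(per_bucket, len(pids))):
            trace_a, trace_b = rng.sample(problems[pid], 2)
            pairs.append(
                {"problem": pid, "bucket": b, "trace_a": trace_a, "trace_b": trace_b}
            )
    return pairs
```

Capping each bucket at the same size keeps easy problems, which tend to dominate raw rollout data, from crowding out the harder cases where trace comparison is presumably most informative.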

Novelty

The distinctive contribution is framing metacognitive alignment as an intermediate pairwise reasoning-trace comparison task rather than as response-side refusal tuning or post-hoc calibration. CPT uses a reusable four-way comparative supervision signal over reasoning traces to help models internalize a reasoning-quality boundary that is intended to persist through later RL.

Results

Across Qwen3 models from 4B to 14B, CPT+RL achieves the best reported math average at each scale while remaining among the strongest methods on Normal-Prompt abstention. At 14B, it improves over the standard SFT+RL pipeline by 2.2 points on math average and 5.6 points on Normal-Prompt abstention F1. The paper also reports that CPT better preserves abstention under subsequent RL, transfers zero-shot to conflicting-source RAG settings, and remains effective with a self-distilled 32B in-house judge.
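For readers unfamiliar with abstention F1, the following is a minimal sketch of one common way to compute it: treat "the model should abstain" as the positive class and score the model's abstain decisions with the usual F1 formula. This definition and the function name are assumptions for illustration; the summary does not state how the paper decides when abstention is correct.

```python
def abstention_f1(should_abstain, did_abstain):
    """F1 over abstention decisions, with "should abstain" as the positive class.

    Both arguments are equal-length sequences of booleans, one entry per question.
    """
    if len(should_abstain) != len(did_abstain):
        raise ValueError("inputs must have the same length")

    pairs = list(zip(should_abstain, did_abstain))
    tp = sum(1 for s, d in pairs if s and d)        # abstained when it should
    fp = sum(1 for s, d in pairs if not s and d)    # abstained needlessly
    fn = sum(1 for s, d in pairs if s and not d)    # answered when it should not

    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a model that never abstains scores 0, and a model that always abstains is penalized through precision, so the metric rewards selective abstention rather than blanket refusal.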

Key Points

  1. CPT trains models to compare paired reasoning traces, using intra-model, inter-model, and counter-intuitive pairs plus self-consistent teacher labeling to supervise reasoning-quality discrimination.
  2. The main empirical claim is an improved reasoning-metacognition trade-off: better or competitive math performance together with stronger prompt-independent abstention than standard SFT+RL, DPO+RL, and abstention-RL baselines.
  3. Analysis suggests CPT changes reasoning behavior rather than only surface refusal behavior, with better trace quality on controlled pairwise audits and greater robustness to the abstention degradation often introduced by math RL.
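The idea of self-consistent four-way labeling in point 1 can be sketched as follows. This is one plausible reading, not the paper's confirmed procedure: the judge is queried with the two traces in both orders, and a label is kept only if the two verdicts agree once the swap is accounted for. The four category names and the `judge` callable are hypothetical, since the summary does not name the four classes.

```python
from enum import Enum


class Verdict(Enum):
    # Hypothetical four-way scheme; the actual category names are not given.
    A_BETTER = "a_better"
    B_BETTER = "b_better"
    BOTH_SOUND = "both_sound"
    BOTH_FLAWED = "both_flawed"


# How a verdict reads after the two traces trade places.
_SWAPPED = {
    Verdict.A_BETTER: Verdict.B_BETTER,
    Verdict.B_BETTER: Verdict.A_BETTER,
    Verdict.BOTH_SOUND: Verdict.BOTH_SOUND,
    Verdict.BOTH_FLAWED: Verdict.BOTH_FLAWED,
}


def consistent_label(judge, problem, trace_a, trace_b):
    """Return the judge's verdict if it is stable under order swap, else None.

    `judge(problem, first, second)` is an assumed callable returning a Verdict
    about (first, second); in practice it would wrap a teacher-model call.
    """
    forward = judge(problem, trace_a, trace_b)
    backward = judge(problem, trace_b, trace_a)
    return forward if _SWAPPED[backward] == forward else None
```

Pairs for which `consistent_label` returns `None` would simply be dropped from the training set, which filters out verdicts driven by position bias rather than by the content of the traces.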
