Fugu-MT Paper Translation (Summary): Unlocking the Pre-Trained Model as a Dual-Alignment Calibrator for Post-Trained LLMs

Paper Summary: Unlocking the Pre-Trained Model as a Dual-Alignment Calibrator for Post-Trained LLMs

arxiv url: http://arxiv.org/abs/2601.04277v1
Date: Wed, 07 Jan 2026 12:39:11 GMT
Status: Translation complete
Last updated in system: 2026-01-09 17:01:52.853642
Title: Unlocking the Pre-Trained Model as a Dual-Alignment Calibrator for Post-Trained LLMs
Title (reference translation): Unlocking the Pre-Trained Model as a Dual-Alignment Calibrator for Post-Trained LLMs
Authors: Beier Luo, Cheng Wang, Hongxin Wei, Sharon Li, Xuefeng Du
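Abstract summary: Post-training improves large language models (LLMs) but often worsens confidence calibration, leading to systematic overconfidence. Unsupervised post-hoc methods for post-trained LMs (PoLMs) mitigate this by aligning PoLM confidence with that of well-calibrated pre-trained counterparts. We show that calibration errors arise from two regimes: confidence drift, where final confidence inflates despite largely consistent intermediate decision processes, and process drift, where intermediate inference pathways diverge.
Reference score (in-house attention score): 29.454825941938054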
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Post-training improves large language models (LLMs) but often worsens confidence calibration, leading to systematic overconfidence. Recent unsupervised post-hoc methods for post-trained LMs (PoLMs) mitigate this by aligning PoLM confidence to that of well-calibrated pre-trained counterparts. However, framing calibration as static output-distribution matching overlooks the inference-time dynamics introduced by post-training. In particular, we show that calibration errors arise from two regimes: (i) confidence drift, where final confidence inflates despite largely consistent intermediate decision processes, and (ii) process drift, where intermediate inference pathways diverge. Guided by this diagnosis, we propose Dual-Align, an unsupervised post-hoc framework for dual alignment in confidence calibration. Dual-Align performs confidence alignment to correct confidence drift via final-distribution matching, and introduces process alignment to address process drift by locating the layer where trajectories diverge and realigning the stability of subsequent inference. This dual strategy learns a single temperature parameter that corrects both drift types without sacrificing post-training performance gains. Experiments show consistent improvements over baselines, reducing calibration errors and approaching a supervised oracle.
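Abstract (reference translation): Post-training improves large language models (LLMs) but often worsens confidence calibration, leading to systematic overconfidence. Unsupervised post-hoc methods for post-trained LMs (PoLMs) mitigate this by aligning PoLM confidence with that of well-calibrated pre-trained models. However, treating calibration as static output-distribution matching ignores the inference-time dynamics introduced by post-training. In particular, we show that calibration errors arise from two regimes: (i) confidence drift, where final confidence inflates despite largely consistent intermediate decision processes, and (ii) process drift, where intermediate inference pathways diverge. Guided by this diagnosis, we propose Dual-Align, an unsupervised post-hoc framework for dual alignment in confidence calibration. Dual-Align performs confidence alignment to correct confidence drift via final-distribution matching. It also introduces process alignment to address process drift by locating the layer where trajectories diverge and realigning the stability of subsequent inference. This dual strategy learns a single temperature parameter that corrects both drift types without sacrificing the performance gains of post-training. Experiments show consistent improvements over baselines, reducing calibration error and approaching a supervised oracle.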
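The abstract describes two ingredients. The first is confidence alignment, which matches the PoLM's final distribution to the pre-trained model's using a single temperature. The second is process alignment, which starts by locating the layer where inference trajectories diverge. The NumPy sketch below illustrates both ideas under stated assumptions. It is not the authors' Dual-Align implementation. Several choices are assumptions:

- the KL objective;
- the grid search over T;
- the use of per-layer "logit-lens" style logits;
- the Jensen-Shannon criterion and its 0.1 threshold.

The helper names (`fit_temperature`, `find_divergence_layer`) are hypothetical. The sketch also does not show how Dual-Align folds both signals into one temperature, because the abstract does not specify that.

```python
from typing import Optional

import numpy as np


def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Row-wise softmax of logits / temperature, shifted by the row max for stability."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mean_kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Mean over rows of KL(p || q); eps guards against log(0)."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))


def fit_temperature(post_logits: np.ndarray, pre_probs: np.ndarray,
                    grid: Optional[np.ndarray] = None) -> float:
    """Confidence alignment (hypothetical sketch): choose the single temperature T
    that brings softmax(post_logits / T) closest, in mean KL, to the pre-trained
    model's final distribution. No labels are used, so the fit is unsupervised."""
    if grid is None:
        grid = np.geomspace(0.05, 20.0, 2000)  # assumed search range for T
    losses = [mean_kl(pre_probs, softmax(post_logits, t)) for t in grid]
    return float(grid[int(np.argmin(losses))])


def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Mean Jensen-Shannon divergence between row-wise distributions p and q."""
    m = 0.5 * (p + q)
    return 0.5 * mean_kl(p, m, eps) + 0.5 * mean_kl(q, m, eps)


def find_divergence_layer(pre_layer_logits: np.ndarray, post_layer_logits: np.ndarray,
                          threshold: float = 0.1) -> Optional[int]:
    """Process drift (hypothetical criterion): return the index of the first layer
    whose intermediate predictive distributions (e.g. read out with a logit lens)
    differ between the two models by more than `threshold` in mean JS divergence.
    Return None if no layer does. Inputs have shape [layers, tokens, vocab]."""
    for layer, (a, b) in enumerate(zip(pre_layer_logits, post_layer_logits)):
        if js_divergence(softmax(a), softmax(b)) > threshold:
            return layer
    return None


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pre = rng.normal(size=(256, 10))  # stand-in pre-trained final logits
    post = 3.0 * pre                  # stand-in PoLM logits: same ranking, sharper
    print("fitted T:", fit_temperature(post, softmax(pre)))
```

In the synthetic demo, the "PoLM" logits are the pre-trained logits scaled by 3. This mimics confidence drift: the ranking is unchanged but confidence is inflated. The fitted temperature lands close to 3, which undoes the sharpening. Process drift is instead what `find_divergence_layer` is meant to flag: the layer index where the two models' intermediate predictions part ways.

A grid over T is used here for transparency. Cross-entropy to a fixed target is convex in 1/T, so a one-dimensional optimizer over 1/T would also work.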

