Fugu-MT Paper Translation (Summary): When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified

Paper summary: When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified

arxiv url: http://arxiv.org/abs/2602.07381v1
Date: Sat, 07 Feb 2026 05:52:57 GMT
Status: Translation complete
Last updated in system: 2026-02-10 20:26:24.595694
Title: When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified
Title (reference translation): When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Threatened
Authors: Gautam Siddharth Kashyap, Mark Dras, Usman Naseem
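Abstract summary: Large Language Models (LLMs) must be helpful, harmless, and honest (HHH) in accordance with human values. Existing work uses Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) to align LLMs. The paper proposes AlignX, a two-stage framework that mitigates catastrophic forgetting and improves inference reliability.
Reference score (independently computed attention measure): 19.134202394422285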
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Aligning Large Language Models (LLMs) with human values (being helpful, harmless, and honest, or HHH) is important for safe deployment. Existing works use Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) to align LLMs. However, these works face challenges in multi-objective settings: SFT leads to interference between conflicting objectives, while MoEs suffer from miscalibrated routing. We term this failure mode Axis Collapse, marked by (1) disjoint feature spaces causing catastrophic forgetting, and (2) unreliable inference from misrouted experts. To resolve this, we propose AlignX, a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features, mitigating catastrophic forgetting. Stage 2 deploys a MoCaE module that calibrates expert routing using fractal and natural geometry, improving inference reliability. AlignX achieves significant gains on Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty), with +171.5% win rate, +110.1% in truthfulness-informativeness, and 4.3% fewer safety violations. It also reduces latency and memory usage by over 35% compared to prior MoEs. Results across four LLMs validate its generalizability.
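Abstract (reference translation): For safe deployment, Large Language Models (LLMs) must conform to human values: being helpful, harmless, and honest (HHH). Existing works use Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) to align LLMs. However, these works face challenges in multi-objective settings: SFT causes interference between conflicting objectives, while MoEs suffer from miscalibrated routing. The authors call this failure mode Axis Collapse, characterized by (1) disjoint feature spaces that cause catastrophic forgetting and (2) unreliable inference from misrouted experts. To resolve this, they propose AlignX, a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features, mitigating catastrophic forgetting. Stage 2 deploys a MoCaE module that calibrates expert routing using fractal and natural geometry, improving inference reliability. AlignX achieves significant gains on Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty), with +171.5% win rate, +110.1% in truthfulness-informativeness, and 4.3% fewer safety violations. It also reduces latency and memory usage by over 35% compared to prior MoEs. Results across four LLMs validate its generalizability.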

Related papers