Fugu-MT paper translation (abstract): Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing

Paper overview: Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing

arxiv url: http://arxiv.org/abs/2510.01243v1
Date: Wed, 24 Sep 2025 03:40:32 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.7469
Title: Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing
Authors: Yisong Xiao, Aishan Liu, Siyuan Liang, Zonghao Ying, Xianglong Liu, Dacheng Tao
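Abstract summary: Large language models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content. We propose Autoregressive Reward Guided Representation Editing (ARGRE). ARGRE explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing.
Reference score (attention measure computed by this site): 77.75609817898035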
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content, necessitating detoxification strategies to ensure safe and responsible deployment. Test-time detoxification methods, which typically introduce static or dynamic interventions into LLM representations, offer a promising solution due to their flexibility and minimal invasiveness. However, current approaches often suffer from imprecise interventions, primarily due to their insufficient exploration of the transition space between toxic and non-toxic outputs. To address this challenge, we propose Autoregressive Reward Guided Representation Editing (ARGRE), a novel test-time detoxification framework that explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing. ARGRE identifies non-toxic semantic directions and interpolates between toxic and non-toxic representations to reveal fine-grained transition trajectories. These trajectories transform sparse toxicity annotations into dense training signals, enabling the construction of an autoregressive reward model that delivers stable and precise editing guidance. At inference, the reward model guides an adaptive two-step editing process to obtain detoxified representations: it first performs directional steering based on expected reward gaps to shift representations toward non-toxic regions, followed by lightweight gradient-based refinements. Extensive experiments across 8 widely used LLMs show that ARGRE significantly outperforms leading baselines in effectiveness (-62.21% toxicity) and efficiency (-47.58% inference time), while preserving the core capabilities of the original model with minimal degradation. Our code is available at the website.
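
The interpolation and two-step editing described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`interpolate_trajectory`, `linear_reward`, `two_step_edit`), the linear reward model, and all parameters are hypothetical stand-ins chosen so the example stays small and runnable with NumPy alone. The abstract's reward model is a learned autoregressive model, and its exact steering rule is not given here.

```python
import numpy as np


def interpolate_trajectory(h_toxic, h_nontoxic, num_steps=5):
    """Linearly interpolate from a toxic to a non-toxic representation.

    Assumption: plain linear interpolation; the paper's exact scheme may differ.
    Returns an array of shape (num_steps, dim).
    """
    alphas = np.linspace(0.0, 1.0, num_steps)
    return np.stack([(1.0 - a) * h_toxic + a * h_nontoxic for a in alphas])


def linear_reward(h, w, b=0.0):
    """Toy reward (higher means less toxic).

    Assumption: a linear scorer stands in for the learned reward model.
    """
    return float(h @ w + b)


def two_step_edit(h, direction, w, target_reward, b=0.0, n_refine=3, lr=0.1):
    """Hypothetical two-step edit under a linear reward.

    Step 1: move along `direction` far enough to close the reward gap.
    Step 2: take a few small gradient-ascent steps on the reward.
    """
    d = direction / np.linalg.norm(direction)
    gap = max(0.0, target_reward - linear_reward(h, w, b))
    slope = float(w @ d)  # reward gained per unit step along d
    if slope > 0.0:
        h = h + (gap / slope) * d
    for _ in range(n_refine):
        h = h + lr * w  # gradient of a linear reward w.r.t. h is w
    return h
```

In this sketch, the interpolated points give one reward value per step instead of a single label per example, which mirrors the abstract's idea of turning sparse annotations into dense signals. A real system would apply the edit to hidden states of an LLM at inference time rather than to hand-made vectors.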
