Fugu-MT Paper Translation (Abstract): IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement

Paper overview: IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement

arxiv url: http://arxiv.org/abs/2508.20151v1
Date: Wed, 27 Aug 2025 16:47:31 GMT
Status: Translation complete
Last updated in system: 2025-08-29 18:12:01.626395
Title: IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement
Title (reference translation): IntentionReasoner: Realizing Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement
Authors: Yuanzhe Shen, Zisu Huang, Zhengkang Guo, Yide Liu, Guanxu Chen, Ruicheng Yin, Xiaoqing Zheng, Xuanjing Huang
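Abstract summary: IntentionReasoner is a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning. IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios.
Reference score (independently computed attention score): 35.904652937034136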
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid advancement of large language models (LLMs) has driven their adoption across diverse domains, yet their ability to generate harmful content poses significant safety challenges. While extensive research has focused on mitigating harmful outputs, such efforts often come at the cost of excessively rejecting harmless prompts. Striking a balance among safety, over-refusal, and utility remains a critical challenge. In this work, we introduce IntentionReasoner, a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning, multi-level safety classification, and query rewriting to neutralize potentially harmful intent in edge-case queries. Specifically, we first construct a comprehensive dataset comprising approximately 163,000 queries, each annotated with intent reasoning, safety labels, and rewritten versions. Supervised fine-tuning is then applied to equip the guard model with foundational capabilities in format adherence, intent analysis, and safe rewriting. Finally, we apply a tailored multi-reward optimization strategy that integrates rule-based heuristics and reward model signals within a reinforcement learning framework to further enhance performance. Extensive experiments show that IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios, significantly enhancing safety while effectively reducing over-refusal rates and improving the quality of responses.
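Abstract (reference translation): The rapid advancement of large language models (LLMs) has driven their adoption across diverse domains, yet their ability to generate harmful content poses significant safety challenges. Extensive research has focused on mitigating harmful outputs, but such efforts often come at the cost of excessively rejecting harmless prompts. Striking a balance among safety, over-refusal, and utility remains a critical challenge. IntentionReasoner is a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning, multi-level safety classification, and query rewriting to neutralize potentially harmful intent in edge-case queries. Specifically, a comprehensive dataset of approximately 163,000 queries is first constructed, each annotated with intent reasoning, safety labels, and rewritten versions. Supervised fine-tuning is then applied to equip the guard model with foundational capabilities in format adherence, intent analysis, and safe rewriting. Finally, a multi-reward optimization strategy that integrates rule-based heuristics and reward model signals is applied within a reinforcement learning framework to further enhance performance. Extensive experiments show that IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios, significantly enhancing safety while effectively reducing over-refusal rates and improving the quality of responses.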

Related papers