Fugu-MT paper translation (summary): Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

Paper summary: Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

arxiv url: http://arxiv.org/abs/2604.12967v1
Date: Tue, 14 Apr 2026 17:00:18 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.574285
Title: Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training
Authors: Sohyun An, Shuibenyang Yuan, Hayeon Lee, Cho-Jui Hsieh, Alexander Min,
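Abstract summary: Cycle-Consistent Search (CCS) is a gold-supervision-free framework for training search agents. CCS achieves performance comparable to supervised baselines. These results suggest that CCS offers a scalable paradigm for training search agents in settings where gold supervision is unavailable.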
Reference score (independently computed attention measure): 80.20022221643414
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, serves as a lossless encoding of the question's intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, thereby inducing a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to information leakage, as reconstruction may rely on superficial lexical cues rather than the underlying search process. To reduce this effect, we apply information bottlenecks, including exclusion of the final response and named entity recognition (NER) masking of search queries. These constraints force reconstruction to rely on retrieved observations together with the structural scaffold, ensuring that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to supervised baselines while outperforming prior methods that do not rely on gold supervision. These results suggest that CCS provides a scalable training paradigm for training search agents in settings where gold supervision is unavailable.
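The reward described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Step` and `Trajectory` types, the regex stand-in for NER, the `reconstruct` callable (which would be a language model in practice), and the token-level F1 similarity are all assumptions introduced here, since the abstract does not specify them.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    query: str        # search query issued by the agent
    observation: str  # text retrieved for that query


@dataclass
class Trajectory:
    steps: List[Step]
    final_response: str


# Crude stand-in for a real NER model (assumption): capitalized words and
# numbers are treated as named entities.
_ENTITY = re.compile(r"\b(?:[A-Z]\w*|\d+)\b")


def mask_entities(query: str, mask: str = "[MASK]") -> str:
    """Replace entity-like tokens in a search query with a mask token."""
    return _ENTITY.sub(mask, query)


def bottleneck(traj: Trajectory) -> str:
    """Serialize a trajectory under the two information bottlenecks:
    queries are entity-masked and the final response is left out."""
    lines = []
    for i, step in enumerate(traj.steps, 1):
        lines.append(f"query {i}: {mask_entities(step.query)}")
        lines.append(f"observation {i}: {step.observation}")
    return "\n".join(lines)


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two strings (hypothetical similarity metric)."""
    ta = re.findall(r"\w+", a.lower())
    tb = re.findall(r"\w+", b.lower())
    if not ta or not tb:
        return 0.0
    common = sum(min(ta.count(t), tb.count(t)) for t in set(ta))
    if common == 0:
        return 0.0
    precision, recall = common / len(ta), common / len(tb)
    return 2 * precision * recall / (precision + recall)


def cycle_reward(question: str, traj: Trajectory,
                 reconstruct: Callable[[str], str]) -> float:
    """Score how well the question can be rebuilt from the bottlenecked trajectory."""
    return token_f1(reconstruct(bottleneck(traj)), question)
```

In this sketch the reconstructor sees only masked queries and raw observations, so a high reward has to come from what was retrieved rather than from entity names copied out of the queries or from the final answer.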

Related papers