Fugu-MT Paper Translation (Abstract): PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity

Paper summary: PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity

arxiv url: http://arxiv.org/abs/2510.04080v1
Date: Sun, 05 Oct 2025 07:57:26 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.444289
Title: PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity
Authors: Zixin Song, Bowen Zhang, Qian-Wen Zhang, Di Yin, Xing Sun, Chunping Li
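Abstract summary: This paper introduces PoLi-RL, a novel Point-to-List Reinforcement Learning framework. PoLi-RL first trains the model with simple pointwise rewards to establish fundamental scoring capabilities, then transitions to a hybrid reward that combines pointwise, pairwise, and listwise objectives to refine the model's ability to discern subtle semantic distinctions. On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture.
Reference score (attention score computed independently by this site): 22.289473489488955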
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Conditional Semantic Textual Similarity (C-STS) measures the semantic proximity between text segments under a specific condition, thereby overcoming the ambiguity inherent in traditional STS. However, existing methods are largely confined to discriminative models, failing to fully integrate recent breakthroughs in the NLP community concerning Large Language Models (LLMs) and Reinforcement Learning (RL). RL is a particularly well-suited paradigm for this task, as it can directly optimize the non-differentiable Spearman ranking metric and guide the reasoning process required by C-STS. However, we find that naively applying listwise RL fails to produce meaningful improvements, as the model is overwhelmed by complex, coarse-grained reward signals. To address this challenge, we introduce PoLi-RL, a novel Point-to-List Reinforcement Learning framework. PoLi-RL employs a two-stage curriculum: it first trains the model with simple pointwise rewards to establish fundamental scoring capabilities, then transitions to a hybrid reward that combines pointwise, pairwise, and listwise objectives to refine the model's ability to discern subtle semantic distinctions. Crucially, we propose an innovative Parallel Slice Ranking Reward (PSRR) mechanism that computes ranking rewards in parallel slices, where each slice comprises same-indexed completions from different samples. This provides a precise, differentiated learning signal for each individual completion, enabling granular credit assignment and effective optimization. On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture. As the first work to successfully apply RL to C-STS, our study introduces a powerful and precise paradigm for training LLMs on complex, ranking-based conditional judgment tasks.
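The slice idea in the abstract can be illustrated with a minimal sketch. It assumes a batch of N samples with G completions each, so that predicted scores form an (N, G) array and column j is one "slice" of same-indexed completions from different samples. The abstract does not give the PSRR formula, so the per-completion reward used here (one minus the normalized gap between predicted and gold rank within the slice) is a hypothetical stand-in, and the names `average_ranks`, `spearman`, and `slice_ranking_rewards` are illustrative, not the authors' code.

```python
import numpy as np


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def spearman(pred, gold):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(average_ranks(pred), average_ranks(gold))[0, 1])


def slice_ranking_rewards(pred, gold):
    """Hypothetical per-completion ranking reward, computed slice by slice.

    pred: (N, G) predicted scores, N samples x G completions (N >= 2).
    gold: (N,) gold similarity labels.
    Returns an (N, G) array of rewards in [0, 1].
    """
    pred = np.asarray(pred, dtype=float)
    n, g = pred.shape
    gold_ranks = average_ranks(gold)
    rewards = np.empty_like(pred)
    for j in range(g):
        # Slice j: the j-th completion of every sample in the batch.
        pred_ranks = average_ranks(pred[:, j])
        # Assumed reward: 1 when a completion's rank matches the gold rank,
        # falling linearly to 0 at the maximum possible rank gap (n - 1).
        rewards[:, j] = 1.0 - np.abs(pred_ranks - gold_ranks) / (n - 1)
    return rewards


if __name__ == "__main__":
    gold = [1, 3, 5]
    pred = [[1, 5], [3, 3], [5, 1]]  # slice 0 ranks correctly, slice 1 is reversed
    print(slice_ranking_rewards(pred, gold))
    print(spearman([row[0] for row in pred], gold))
```

In this toy batch the first slice orders the three samples correctly and the second reverses them, so completions in the first slice receive higher rewards than most of their counterparts in the second. Because each slice is ranked independently, completions of the same sample can receive different rewards, which is the kind of per-completion signal the abstract describes.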

Related papers