PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity
- URL: http://arxiv.org/abs/2510.04080v1
- Date: Sun, 05 Oct 2025 07:57:26 GMT
- Title: PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity
- Authors: Zixin Song, Bowen Zhang, Qian-Wen Zhang, Di Yin, Xing Sun, Chunping Li
- Abstract summary: We introduce PoLi-RL, a novel Point-to-List Reinforcement Learning framework. PoLi-RL trains a model with simple pointwise rewards to establish fundamental scoring capabilities. It then transitions to a hybrid reward that combines pointwise, pairwise, and listwise objectives to refine the model's ability to discern subtle semantic distinctions. On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture.
- Score: 22.289473489488955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conditional Semantic Textual Similarity (C-STS) measures the semantic proximity between text segments under a specific condition, thereby overcoming the ambiguity inherent in traditional STS. However, existing methods are largely confined to discriminative models, failing to fully integrate recent breakthroughs in the NLP community concerning Large Language Models (LLMs) and Reinforcement Learning (RL). RL is a particularly well-suited paradigm for this task, as it can directly optimize the non-differentiable Spearman ranking metric and guide the reasoning process required by C-STS. However, we find that naively applying listwise RL fails to produce meaningful improvements, as the model is overwhelmed by complex, coarse-grained reward signals. To address this challenge, we introduce PoLi-RL, a novel Point-to-List Reinforcement Learning framework. PoLi-RL employs a two-stage curriculum: it first trains the model with simple pointwise rewards to establish fundamental scoring capabilities, then transitions to a hybrid reward that combines pointwise, pairwise, and listwise objectives to refine the model's ability to discern subtle semantic distinctions. Crucially, we propose an innovative Parallel Slice Ranking Reward (PSRR) mechanism that computes ranking rewards in parallel slices, where each slice comprises same-indexed completions from different samples. This provides a precise, differentiated learning signal for each individual completion, enabling granular credit assignment and effective optimization. On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture. As the first work to successfully apply RL to C-STS, our study introduces a powerful and precise paradigm for training LLMs on complex, ranking-based conditional judgment tasks.
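The abstract describes the reward design concretely enough to sketch. Below is a minimal, illustrative Python sketch of the two-stage reward and the Parallel Slice Ranking Reward (PSRR) slicing idea as described above; it is not the authors' implementation. All names (`pointwise_reward`, `pairwise_reward`, `psrr`, `hybrid_reward`), the exact forms of the pointwise and pairwise terms, and the mixing weights are assumptions made for illustration.

```python
# Hypothetical sketch of PoLi-RL's reward stages, based only on the abstract.
import numpy as np
from scipy.stats import spearmanr

def pointwise_reward(pred: float, gold: float) -> float:
    """Stage-1 signal (assumed form): closeness of one score to its label."""
    return -abs(pred - gold)

def pairwise_reward(pred_a: float, pred_b: float,
                    gold_a: float, gold_b: float) -> float:
    """Stage-2 component (assumed form): 1 if a pair is ordered correctly."""
    return float((pred_a - pred_b) * (gold_a - gold_b) > 0)

def psrr(preds: np.ndarray, golds: np.ndarray) -> np.ndarray:
    """Parallel Slice Ranking Reward (sketch).

    preds: (num_samples, num_completions); row i holds the scores produced
           by the completions sampled for input i.
    golds: (num_samples,) gold similarity labels.

    Slice j gathers the j-th completion of every sample; each slice is
    ranked against the gold labels on its own, so completions at different
    indices receive differentiated listwise rewards instead of one coarse
    group-level signal.
    """
    _, num_completions = preds.shape
    rewards = np.zeros_like(preds, dtype=float)
    for j in range(num_completions):
        rho, _ = spearmanr(preds[:, j], golds)   # ranking quality of slice j
        rewards[:, j] = 0.0 if np.isnan(rho) else rho
    return rewards

def hybrid_reward(preds: np.ndarray, golds: np.ndarray,
                  w_point: float = 1.0, w_pair: float = 1.0,
                  w_list: float = 1.0) -> np.ndarray:
    """Stage-2 reward: weighted sum of pointwise, pairwise, listwise terms."""
    n, k = preds.shape
    point = -np.abs(preds - golds[:, None])      # per-completion pointwise term
    pair = np.zeros_like(preds, dtype=float)
    for j in range(k):                           # pairwise term within slice j
        for a in range(n):
            for b in range(a + 1, n):
                r = pairwise_reward(preds[a, j], preds[b, j],
                                    golds[a], golds[b])
                pair[a, j] += r
                pair[b, j] += r
        pair[:, j] /= max(n - 1, 1)              # fraction of correct pairs
    return w_point * point + w_pair * pair + w_list * psrr(preds, golds)
```

Under this reading, stage one would use only the pointwise term, and stage two would feed the hybrid reward into a standard policy-gradient update, with the slice construction ensuring that each of the k completions per input receives its own ranking signal rather than one shared group reward.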
Related papers
- ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking [84.07076200941474]
ArenaRL is a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. We construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Experiments show that ArenaRL substantially outperforms standard RL baselines.
arXiv Detail & Related papers (2026-01-10T08:43:07Z) - Coupled Variational Reinforcement Learning for Language Model General Reasoning [83.82392089177841]
We propose Coupled Variational Reinforcement Learning (CoVRL) to bridge variational inference and reinforcement learning. CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines.
arXiv Detail & Related papers (2025-12-14T07:03:51Z) - LORE: A Large Generative Model for Search Relevance [23.808303249081117]
We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27% improvement in online GoodRate metrics.
arXiv Detail & Related papers (2025-12-02T18:50:42Z) - RL-AD-Net: Reinforcement Learning Guided Adaptive Displacement in Latent Space for Refined Point Cloud Completion [9.252819624397405]
We propose RL-AD-Net, a reinforcement learning framework that operates in the latent space of a pretrained point autoencoder. To ensure robustness, a lightweight non-parametric PointNN selector evaluates the geometric consistency of both the original completion and the RL-refined output. Experiments on ShapeNetCore-2048 demonstrate that while baseline completion networks perform reasonably under their training-style cropping, they struggle in random cropping scenarios.
arXiv Detail & Related papers (2025-11-21T08:55:55Z) - CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks [96.64597365827046]
We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks. We introduce a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. We show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks.
arXiv Detail & Related papers (2025-11-01T04:37:01Z) - RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs? [92.4931695205957]
We introduce DELTA-Code, a benchmark of synthetic coding problem families designed to probe two fundamental aspects: learnability and transferability. Our experiments reveal a striking grokking phase transition: after an extended period with near-zero reward, RL-trained models abruptly climb to near-perfect accuracy. To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop.
arXiv Detail & Related papers (2025-09-25T11:20:56Z) - CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity [20.349897901019574]
We introduce CoDiEmb, a unified framework for training text embeddings. CoDiEmb integrates three key innovations for effective joint optimization. Our results and analysis demonstrate that the framework mitigates cross-task trade-offs.
arXiv Detail & Related papers (2025-08-15T12:46:35Z) - Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z) - Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning [26.835266813794316]
We first propose CLS-RL for MLLM image classification, using verifiable rewards for fine-tuning. We then revisit whether explicit thinking in RFT is always necessary. No-Thinking-RL explores RFT without thinking by introducing a simple equality accuracy reward.
arXiv Detail & Related papers (2025-03-20T14:37:45Z) - SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks [110.20297293596005]
Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks. Existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs. We propose a novel RL algorithm, SWEET-RL, that uses a carefully designed optimization objective to train a critic model with access to additional training-time information. Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms.
arXiv Detail & Related papers (2025-03-19T17:55:08Z) - On the Emergence of Thinking in LLMs I: Searching for the Right Intuition [34.32871896067864]
We propose a post-training framework called Reinforcement Learning via Self-Play (RLSP). RLSP involves three steps: supervised fine-tuning with human or synthetic demonstrations of the reasoning process, using an exploration reward signal to encourage diverse and efficient reasoning behaviors, and RL training with an outcome verifier to ensure correctness while preventing reward hacking. Empirical studies in the math domain show that RLSP improves reasoning.
arXiv Detail & Related papers (2025-02-10T18:52:04Z) - SD-LoRA: Scalable Decoupled Low-Rank Adaptation for Class Incremental Learning [73.93639228235622]
Continual Learning with foundation models has emerged as a promising paradigm to exploit abundant knowledge acquired during pre-training for tackling sequential tasks. Existing prompt-based and Low-Rank Adaptation-based (LoRA-based) methods often require expanding a prompt/LoRA pool or retaining samples of previous tasks. We propose Scalable Decoupled LoRA (SD-LoRA) for class incremental learning, which continually separates the learning of the magnitude and direction of LoRA components without rehearsal.
arXiv Detail & Related papers (2025-01-22T20:00:41Z)