GTA: Supervised-Guided Reinforcement Learning for Text Classification with Large Language Models
- URL: http://arxiv.org/abs/2509.12108v2
- Date: Tue, 16 Sep 2025 05:13:41 GMT
- Title: GTA: Supervised-Guided Reinforcement Learning for Text Classification with Large Language Models
- Authors: Min Zeng, Jingfei Sun, Xueyou Luo, Caiquan Liu, Shiqi Zhang, Li Xie, Xiaoxin Chen
- Abstract summary: In natural language processing tasks, pure reinforcement learning (RL) fine-tuning methods often suffer from inefficient exploration and slow convergence. We propose the Guess-Think-Answer (GTA) framework, which combines the efficiency of SFT with the capability gains of RL in a unified training paradigm. This hybrid approach achieves both faster convergence than pure RL and a higher performance ceiling than pure SFT.
- Score: 8.233245059144355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In natural language processing tasks, pure reinforcement learning (RL) fine-tuning methods often suffer from inefficient exploration and slow convergence, while supervised fine-tuning (SFT) methods, although efficient to train, have a limited performance ceiling and a less solid theoretical foundation than RL. To address this efficiency-capability trade-off, we propose the Guess-Think-Answer (GTA) framework, which combines the efficiency of SFT with the capability gains of RL in a unified training paradigm. GTA works by having the model first produce a provisional guess (optimized via cross-entropy loss), then reflect on this guess before generating the final answer, with RL rewards shaping both the final output and the format of the entire GTA structure. This hybrid approach achieves both faster convergence than pure RL and a higher performance ceiling than pure SFT. To mitigate gradient conflicts between the two training signals, we employ loss masking and gradient constraints. Empirical results on four text classification benchmarks demonstrate that GTA substantially accelerates convergence while outperforming both standalone SFT and RL baselines.
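The abstract describes the mechanism but the listing carries no code, so the following is a minimal sketch of how the hybrid GTA objective could be assembled, assuming a PyTorch-style causal LM whose output follows a guess/think/answer template. All names (gta_loss, guess_mask, the REINFORCE-style RL term) are illustrative assumptions, not the authors' implementation; the paper's actual RL algorithm, reward design, and gradient constraints are richer than shown here.

```python
# Illustrative sketch only -- not the authors' code. Assumes PyTorch.
import torch
import torch.nn.functional as F

def gta_loss(logits, target_ids, guess_mask, rollout_logprob,
             reward, baseline, sft_weight=1.0, rl_weight=1.0):
    """Hybrid GTA objective: an SFT signal on the provisional guess plus an
    RL signal on the whole guess/think/answer rollout.

    logits           (T, V) token logits for the generated sequence
    target_ids       (T,)   gold token ids (only used inside the guess span)
    guess_mask       (T,)   bool, True for tokens in the <guess> span;
                            this is the loss masking mentioned in the abstract
    rollout_logprob  scalar log-probability of the sampled rollout
    reward           scalar: answer correctness + GTA format compliance
    baseline         scalar variance-reduction term (e.g., batch mean reward)
    """
    # SFT term: cross-entropy restricted to the guess tokens via loss masking,
    # so the supervised signal does not fight the RL signal on other tokens.
    ce = F.cross_entropy(logits, target_ids, reduction="none")  # (T,)
    mask = guess_mask.float()
    sft_loss = (ce * mask).sum() / mask.sum().clamp(min=1.0)

    # RL term: a plain REINFORCE surrogate stands in for the paper's RL
    # method; the reward shapes both the final answer and the GTA format.
    rl_loss = -(reward - baseline) * rollout_logprob

    # The paper additionally applies gradient constraints (e.g., at optimizer
    # time) to reduce conflict between the two terms; omitted here.
    return sft_weight * sft_loss + rl_weight * rl_loss
```

A training step would compute this loss on each sampled rollout and backpropagate as usual; the clamp guards against sequences with an empty guess span.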
Related papers
- SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning [54.393763477932474]
Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). We propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective.
arXiv Detail & Related papers (2026-02-07T09:39:21Z)
- On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training [10.433802085981046]
Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). We show that RL increases SFT loss under SFT optimality and that SFT lowers the reward achieved by RL. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in post-training.
arXiv Detail & Related papers (2026-01-12T10:14:09Z)
- Trust-Region Adaptive Policy Optimization [82.09255251747818]
Post-training methods play an important role in improving large language models' (LLMs) complex reasoning abilities. We introduce TRAPO, a framework that interleaves Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) within each training instance. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines.
arXiv Detail & Related papers (2025-12-19T14:37:07Z)
- Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners [28.039145840787683]
Supervised fine-tuning (SFT) offers complementary benefits but typically requires large-scale data and risks overfitting. Recent attempts to combine SFT and RL face three main challenges: data inefficiency, algorithm-specific designs, and catastrophic forgetting. We propose a plug-and-play framework that dynamically integrates SFT into RL by selecting challenging examples for SFT.
arXiv Detail & Related papers (2025-10-06T03:01:14Z)
- Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning [36.06085913761571]
Reinforcement learning (RL) has proven effective in incentivizing the reasoning abilities of large language models (LLMs), but suffers from severe efficiency challenges due to its trial-and-error nature. This study introduces a novel method for learning reasoning models that employs bilevel optimization to facilitate better cooperation between these training paradigms.
arXiv Detail & Related papers (2025-09-08T17:58:02Z)
- Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [53.239242017802056]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing. We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z)
- The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs [66.17068546293487]
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones.
arXiv Detail & Related papers (2025-07-10T09:05:49Z)
- Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions [28.962415274754537]
Large language model (LLM) reasoning has shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). We introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and training alternates between the two (a generic sketch of this interleaving pattern appears after the list below).
arXiv Detail & Related papers (2025-06-09T08:11:20Z)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z)
- OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles [91.88062410741833]
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning. We show that OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
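Several entries above (TRAPO, ReLIFT, OpenVLThinker) share the idea of interleaving or alternating SFT and RL rather than running them as separate stages. As a rough, hypothetical illustration of that shared pattern only, and not the algorithm of any single paper listed here, the loop below defaults to RL updates and falls back to a supervised step on prompts where the policy earns (almost) no reward. Every function name is an assumed placeholder.

```python
# Generic interleaved SFT/RL post-training loop (illustrative placeholder;
# not the algorithm of any specific paper listed above).
import random

def interleaved_post_training(policy, prompts, reward_fn, fetch_solution,
                              rl_update, sft_update, num_steps=1000,
                              batch_size=8, hard_threshold=0.1):
    for _ in range(num_steps):
        batch = random.sample(prompts, batch_size)
        rollouts = [policy.generate(p) for p in batch]
        rewards = [reward_fn(p, r) for p, r in zip(batch, rollouts)]

        # Default path: a reinforcement-learning update on the rollouts.
        rl_update(policy, batch, rollouts, rewards)

        # Hard-question path: when exploration earns (almost) no reward,
        # pull in a high-quality reference solution and take an SFT step,
        # teaching the policy what RL alone cannot reach.
        hard = [p for p, r in zip(batch, rewards) if r < hard_threshold]
        if hard:
            demos = [(p, fetch_solution(p)) for p in hard]
            sft_update(policy, demos)
```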