Related papers: BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning

BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning

URL: http://arxiv.org/abs/2510.26374v2
Date: Thu, 06 Nov 2025 09:27:20 GMT
Title: BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning
Authors: Qianli Shen, Daoyuan Chen, Yilun Huang, Zhenqing Ling, Yaliang Li, Bolin Ding, Jingren Zhou,
Abstract summary: Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning.<n>We introduce BOTS, a unified framework for Bayesian Online Task Selection in RFT reinforcement finetuning.
Score: 82.925106913459
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates explicit evidence from direct evaluations of selected tasks and implicit evidence inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.

Related papers

Don't Just Fine-tune the Agent, Tune the Environment [25.7349297100143]
Supervised fine-tuning on synthetic data leads to overfitting.<n>Standard reinforcement learning struggles with a critical cold-start problem and training instability.<n>Our work presents a paradigm shift from supervised fine-tuning on static trajectories to dynamic, environment-based exploration.
arXiv Detail & Related papers (2025-10-11T12:35:15Z)
HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness [49.72591739116668]
Reinforcement Learning (RL) has become a key driver for enhancing the long chain-of-thought (CoT) reasoning capabilities of Large Language Models (LLMs)<n>However, prevalent methods like GRPO often fail when task difficulty exceeds the model's capacity, leading to reward sparsity and inefficient training.<n>We propose HINT: Helping Ineffective rollouts Navigate Towards effectiveness, an adaptive hinting framework.
arXiv Detail & Related papers (2025-10-10T13:42:03Z)
Learning Task Representations from In-Context Learning [73.72066284711462]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning.<n>We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads.<n>We show that our method's effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
arXiv Detail & Related papers (2025-02-08T00:16:44Z)
Prompt Tuning Decision Transformers with Structured and Scalable Bandits [4.460057058209513]
We propose a bandit-based prompt-tuning method that learns to construct optimal trajectory prompts from demonstration data at inference time.<n>We show that our method consistently enhances performance across a wide range of tasks, high-dimensional environments, and out-of-distribution scenarios.
arXiv Detail & Related papers (2025-02-07T14:57:17Z)
Model Predictive Task Sampling for Efficient and Robust Adaptation [57.414812940406996]
We introduce Model Predictive Task Sampling (MPTS), a framework that bridges the task space and adaptation risk distributions.<n>MPTS employs a generative model to characterize the episodic optimization process and predicts task-specific adaptation risk via posterior inference.<n>MPTS seamlessly integrates into zero-shot, few-shot, and supervised finetuning settings.
arXiv Detail & Related papers (2025-01-19T13:14:53Z)
Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data. For the first time, we reveal two major challenges hindering their practical deployments: Task-Distribution Shift ( TDS) and Task-Distribution Corruption (TDC)
arXiv Detail & Related papers (2023-11-23T15:46:54Z)
Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks [101.40633115037983]
Instruction tuning (IT) achieves impressive zero-shot generalization results by training large language models (LLMs) on a massive amount of diverse tasks with instructions. How to select new tasks to improve the performance and generalizability of IT models remains an open question. We propose active instruction tuning based on prompt uncertainty, a novel framework to identify informative tasks, and then actively tune the models on the selected tasks.
arXiv Detail & Related papers (2023-11-01T04:40:05Z)
MALIBO: Meta-learning for Likelihood-free Bayesian Optimization [11.097684270692355]
We propose a novel meta-learning BO approach that bypasses the surrogate model and directly learns the utility of queries across tasks. Our method explicitly models task uncertainty and includes an auxiliary model to enable robust adaptation to new tasks. Our method demonstrates strong anytime performance and outperforms state-of-the-art meta-learning BO methods in various benchmarks.
arXiv Detail & Related papers (2023-07-07T12:57:10Z)
Semantically Aligned Task Decomposition in Multi-Agent Reinforcement Learning [56.26889258704261]
We propose a novel "disentangled" decision-making method, Semantically Aligned task decomposition in MARL (SAMA) SAMA prompts pretrained language models with chain-of-thought that can suggest potential goals, provide suitable goal decomposition and subgoal allocation as well as self-reflection-based replanning. SAMA demonstrates considerable advantages in sample efficiency compared to state-of-the-art ASG methods.
arXiv Detail & Related papers (2023-05-18T10:37:54Z)
Meta-Reinforcement Learning Based on Self-Supervised Task Representation Learning [23.45043290237396]
MoSS is a context-based Meta-reinforcement learning algorithm based on Self-Supervised task representation learning. On MuJoCo and Meta-World benchmarks, MoSS outperforms prior in terms of performance, sample efficiency (3-50x faster), adaptation efficiency, and generalization.
arXiv Detail & Related papers (2023-04-29T15:46:19Z)
Task-guided Disentangled Tuning for Pretrained Language Models [16.429787408467703]
We propose Task-guided Disentangled Tuning (TDT) for pretrained language models (PLMs) TDT enhances the generalization of representations by disentangling task-relevant signals from entangled representations. Experimental results on GLUE and CLUE benchmarks show that TDT gives consistently better results than fine-tuning with different PLMs.
arXiv Detail & Related papers (2022-03-22T03:11:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.