Direct Multi-Turn Preference Optimization for Language Agents
- URL: http://arxiv.org/abs/2406.14868v5
- Date: Sun, 23 Feb 2025 20:00:36 GMT
- Title: Direct Multi-Turn Preference Optimization for Language Agents
- Authors: Wentao Shi, Mengqi Yuan, Junkang Wu, Qifan Wang, Fuli Feng,
- Abstract summary: Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation because it alleviates compounding errors. Applying DPO to multi-turn tasks, however, presents challenges due to the inability to cancel the partition function.
- Score: 44.02877245158347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation because it alleviates compounding errors, offering a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dispreferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks, with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of the DMPO loss. The code is available at https://github.com/swt-user/DMPO.
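The core recipe above, a DPO-style Bradley-Terry comparison with trajectory-level length normalization, can be sketched as follows. This is a minimal illustration assuming summed per-token log-probabilities per trajectory and a scaling parameter beta; the exact DMPO objective, including the occupancy-measure constraint, is given in the paper and linked repository.

```python
import torch
import torch.nn.functional as F

def length_normalized_bt_loss(logp_pref, logp_dispref,
                              ref_logp_pref, ref_logp_dispref,
                              len_pref, len_dispref, beta=0.1):
    """DPO-style Bradley-Terry loss with per-trajectory length
    normalization, sketched from the abstract; the exact DMPO
    objective may differ (see the paper and repo)."""
    # Summed per-token log-probs of each multi-turn trajectory,
    # normalized by trajectory length so long and short trajectories
    # are compared on the same scale.
    pref_score = (logp_pref - ref_logp_pref) / len_pref
    dispref_score = (logp_dispref - ref_logp_dispref) / len_dispref
    # Standard Bradley-Terry log-sigmoid of the score margin.
    return -F.logsigmoid(beta * (pref_score - dispref_score)).mean()
```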
Related papers
- AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering [52.67783579040657]
AceGRPO is a machine learning system that prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines.
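One generic way to instantiate "prioritizing tasks at the learning frontier" is to upweight tasks whose current success rate sits near one half; this is an illustrative assumption, not AceGRPO's published curriculum.

```python
import numpy as np

def frontier_sampling_weights(success_rates, temperature=1.0):
    """Weight tasks by proximity to the learning frontier (success
    rate near 0.5); a generic sketch, not AceGRPO's actual scheme."""
    s = np.asarray(success_rates, dtype=float)
    learnability = s * (1.0 - s)      # peaks at 0.5, zero at 0 and 1
    w = np.exp(learnability / temperature)
    return w / w.sum()                # task-sampling distribution
```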
arXiv Detail & Related papers (2026-02-08T10:55:03Z) - DLLM Agent: See Farther, Run Faster [94.74432470237817]
Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties. We study this in a controlled setting by instantiating DLLM and AR backbones within the same agent workflow. We find that DLLM agents are on average over 30% faster end to end than AR agents, with some cases exceeding an 8x speedup.
arXiv Detail & Related papers (2026-02-07T09:01:18Z) - PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning [50.81994347448835]
We propose PEARL, a reinforcement-learning framework that augments a language agent with an external memory module and an optimized round-wise reward design. Experiments on CalBench show that PEARL achieves an error reduction rate of 0.76 and lowers the average error rate by 55% relative to the strongest baseline.
arXiv Detail & Related papers (2026-01-17T08:19:18Z) - Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with an Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7× compared to strong prompting and uncertainty-based baselines.
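EVPI itself has a compact textbook form: the gain from acting after the uncertainty is resolved versus acting now. A toy instantiation for one uncertain tool-call parameter follows; the belief and utility representations are illustrative assumptions, and SAGE-Agent's joint POMDP with aspect-based costs is richer.

```python
def evpi(belief, utility):
    """Expected Value of Perfect Information for one uncertain
    tool-call parameter.

    belief:  dict value -> probability (current posterior)
    utility: func(action, true_value) -> float
    """
    actions = list(belief)

    # Acting now: pick the single action best in expectation.
    act_now = max(sum(p * utility(a, v) for v, p in belief.items())
                  for a in actions)

    # Perfect information: for each possible true value, act optimally.
    act_informed = sum(p * max(utility(a, v) for a in actions)
                       for v, p in belief.items())

    # Ask the clarification question only if this exceeds its cost.
    return act_informed - act_now
```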
arXiv Detail & Related papers (2025-11-11T21:50:44Z) - ActiveVLN: Towards Active Exploration via Multi-Turn RL in Vision-and-Language Navigation [57.399685080574756]
Existing MLLM-based VLN methods rely on imitation learning (IL) and often use DAgger for post-training. We propose ActiveVLN, a VLN framework that explicitly enables active exploration through multi-turn RL. Experiments show that ActiveVLN achieves the largest performance gains over IL baselines compared to both DAgger-based and prior RL-based post-training methods.
arXiv Detail & Related papers (2025-09-16T03:31:46Z) - Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning [11.045096250408067]
Tree of Agents (TOA) is a multi-agent reasoning framework that segments the input into chunks processed by independent agents. TOA enables agents to probe different reasoning orders for multi-perspective understanding. To improve processing efficiency, we incorporate prefix-hash caching and adaptive pruning strategies.
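Prefix-hash caching in this setting amounts to memoizing per-prefix agent outputs so that different reasoning orders reuse shared work. A generic sketch, with the key construction and cache interface as assumptions rather than TOA's implementation:

```python
import hashlib

class PrefixCache:
    """Memoize results keyed by a hash of the chunk prefix; a generic
    sketch of the prefix-hash caching idea, not TOA's exact code."""

    def __init__(self):
        self._store = {}

    def _key(self, prefix_chunks):
        h = hashlib.sha256()
        for chunk in prefix_chunks:
            h.update(chunk.encode("utf-8"))
            h.update(b"\x00")  # separator avoids boundary collisions
        return h.hexdigest()

    def get_or_compute(self, prefix_chunks, compute):
        k = self._key(prefix_chunks)
        if k not in self._store:
            self._store[k] = compute(prefix_chunks)
        return self._store[k]
```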
arXiv Detail & Related papers (2025-09-08T08:34:02Z) - Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z) - Continual Optimization with Symmetry Teleportation for Multi-Task Learning [73.28772872740744]
Multi-task learning (MTL) enables the simultaneous learning of multiple tasks using a single model.
We propose a novel approach based on Continual Optimization with Symmetry Teleportation (COST).
COST seeks an alternative loss-equivalent point on the loss landscape to reduce conflicting gradients.
arXiv Detail & Related papers (2025-03-06T02:58:09Z) - Length-Controlled Margin-Based Preference Optimization without Reference Model [11.878496378814045]
We propose Length-Controlled Margin-Based Preference Optimization (LMPO) for preference-based reinforcement learning.
A key innovation of LMPO lies in its Length-Controlled Margin-Based loss function, integrated within the Bradley-Terry framework.
Our experimental results demonstrate that LMPO effectively controls response length, reduces probability degradation, and outperforms existing approaches.
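A plausible shape for such a loss, assuming a reference-model-free average-log-probability score and a fixed target margin (both illustrative assumptions; the paper defines the actual form):

```python
import torch.nn.functional as F

def lmpo_style_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    """Reference-free, length-controlled margin loss in the
    Bradley-Terry framework, sketched from the abstract; LMPO's exact
    loss and hyperparameters may differ."""
    # Average per-token log-probability controls for response length.
    s_w = logp_w / len_w
    s_l = logp_l / len_l
    # A fixed margin gamma sharpens the separation between the chosen
    # and rejected responses before the log-sigmoid.
    return -F.logsigmoid(beta * (s_w - s_l) - gamma).mean()
```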
arXiv Detail & Related papers (2025-02-20T15:30:27Z) - Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents [5.566936703366701]
Division-of-Thoughts (DoT) is a collaborative reasoning framework leveraging the synergy between local and cloud-based language models.
DoT reduces the average reasoning time and API costs by 66.12% and 83.57%, respectively, while achieving reasoning accuracy comparable to the best baseline methods.
arXiv Detail & Related papers (2025-02-06T02:40:25Z) - Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization [50.485788083202124]
Reinforcement Learning (RL) plays a crucial role in aligning large language models with human preferences and improving their ability to perform complex tasks.
We introduce Direct Q-function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model.
Experimental results on two math problem-solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement learning approach for aligning language models.
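The abstract's core move, reading a Q-function off the language model and training it with soft (entropy-regularized) Bellman targets, can be sketched as a one-step regression. The per-token parameterization and hyperparameters below are assumptions; DQO's full SAC machinery differs.

```python
import torch
import torch.nn.functional as F

def soft_q_loss(q_logits, actions, rewards, next_q_logits, done,
                beta=1.0, gamma=1.0):
    """One-step soft Bellman regression for a Q-function given by the
    LM's token logits, in the spirit of DQO (a sketch, not the paper's
    exact objective).

    q_logits:      (B, V) Q-values over the vocabulary at step t
    actions:       (B,)   token ids chosen at step t
    rewards:       (B,)   per-step reward (often 0 until the end)
    next_q_logits: (B, V) Q-values at step t+1
    done:          (B,)   1.0 at terminal steps
    """
    q_sa = q_logits.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Soft value: beta * logsumexp(Q / beta), the entropy-regularized max.
    v_next = beta * torch.logsumexp(next_q_logits / beta, dim=1)
    target = rewards + gamma * (1.0 - done) * v_next
    return F.mse_loss(q_sa, target.detach())
```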
arXiv Detail & Related papers (2024-10-11T23:29:20Z) - StablePrompt: Automatic Prompt Tuning using Reinforcement Learning for Large Language Models [21.556184207901115]
Reinforcement Learning (RL) is widely used for prompt tuning, but its inherent instability and environmental dependency make it difficult to use in practice.
We propose StablePrompt, which strikes a balance between training stability and search space, mitigating the instability of RL and producing high-performance prompts.
arXiv Detail & Related papers (2024-10-10T06:35:51Z) - Pre-trained Language Models Improve the Few-shot Prompt Ability of Decision Transformer [10.338170161831496]
Decision Transformer (DT) has emerged as a promising class of algorithms in offline reinforcement learning (RL) tasks.
We introduce the Language model-initialized Prompt Decision Transformer (LPDT), which leverages pre-trained language models for meta-RL tasks and fine-tunes the model using Low-rank Adaptation (LoRA).
Our approach integrates pre-trained language models and RL tasks seamlessly.
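For concreteness, attaching LoRA adapters to a pre-trained causal LM with the Hugging Face peft library looks roughly like this; the base model name and hyperparameters are placeholders, not the paper's settings.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; LPDT's actual backbone may differ.
model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    r=8,                        # low-rank dimension
    lora_alpha=16,              # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapters are trainable
```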
arXiv Detail & Related papers (2024-08-02T17:25:34Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
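The combined objective the abstract describes can be written in a minimal DPO-plus-SFT form; the weighting lam and the exact margin construction below are assumptions.

```python
import torch.nn.functional as F

def rpo_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                   beta=0.1, lam=1.0):
    """Preference optimization loss plus an SFT (negative
    log-likelihood) term on the preferred response; a sketch of the
    combined objective, not the paper's exact formulation."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    pref_loss = -F.logsigmoid(margin).mean()
    sft_loss = -logp_w.mean()  # supervised term regularizes the policy
    return pref_loss + lam * sft_loss
```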
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better across various preference data, regardless of data scarcity or abundance.
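One simple way to use several reference models in a DPO-style loss is to replace the single reference log-probability with a weighted aggregate (a weighted arithmetic mean of log-probs, i.e., a geometric mixture of the reference policies). This is a sketch; MRPO's closed-form aggregation may differ.

```python
import torch
import torch.nn.functional as F

def multi_reference_dpo_loss(logp_w, logp_l, ref_logps_w, ref_logps_l,
                             weights=None, beta=0.1):
    """DPO with K reference models aggregated into one baseline;
    illustrative only, not MRPO's exact closed form.

    ref_logps_w / ref_logps_l: (K, B) log-probs from K reference models.
    """
    K = ref_logps_w.shape[0]
    if weights is None:
        weights = torch.full((K,), 1.0 / K)  # uniform mixture by default
    ref_w = (weights[:, None] * ref_logps_w).sum(0)
    ref_l = (weights[:, None] * ref_logps_l).sum(0)
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -F.logsigmoid(margin).mean()
```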
arXiv Detail & Related papers (2024-05-26T00:29:04Z) - M-HOF-Opt: Multi-Objective Hierarchical Output Feedback Optimization via Multiplier Induced Loss Landscape Scheduling [4.499391876093543]
We address the online choice of weight multipliers for multi-objective optimization of many loss terms parameterized by neural networks.
Our method is multiplier-free and operates at the timescale of epochs.
It also circumvents the excessive memory requirements and heavy computational burden of existing multi-objective deep learning methods.
arXiv Detail & Related papers (2024-03-20T16:38:26Z) - Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model [86.9619638550683]
Vision-language foundation models have exhibited remarkable success across a multitude of downstream tasks due to their scalability on extensive image-text paired data.
However, these models display significant limitations when applied to downstream tasks, such as fine-grained image classification, as a result of "decision shortcuts".
arXiv Detail & Related papers (2024-03-01T09:01:53Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
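A token-level, entropy-augmented policy-gradient loss gives the flavor of optimizing an LLM at this granularity. The per-token advantage estimates and entropy coefficient here are assumptions; ETPO's actual per-token credit assignment is more involved.

```python
import torch

def token_level_entropy_reg_pg(logits, actions, advantages,
                               ent_coef=0.01):
    """Per-token policy gradient with an entropy bonus; a sketch in
    the spirit of token-level entropy-regularized optimization, not
    ETPO's exact objective.

    logits:     (B, T, V) policy logits over the vocabulary
    actions:    (B, T)    generated token ids
    advantages: (B, T)    per-token advantage estimates
    """
    logp = torch.log_softmax(logits, dim=-1)
    logp_a = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg = -(advantages.detach() * logp_a).mean()
    # Entropy bonus keeps token-level exploration alive.
    entropy = -(logp.exp() * logp).sum(-1).mean()
    return pg - ent_coef * entropy
```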
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints [26.274786600234876]
The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but amplify safety concerns.
RLHF has emerged as a promising pathway towards AI alignment but brings forth challenges due to its complexity and dependence on a separate reward model.
DPO has been proposed as an alternative, and it remains equivalent to RLHF under the reverse KL regularization constraint.
We show that under certain f-divergences, including Jensen-Shannon divergence, forward KL divergences and α-divergences, the complex relationship between the reward and optimal policy can also be simplified.
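Concretely, the simplification the abstract refers to can be stated as follows; this is reconstructed from the abstract, and the paper spells out the regularity conditions on f.

```latex
% Reward implicitly defined by the optimal policy under an
% f-divergence constraint D_f(\pi \| \pi_{\text{ref}});
% \mathrm{const}(x) absorbs the partition function.
r(x, y) \;=\; \beta \, f'\!\left( \frac{\pi^{*}(y \mid x)}
    {\pi_{\text{ref}}(y \mid x)} \right) \;+\; \mathrm{const}(x)

% Reverse KL corresponds to f(u) = u \log u, so f'(u) = \log u + 1,
% recovering the standard DPO relation
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}
    {\pi_{\text{ref}}(y \mid x)} \;+\; \mathrm{const}(x).
```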
arXiv Detail & Related papers (2023-09-28T08:29:44Z) - Prompt-Tuning Decision Transformer with Preference Ranking [83.76329715043205]
We propose the Prompt-Tuning DT algorithm to address challenges by using trajectory segments as prompts to guide RL agents in acquiring environmental information.
Our approach involves randomly sampling from a Gaussian distribution to fine-tune the elements of the prompt trajectory and using a preference ranking function to find the optimization direction.
Our work contributes to the advancement of prompt-tuning approaches in RL, providing a promising direction for optimizing large RL agents for specific preference tasks.
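A loose, evolution-strategies-style rendering of "Gaussian perturbation plus preference ranking" is sketched below; the update rule, population size, and scoring interface are all assumptions, not the paper's algorithm.

```python
import numpy as np

def prompt_tune_step(prompt, preference_rank,
                     pop_size=8, sigma=0.05, lr=0.1):
    """One black-box update of a prompt trajectory: sample Gaussian
    perturbations, rank them by preference, and move toward the
    preferred direction (an ES-style sketch, not the paper's method).

    prompt:          np.ndarray, flattened prompt-trajectory elements
    preference_rank: func(list of candidates) -> scores (higher wins)
    """
    noise = np.random.randn(pop_size, prompt.size) * sigma
    candidates = prompt[None, :] + noise
    scores = np.asarray(preference_rank(list(candidates)), dtype=float)
    # Weight perturbations by normalized preference scores.
    w = (scores - scores.mean()) / (scores.std() + 1e-8)
    return prompt + lr * (w @ noise) / (pop_size * sigma)
```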
arXiv Detail & Related papers (2023-05-16T17:49:04Z) - Generalizing LTL Instructions via Future Dependent Options [7.8578244861940725]
This paper proposes a novel multi-task algorithm with improved learning efficiency and optimality.
In order to propagate the rewards of satisfying future subgoals back more efficiently, we propose to train a multi-step function conditioned on the subgoal sequence.
In experiments on three different domains, we evaluate the generalization capability of the agent trained by the proposed algorithm.
arXiv Detail & Related papers (2022-12-08T21:44:18Z)