Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach
- URL: http://arxiv.org/abs/2511.04393v1
- Date: Thu, 06 Nov 2025 14:21:22 GMT
- Title: Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach
- Authors: Chanwoo Park, Ziyang Chen, Asuman Ozdaglar, Kaiqing Zhang
- Abstract summary: Iterative Regret-Minimization Fine-Tuning is a post-training procedure that distills low-regret decision trajectories back into the base model. This reliance on model-generated reasoning avoids rigid output engineering and provides more flexible, natural-language training signals. Iterative RMFT improves LLMs' DM performance across diverse models.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) are increasingly deployed as "agents" for decision-making (DM) in interactive and dynamic environments. Yet, since they were not originally designed for DM, recent studies show that LLMs can struggle even in basic online DM problems, failing to achieve low regret or an effective exploration-exploitation tradeoff. To address this, we introduce Iterative Regret-Minimization Fine-Tuning (Iterative RMFT), a post-training procedure that repeatedly distills low-regret decision trajectories back into the base model. At each iteration, the model rolls out multiple decision trajectories, selects the k-lowest regret ones, and fine-tunes itself on them. Unlike prior methods that (a) distill action sequences from known DM algorithms or (b) rely on manually crafted chain-of-thought templates, our approach leverages the regret metric to elicit the model's own DM ability and reasoning rationales. This reliance on model-generated reasoning avoids rigid output engineering and provides more flexible, natural-language training signals. Empirical results show that Iterative RMFT improves LLMs' DM performance across diverse models - from Transformers with numerical input/output, to open-weight LLMs, and advanced closed-weight models like GPT-4o mini. Its flexibility in output and reasoning formats enables generalization across tasks with varying horizons, action spaces, reward processes, and natural-language contexts. Finally, we provide theoretical insight showing that a single-layer Transformer under this paradigm can act as a no-regret learner in a simplified setting. Overall, Iterative RMFT offers a principled and general post-training framework for enhancing LLMs' decision-making capabilities.
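The core loop described in the abstract (roll out multiple trajectories, keep the k lowest-regret ones, fine-tune on them, repeat) can be illustrated with a toy sketch. The snippet below is not the paper's implementation: it replaces the LLM with a pseudo-count policy on a Gaussian multi-armed bandit, and "fine-tuning" is mimicked by adding the selected trajectories' actions to those counts. All names (`iterative_rmft`, `policy`, `rollout`) and parameter choices are illustrative assumptions.

```python
import random

def iterative_rmft(arm_means, horizon=50, n_rollouts=32, k=4, iterations=3, seed=0):
    """Toy sketch of the Iterative RMFT loop on a Gaussian bandit.

    Each iteration: roll out n_rollouts trajectories, rank them by
    cumulative regret, and 'fine-tune' on the k lowest-regret ones
    (here, pseudo-counts stand in for the model's weights)."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [1.0] * n_arms          # stand-in for the model's action preferences
    best = max(arm_means)            # expected reward of the optimal arm

    def policy():
        # Sample an arm with probability proportional to its pseudo-count.
        r, acc = rng.random() * sum(counts), 0.0
        for a, c in enumerate(counts):
            acc += c
            if r <= acc:
                return a
        return n_arms - 1

    def rollout():
        # Play one trajectory; return the action sequence and realized reward.
        actions, total = [], 0.0
        for _ in range(horizon):
            a = policy()
            actions.append(a)
            total += rng.gauss(arm_means[a], 0.1)
        return actions, total

    for _ in range(iterations):
        batch = [rollout() for _ in range(n_rollouts)]
        # Regret = expected reward of always playing the best arm minus realized reward.
        batch.sort(key=lambda t: horizon * best - t[1])
        for actions, _ in batch[:k]:  # distill the k lowest-regret trajectories
            for a in actions:
                counts[a] += 1.0
    return counts
```

Because low-regret trajectories over-represent the better arm, the selection step biases the next round's policy toward it, which is the self-distillation effect the paper exploits (with actual fine-tuning on natural-language reasoning traces in place of the count update).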
Related papers
- DecisionLLM: Large Language Models for Long Sequence Decision Exploration [26.033533195580933]
Large Language Models (LLMs) have demonstrated remarkable success in complex reasoning and planning tasks. This work investigates the application of LLMs to offline decision-making tasks. By learning to align trajectory data with natural-language task descriptions, the model can autoregressively predict future decisions.
arXiv Detail & Related papers (2026-01-15T07:42:02Z) - Directional Reasoning Injection for Fine-Tuning MLLMs [51.53222423215055]
Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning. We propose Directional Reasoning Injection for Fine-Tuning (DRIFT) to solve this problem.
arXiv Detail & Related papers (2025-10-16T18:06:46Z) - Feedback-Induced Performance Decline in LLM-Based Decision-Making [6.5990946334144756]
Large Language Models (LLMs) can extract context from natural-language problem descriptions. This paper studies the behaviour of these models within Markov decision processes (MDPs).
arXiv Detail & Related papers (2025-07-20T10:38:56Z) - Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning [95.44766931218896]
Multi-modal large language models (MLLMs) still lag behind text-based reasoning. We introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable. We propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO) to align the MLLM's perceptual output with the final reasoning task.
arXiv Detail & Related papers (2025-06-05T02:28:07Z) - Beyond Templates: Dynamic Adaptation of Reasoning Demonstrations via Feasibility-Aware Exploration [15.711365331854614]
We introduce Dynamic Adaptation of Reasoning Trajectories (DART), a novel data adaptation framework. Instead of uniformly imitating expert steps, DART employs a selective imitation strategy guided by step-wise adaptability estimation. We validate DART across multiple reasoning benchmarks and model scales, demonstrating that it significantly improves generalization and data efficiency.
arXiv Detail & Related papers (2025-05-27T04:08:11Z) - Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking [61.61356842567952]
We propose STeP, a novel method for improving LLM-based agent training. We synthesize self-reflected trajectories that include reflections and corrections of error steps. Experiments demonstrate that our method improves agent performance across three representative tasks.
arXiv Detail & Related papers (2025-05-26T14:11:12Z) - Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z) - Large Language Diffusion Models [93.26422905620008]
Large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We introduce LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning paradigm. Across extensive benchmarks on general tasks, math, code, and more, LLaDA demonstrates strong scalability and performs comparably to our self-constructed ARM baselines.
arXiv Detail & Related papers (2025-02-14T08:23:51Z) - From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM)-empowered agents can solve decision-making problems in the physical world.
Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting.
We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv Detail & Related papers (2024-05-30T09:42:54Z) - Mind's Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models [20.28989820878285]
Large language models (LLMs) have achieved remarkable advancements in natural language processing.
The massive scale and computational demands of these models present formidable challenges when considering their practical deployment in resource-constrained environments.
arXiv Detail & Related papers (2023-11-15T18:56:23Z)
This list is automatically generated from the titles and abstracts of the papers indexed on this site.
The site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.