VORTEX: Aligning Task Utility and Human Preferences through LLM-Guided Reward Shaping
- URL: http://arxiv.org/abs/2509.16399v1
- Date: Fri, 19 Sep 2025 20:22:13 GMT
- Title: VORTEX: Aligning Task Utility and Human Preferences through LLM-Guided Reward Shaping
- Authors: Guojun Xiong, Milind Tambe
- Abstract summary: In social impact optimization, AI decision systems often rely on solvers that optimize well-calibrated mathematical objectives, but these solvers cannot directly accommodate evolving human preferences. Recent approaches address this by using large language models to generate new reward functions from preference descriptions. We propose \texttt{VORTEX}, a language-guided reward shaping framework that preserves established optimization goals while adaptively incorporating human feedback.
- Score: 40.48402462300208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In social impact optimization, AI decision systems often rely on solvers that optimize well-calibrated mathematical objectives. However, these solvers cannot directly accommodate evolving human preferences, typically expressed in natural language rather than formal constraints. Recent approaches address this by using large language models (LLMs) to generate new reward functions from preference descriptions. While flexible, they risk sacrificing the system's core utility guarantees. In this paper, we propose \texttt{VORTEX}, a language-guided reward shaping framework that preserves established optimization goals while adaptively incorporating human feedback. By formalizing the problem as multi-objective optimization, we use LLMs to iteratively generate shaping rewards based on verbal reinforcement and text-gradient prompt updates. This allows stakeholders to steer decision behavior via natural language without modifying solvers or specifying trade-off weights. We provide theoretical guarantees that \texttt{VORTEX} converges to Pareto-optimal trade-offs between utility and preference satisfaction. Empirical results in real-world allocation tasks demonstrate that \texttt{VORTEX} outperforms baselines in satisfying human-aligned coverage goals while maintaining high task performance. This work introduces a practical and theoretically grounded paradigm for human-AI collaborative optimization guided by natural language.
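The loop described in the abstract can be made concrete with a short sketch. The snippet below is a minimal, hypothetical illustration of language-guided reward shaping in the spirit of \texttt{VORTEX}: a fixed solver keeps optimizing the original task objective, an LLM proposes a shaping reward from verbal feedback, and the prompt itself is refined in a text-gradient style step. All interfaces here (`solver.solve`, `llm.propose_shaping_reward`, `llm.refine_prompt`, `feedback_fn`) are assumed names for illustration, not the paper's actual API, and the additive combination of task and shaping rewards is an assumption rather than the paper's formal multi-objective treatment.

```python
# Hypothetical sketch of an LLM-guided reward-shaping loop in the spirit of
# the VORTEX abstract: the base task objective stays fixed, while an LLM
# iteratively proposes a shaping reward from verbal feedback and the prompt
# is rewritten in a text-gradient style step. All names are illustrative.

def language_guided_shaping_loop(solver, llm, task_reward, feedback_fn, n_rounds=5):
    prompt = "Write a shaping reward that reflects the stakeholder preference."
    shaping_reward = lambda decision: 0.0   # start with no shaping term

    history = []
    for t in range(n_rounds):
        # 1. Solve the task with the combined objective; the solver itself is
        #    unchanged, only the reward it sees is augmented.
        decision = solver.solve(
            reward=lambda x: task_reward(x) + shaping_reward(x)
        )

        # 2. Collect verbal feedback on the resulting decision
        #    (e.g., "coverage of rural regions is still too low").
        feedback = feedback_fn(decision)

        # 3. Ask the LLM to revise the shaping reward, conditioning on the
        #    feedback and the running history (verbal reinforcement).
        shaping_reward = llm.propose_shaping_reward(prompt, decision, feedback, history)

        # 4. Text-gradient style update: the LLM critiques and rewrites the
        #    prompt itself so that later proposals improve.
        prompt = llm.refine_prompt(prompt, feedback)

        history.append((decision, feedback))

    return decision, shaping_reward
```

In this reading, stakeholders steer behavior purely through the natural-language feedback consumed in step 2, while the solver and the original task objective stay untouched.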
Related papers
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z) - LLMize: A Framework for Large Language Model-Based Numerical Optimization [0.0]
Large language models (LLMs) have recently shown strong reasoning capabilities beyond traditional language tasks. This paper presents LLMize, an open-source Python framework that enables LLM-driven optimization.
arXiv Detail & Related papers (2025-12-30T20:05:30Z) - LAPPI: Interactive Optimization with LLM-Assisted Preference-Based Problem Instantiation [6.8772471411888425]
We introduce LAPPI (LLM-Assisted Preference-based Problem Instantiation), an interactive approach that uses large language models (LLMs) to support users in the problem instantiation process. In a user study on trip planning, our method successfully captured user preferences and generated feasible plans that outperformed both conventional and prompt-engineering approaches.
arXiv Detail & Related papers (2025-12-16T06:43:38Z) - Bayesian Optimization in Language Space: An Eval-Efficient AI Self-Improvement Framework [0.0]
Large Language Models (LLMs) have recently enabled self-improving AI, i.e., AI that iteratively generates, evaluates, and refines its own outcomes. In many societal applications, the primary limitation is not generating new solutions but evaluating them. This paper overcomes this challenge by proving that combining the simple and widely used Best-of-N selection strategy with simple textual gradients statistically emulates the behavior of gradients on the canonical UCB acquisition function (a minimal Best-of-N sketch appears after this list).
arXiv Detail & Related papers (2025-11-15T07:04:44Z) - Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
arXiv Detail & Related papers (2025-10-11T18:11:09Z) - OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents [8.441638148384389]
We introduce OptimAI, a framework for solving optimization problems described in natural language. Our framework is built upon the following key roles: formulator, planner, coder, and code critic. Our approach attains 88.1% accuracy on the NLP4LP dataset and 82.3% on the Optibench dataset, reducing error rates by 58% and 52%, respectively, over prior best results.
arXiv Detail & Related papers (2025-04-23T17:45:05Z) - A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities. Their alignment with human values remains critical for ensuring helpful and harmless deployments. Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative to reinforcement learning from human feedback.
arXiv Detail & Related papers (2025-03-12T08:45:15Z) - Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z) - EmPO: Emotion Grounding for Empathetic Response Generation through Preference Optimization [9.934277461349696]
Empathetic response generation is a desirable aspect of conversational agents.
We propose a novel approach where we construct theory-driven preference datasets based on emotion grounding.
We show that LLMs can be aligned for empathetic response generation by preference optimization while retaining their general performance.
arXiv Detail & Related papers (2024-06-27T10:41:22Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - SEE: Strategic Exploration and Exploitation for Cohesive In-Context Prompt Optimization [8.975505323004427]
We propose a novel Cohesive In-Context Prompt Optimization framework for Large Language Models (LLMs). We introduce SEE, a scalable and efficient prompt optimization framework that adopts metaheuristic optimization principles and strategically balances exploration and exploitation. SEE significantly outperforms state-of-the-art baseline methods by a large margin, achieving an average performance gain of 13.94 while reducing computational costs by 58.67.
arXiv Detail & Related papers (2024-02-17T17:47:10Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world. Recent methods aim to mitigate misalignment by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Robust Prompt Optimization for Large Language Models Against Distribution Shifts [80.6757997074956]
Large Language Model (LLM) has demonstrated significant ability in various Natural Language Processing tasks.
We propose a new problem of robust prompt optimization for LLMs against distribution shifts.
This problem requires that a prompt optimized over a labeled source group also generalize to an unlabeled target group.
arXiv Detail & Related papers (2023-05-23T11:30:43Z) - Offline RL for Natural Language Generation with Implicit Language Q Learning [87.76695816348027]
Large language models can be inconsistent when it comes to completing user specified tasks.
We propose a novel RL method that combines the flexible utility framework of RL with the ability of supervised learning to leverage previously collected data.
In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings.
arXiv Detail & Related papers (2022-06-05T18:38:42Z)
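As referenced in the Bayesian-optimization-in-language-space entry above, Best-of-N selection is the primitive that the paper relates to a UCB-style acquisition rule. The sketch below shows only plain Best-of-N selection under a noisy evaluator; the `generate`/`evaluate` callables and the toy score are made up for illustration, and the sketch does not implement that paper's textual-gradient machinery.

```python
import random

def best_of_n(generate, evaluate, n=8):
    """Generate n candidate solutions and keep the one with the highest
    (possibly noisy) evaluation score -- the Best-of-N selection step that
    the entry above relates to a UCB-style acquisition rule."""
    candidates = [generate() for _ in range(n)]
    scores = [evaluate(c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx], scores[best_idx]

# Toy usage with a noisy evaluator (illustrative only):
if __name__ == "__main__":
    target = 0.7
    generate = lambda: random.random()                             # candidate "solution"
    evaluate = lambda x: -abs(x - target) + random.gauss(0, 0.05)  # noisy score
    best, score = best_of_n(generate, evaluate, n=16)
    print(f"best candidate {best:.3f} with noisy score {score:.3f}")
```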
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.