Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models
- URL: http://arxiv.org/abs/2508.05165v1
- Date: Thu, 07 Aug 2025 08:54:27 GMT
- Title: Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models
- Authors: Mason Nakamura, Saaduddin Mahmud, Kyle H. Wray, Hamed Zamani, Shlomo Zilberstein
- Abstract summary: We propose HIA (Heuristic-Guided Inference-time Alignment), a tuning-free, black-box-compatible approach that uses a lightweight prompt optimizer, heuristic reward models, and two-stage filtering. We find that HIA is effective under low inference budgets with as few as one or two response queries.
- Score: 23.37504394417425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aligning LLMs with user preferences is crucial for real-world use but often requires costly fine-tuning or expensive inference, forcing trade-offs between alignment quality and computational cost. Existing inference-time methods typically ignore this balance, focusing solely on the optimized policy's performance. We propose HIA (Heuristic-Guided Inference-time Alignment), a tuning-free, black-box-compatible approach that uses a lightweight prompt optimizer, heuristic reward models, and two-stage filtering to reduce inference calls while preserving alignment quality. On real-world prompt datasets, HelpSteer and ComPRed, HIA outperforms best-of-N sampling, beam search, and greedy search baselines in multi-objective, goal-conditioned tasks under the same inference budget. We also find that HIA is effective under low inference budgets, with as few as one or two response queries, offering a practical solution for scalable, personalized LLM deployment.
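The abstract gives no pseudocode, but the pipeline it describes (a lightweight prompt optimizer, heuristic reward models, and two-stage filtering under a fixed inference budget) can be sketched roughly as below. This is a minimal sketch under assumed interfaces: the `prompt_optimizer`, `heuristic_rm`, `llm`, and `reward_model` callables and the `n_candidates`/`n_keep` parameters are illustrative stand-ins, not the authors' implementation.

```python
# Rough sketch of heuristic-guided inference-time alignment (assumed structure,
# not the authors' code): candidate prompts are first scored by a cheap
# heuristic reward model, and only the top survivors trigger LLM calls.

def heuristic_guided_alignment(user_prompt, goals, prompt_optimizer, heuristic_rm,
                               llm, reward_model, n_candidates=8, n_keep=2):
    # Stage 0: a lightweight prompt optimizer proposes candidate rewrites of the
    # user prompt, conditioned on the alignment goals (e.g., helpfulness targets).
    candidates = [prompt_optimizer(user_prompt, goals, seed=i) for i in range(n_candidates)]

    # Stage 1 (cheap filter): rank candidates with a heuristic reward model that
    # never queries the LLM, and keep only the most promising few.
    survivors = sorted(candidates, key=lambda p: heuristic_rm(p, goals), reverse=True)[:n_keep]

    # Stage 2 (expensive filter): query the LLM only for the survivors, so the
    # budget can be as low as one or two response queries per user prompt.
    responses = [llm(p) for p in survivors]

    # Return the response preferred by the goal-conditioned reward model.
    return max(responses, key=lambda r: reward_model(user_prompt, r, goals))


# Toy usage with stand-in components, just to illustrate the call signature.
if __name__ == "__main__":
    best = heuristic_guided_alignment(
        user_prompt="Explain RLHF briefly.",
        goals={"helpfulness": 1.0, "verbosity": 0.2},
        prompt_optimizer=lambda p, g, seed: f"{p} (style variant {seed})",
        heuristic_rm=lambda p, g: -g["verbosity"] * len(p),
        llm=lambda p: f"[response to: {p}]",
        reward_model=lambda p, r, g: g["helpfulness"] * len(r),
        n_candidates=4,
        n_keep=2,
    )
    print(best)
```

The point of the two-stage filter is that only the `n_keep` surviving prompts ever reach the LLM, which is how the response-query budget can stay as low as one or two.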
Related papers
- Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. $A^*$-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to 2$\times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z) - Dynamically Learned Test-Time Model Routing in Language Model Zoos with Service Level Guarantees [21.2175476090125]
Open-weight LLM zoos provide access to numerous high-quality models. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities. We introduce MESS+, an optimization algorithm for cost-optimal request routing.
arXiv Detail & Related papers (2025-05-26T13:11:08Z) - ABoN: Adaptive Best-of-N Alignment [19.22348775001393]
We propose a prompt-adaptive strategy for Best-of-N alignment that allocates inference-time compute more efficiently. Our method is simple, practical, and compatible with any LM/RM combination.
arXiv Detail & Related papers (2025-05-17T15:24:48Z) - DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models [50.32663816994459]
Diffusion-styled Preference Optimization (DiffPO) provides an efficient and policy-agnostic solution for aligning LLMs with humans. DiffPO avoids the time latency associated with token-level generation. Experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that DiffPO achieves superior alignment performance across various settings.
arXiv Detail & Related papers (2025-03-06T09:21:54Z) - Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD delivers significant efficiency gains over decoding with the target model only, while achieving significantly better accuracy than parallel decoding methods on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z) - Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models [54.381650481255235]
We introduce a new tuning-free approach for self-alignment, Dynamic Rewarding with Prompt Optimization (DRPO).
Our approach leverages a search-based optimization framework that allows LLMs to iteratively self-improve and craft the optimal alignment instructions.
Empirical evaluations on eight recent LLMs, both open and closed-sourced, demonstrate that DRPO significantly enhances alignment performance.
arXiv Detail & Related papers (2024-11-13T16:15:38Z) - Self-Steering Optimization: Autonomous Preference Optimization for Large Language Models [79.84205827056907]
We present Self-Steering Optimization ($SSO$), an algorithm that autonomously generates high-quality preference data. $SSO$ employs a specialized optimization objective to build a data generator from the policy model itself, which is used to produce accurate and on-policy data. Our evaluation shows that $SSO$ consistently outperforms baselines in human preference alignment and reward optimization.
arXiv Detail & Related papers (2024-10-22T16:04:03Z) - OptLLM: Optimal Assignment of Queries to Large Language Models [12.07164196530872]
We propose a framework for addressing the cost-effective query allocation problem for large language models (LLMs).
Our framework, named OptLLM, provides users with a range of optimal solutions to choose from, aligning with their budget constraints and performance preferences.
To evaluate the effectiveness of OptLLM, we conduct extensive experiments on various types of tasks, including text classification, question answering, sentiment analysis, reasoning, and log parsing.
arXiv Detail & Related papers (2024-05-24T01:05:37Z) - Value Augmented Sampling for Language Model Alignment and Personalization [39.070662999014836]
We present a new framework for reward optimization, Value Augmented Sampling (VAS).
VAS solves for the optimal reward-maximizing policy without co-training the policy and the value function.
Our algorithm unlocks the new capability of composing several rewards and controlling the extent of each one during deployment time.
arXiv Detail & Related papers (2024-05-10T17:59:04Z) - Active Preference Optimization for Sample Efficient RLHF [27.772423917657626]
Large Language Models (LLMs) are commonly aligned using Reinforcement Learning from Human Feedback (RLHF). We show that uniform sampling of contexts could lead to a policy that suffers a constant sub-optimality gap from the optimal policy. We propose an algorithm, $\texttt{APO}$, that iteratively collects preferences for the most uncertain contexts.
arXiv Detail & Related papers (2024-02-16T08:19:34Z) - Solving Multistage Stochastic Linear Programming via Regularized Linear Decision Rules: An Application to Hydrothermal Dispatch Planning [77.34726150561087]
We propose a novel regularization scheme for linear decision rules (LDR) based on AdaSO (adaptive least absolute shrinkage and selection operator).
Experiments show that the risk of overfitting is non-negligible when using the classical non-regularized LDR to solve MSLP.
For the LHDP problem, our analysis highlights several benefits of the proposed framework in comparison to the non-regularized benchmark.
arXiv Detail & Related papers (2021-10-07T02:36:14Z)
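Several of the entries above (e.g., HIA and ABoN) compare against or adapt best-of-N sampling with a reward model. For reference, a minimal, generic sketch of that baseline follows; the `llm` and `reward_model` callables are stand-in assumptions, not any specific paper's implementation.

```python
# Generic best-of-N sampling baseline (a common reference point for the papers
# listed above, not any single paper's implementation): draw N responses and
# return the one the reward model scores highest. Cost grows linearly with N.

def best_of_n(prompt, llm, reward_model, n=8):
    responses = [llm(prompt) for _ in range(n)]  # N response queries to the LLM
    return max(responses, key=lambda r: reward_model(prompt, r))
```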