Related papers: Optimizing Large Language Models for Dynamic Constraints through Human-in-the-Loop Discriminators

Optimizing Large Language Models for Dynamic Constraints through Human-in-the-Loop Discriminators

URL: http://arxiv.org/abs/2410.15163v2
Date: Thu, 24 Oct 2024 04:46:32 GMT
Title: Optimizing Large Language Models for Dynamic Constraints through Human-in-the-Loop Discriminators
Authors: Timothy Wei, Annabelle Miin, Anastasia Miin,
Abstract summary: Large Language Models (LLMs) have recently demonstrated impressive capabilities across various real-world applications. We propose a flexible framework that enables LLMs to interact with system interfaces, summarize constraint concepts, and continually optimize performance metrics. Our framework achieved a $7.78%$ pass rate with the human discriminator and a $6.11%$ pass rate with the LLM-based discriminator.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have recently demonstrated impressive capabilities across various real-world applications. However, due to the current text-in-text-out paradigm, it remains challenging for LLMs to handle dynamic and complex application constraints, let alone devise general solutions that meet predefined system goals. Current common practices like model finetuning and reflection-based reasoning often address these issues case-by-case, limiting their generalizability. To address this issue, we propose a flexible framework that enables LLMs to interact with system interfaces, summarize constraint concepts, and continually optimize performance metrics by collaborating with human experts. As a case in point, we initialized a travel planner agent by establishing constraints from evaluation interfaces. Then, we employed both LLM-based and human discriminators to identify critical cases and continuously improve agent performance until the desired outcomes were achieved. After just one iteration, our framework achieved a $7.78\%$ pass rate with the human discriminator (a $40.2\%$ improvement over baseline) and a $6.11\%$ pass rate with the LLM-based discriminator. Given the adaptability of our proposal, we believe this framework can be applied to a wide range of constraint-based applications and lay a solid foundation for model finetuning with performance-sensitive data samples.

Related papers

LM4Opt-RA: A Multi-Candidate LLM Framework with Structured Ranking for Automating Network Resource Allocation [0.7933039558471408]
We tackle complex analytical and mathematical reasoning tasks requiring nuanced contextual understanding.<n>Existing benchmarks datasets cannot address the complexities of such problems with dynamic environments, variables, and heterogeneous constraints.<n>We introduce NL4RA, a curated dataset comprising 50 resource allocation optimization problems formulated as LP, ILP, and MILP.<n>We then evaluate the performance of well-known open-source LLMs with varying parameter counts.
arXiv Detail & Related papers (2025-11-13T23:19:43Z)
Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents.<n> ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts.<n>We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
arXiv Detail & Related papers (2025-10-11T18:11:09Z)
Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets [13.111181135818184]
Large language models (LLMs) have shown strong performance on complex mathematical tasks, including optimization.<n>Applying LLMs to matching problems, which require reasoning under preferential and structural constraints, remains underexplored.<n>We employ a novel benchmark of 369 instances of the College Admission Problem to evaluate LLMs across key dimensions: feasibility, stability, and optimality.
arXiv Detail & Related papers (2025-09-16T14:48:46Z)
EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models [65.48902212293903]
We present the Extremely Complex Instruction Following Benchmark (EIFBENCH) for evaluating large language models (LLMs)<n>EIFBENCH includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently.<n>We also propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM's ability to accurately fulfill multi-task workflow.
arXiv Detail & Related papers (2025-06-10T02:39:55Z)
Generalizing Large Language Model Usability Across Resource-Constrained [0.43512163406552007]
dissertation presents a systematic study toward generalizing Large Language Models under real-world constraints.<n>First, it introduces a robust text-centric alignment framework that enables LLMs to seamlessly integrate diverse modalities.<n>Beyond multimodal setting, the dissertation investigates inference-time optimization strategies for LLMs.
arXiv Detail & Related papers (2025-05-13T01:00:12Z)
A Survey on Personalized Alignment -- The Missing Piece for Large Language Models in Real-World Applications [28.181295575180293]
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their transition to real-world applications reveals a critical limitation. This paper presents the first comprehensive survey of personalized alignment. We propose a unified framework comprising preference memory management, personalized generation, and feedback-based alignment.
arXiv Detail & Related papers (2025-03-21T10:09:16Z)
UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality [52.49062565901046]
Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models with human values.<n>Existing approaches struggle to capture the multi-dimensional, distributional nuances of human preferences.<n>We introduce Utility-Conditioned Multi-Objective Alignment (UC-MOA), a novel framework that overcomes these limitations.
arXiv Detail & Related papers (2025-03-10T09:52:42Z)
Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z)
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making [85.24399869971236]
We aim to evaluate Large Language Models (LLMs) for embodied decision making. Existing evaluations tend to rely solely on a final success rate. We propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks.
arXiv Detail & Related papers (2024-10-09T17:59:00Z)
LLM-CI: Assessing Contextual Integrity Norms in Language Models [1.1715858161748576]
Large language models (LLMs) may inadvertently encode societal preferences and norms. This is especially challenging due to prompt sensitivity$-$small variations in prompts yield different responses. We present LLM-CI, the first open-sourced framework to assess encoded norms.
arXiv Detail & Related papers (2024-09-05T17:50:31Z)
SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models [8.558834738072363]
Large language models (LLMs) have been widely adopted due to their remarkable performance across various applications. These individual LLMs show limitations in generalization and performance on complex tasks due to inherent training biases, model size constraints, and the quality or diversity of pre-training datasets. We introduce SelectLLM, which efficiently directs input queries to the most suitable subset of LLMs from a large pool.
arXiv Detail & Related papers (2024-08-16T06:11:21Z)
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models. The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models. Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
Value Augmented Sampling for Language Model Alignment and Personalization [39.070662999014836]
We present a new framework for reward optimization, Value Augmented Sampling (VAS) VAS solves for the optimal reward-maximizing policy without co-training the policy and the value function. Our algorithm unlocks the new capability of composing several rewards and controlling the extent of each one during deployment time.
arXiv Detail & Related papers (2024-05-10T17:59:04Z)
Efficient and Responsible Adaptation of Large Language Models for Robust Top-k Recommendations [11.004673022505566]
Long user queries from millions of users can degrade the performance of large language models for recommendation. We propose a hybrid task allocation framework that utilizes the capabilities of both large language models and traditional recommendation systems. Our results on three real-world datasets show a significant reduction in weak users and improved robustness of RSs to sub-populations.
arXiv Detail & Related papers (2024-05-01T19:11:47Z)
Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs) We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence. Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a benchmark for Fine-grained Constraints Following Benchmark for Large Language Models. We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level. By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z)
Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task [18.623619585980688]
We propose a unified robustness evaluation framework based on the slot-filling task to evaluate the dialogue understanding capability of large language models. Specifically, we construct a input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbation and four types of mixed perturbation data. Our aim is to assess how well various robustness methods of LLMs perform in real-world noisy scenarios.
arXiv Detail & Related papers (2023-10-10T10:22:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.