Evaluating Language Model Agency through Negotiations
- URL: http://arxiv.org/abs/2401.04536v2
- Date: Sat, 16 Mar 2024 16:41:48 GMT
- Title: Evaluating Language Model Agency through Negotiations
- Authors: Tim R. Davidson, Veniamin Veselovsky, Martin Josifoski, Maxime Peyrard, Antoine Bosselut, Michal Kosinski, Robert West
- Abstract summary: Negotiation games enable us to study multi-turn and cross-model interactions, modulate complexity, and side-step accidental evaluation data leakage.
We use our approach to test six widely used and publicly accessible LMs, evaluating performance and alignment in both self-play and cross-play settings.
- Score: 39.87262815823634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce an approach to evaluate language model (LM) agency using negotiation games. This approach better reflects real-world use cases and addresses some of the shortcomings of alternative LM benchmarks. Negotiation games enable us to study multi-turn and cross-model interactions, modulate complexity, and side-step accidental evaluation data leakage. We use our approach to test six widely used and publicly accessible LMs, evaluating performance and alignment in both self-play and cross-play settings. Noteworthy findings include: (i) only closed-source models tested here were able to complete these tasks; (ii) cooperative bargaining games proved to be most challenging to the models; and (iii) even the most powerful models sometimes "lose" to weaker opponents.
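As a rough illustration of the protocol the abstract describes (a minimal sketch, not the authors' implementation), the Python snippet below alternates turns between two agents in a negotiation until one signals agreement or a turn budget runs out; in self-play both agents wrap the same LM, while in cross-play they wrap different LMs. The `LMAgent` interface, the `ACCEPT` convention, and the toy bargaining agents are all hypothetical placeholders.

```python
from typing import Callable, List

# Hypothetical interface: an agent maps the transcript so far to its next message.
# In practice this would wrap an LM API call; here it is only a placeholder.
LMAgent = Callable[[List[str]], str]

def negotiate(agent_a: LMAgent, agent_b: LMAgent, max_turns: int = 10) -> List[str]:
    """Alternate turns until a message contains 'ACCEPT' or the budget is spent.

    Self-play: agent_a and agent_b wrap the same model.
    Cross-play: they wrap different models.
    """
    transcript: List[str] = []
    agents = (agent_a, agent_b)
    for turn in range(max_turns):
        message = agents[turn % 2](transcript)
        transcript.append(message)
        if "ACCEPT" in message:  # crude agreement check, illustration only
            break
    return transcript

# Toy scripted "agents" standing in for real LMs so the loop is runnable.
def seller(history: List[str]) -> str:
    return "I will not go below $90."

def buyer(history: List[str]) -> str:
    return "ACCEPT: $90 works." if history else "I offer $70."

if __name__ == "__main__":
    for line in negotiate(seller, buyer):
        print(line)
```

A real harness would additionally score the final outcome (e.g., each side's surplus relative to a hypothetical reservation price) and compare results across self-play and cross-play pairings, in the spirit of the performance and alignment evaluation the abstract mentions.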
Related papers
- Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles [20.18736445118689]
We introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit the lateral thinking of Large Language Models (LLMs).
This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation.
Experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy.
arXiv Detail & Related papers (2024-10-09T10:09:11Z)
- Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models [13.532180752491954]
We demonstrate a dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales.
The breakdown is dramatic, as models show strong fluctuations across even slight problem variations that should not affect problem solving.
We take these initial observations to stimulate an urgent re-assessment of the claimed capabilities of the current generation of Large Language Models.
arXiv Detail & Related papers (2024-06-04T07:43:33Z)
- SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation [23.203761925540736]
We propose a novel framework, SLIDE (Small and Large Integrated for Dialogue Evaluation).
Our approach achieves state-of-the-art performance in both the classification and evaluation tasks, and SLIDE additionally exhibits better correlation with human evaluators.
arXiv Detail & Related papers (2024-05-24T20:32:49Z)
- Ranking Large Language Models without Ground Truth [24.751931637152524]
The evaluation and ranking of large language models (LLMs) has become an important problem with the proliferation of these models.
We provide a novel perspective where, given a dataset of prompts, we rank LLMs without access to any ground truth or reference responses.
Applying this idea repeatedly, we propose two methods to rank LLMs.
arXiv Detail & Related papers (2024-02-21T00:49:43Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
- Teaching Language Models to Self-Improve through Interactive Demonstrations [83.9421355808174]
The self-improvement ability of large language models has been shown to be absent in, and difficult to learn for, smaller models.
We introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability.
We show that our approach can improve LLaMA-7b's performance on math and reasoning tasks by up to 7.13%.
arXiv Detail & Related papers (2023-10-20T14:11:04Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- GameEval: Evaluating LLMs on Conversational Games [93.40433639746331]
We propose GameEval, a novel approach to evaluating large language models (LLMs).
GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms.
We show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems.
arXiv Detail & Related papers (2023-08-19T14:33:40Z)
- Unlocking the Potential of User Feedback: Leveraging Large Language Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach, User-Guided Response Optimization (UGRO), which combines a large language model with a smaller task-oriented dialogue (TOD) model.
This approach uses the LLM as an annotation-free user simulator to assess dialogue responses, combining it with smaller fine-tuned end-to-end TOD models.
Our approach outperforms previous state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-06-16T13:04:56Z)