Interactive Benchmarks
- URL: http://arxiv.org/abs/2603.04737v1
- Date: Thu, 05 Mar 2026 02:18:26 GMT
- Title: Interactive Benchmarks
- Authors: Baoqing Yue, Zihan Zhu, Yifan Zhang, Jichen Feng, Hufei Yang, Mengdi Wang,
- Abstract summary: We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability in an interactive process under budget constraints. We instantiate this framework across two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long-horizon utilities.
- Score: 45.705288760439636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating a model's ability to actively acquire information is important for assessing its intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability in an interactive process under budget constraints. We instantiate this framework across two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a robust and faithful assessment of model intelligence, revealing that there is still substantial room for improvement in interactive scenarios. Project page: https://github.com/interactivebench/interactivebench
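The paradigm can be read as a query loop with metered information access: each question to the judge consumes one unit of budget, and the model is scored on the conclusion it commits to before the budget runs out. Below is a minimal sketch of that protocol in Python; it is an illustration under stated assumptions, not the paper's implementation, and the `model`/`judge` interfaces and method names (`next_move`, `respond`, `final_answer`) are hypothetical (see the project page for the actual framework).

```python
from dataclasses import dataclass, field

@dataclass
class Move:
    kind: str      # "query" = ask the judge, "answer" = commit to a conclusion
    content: str

@dataclass
class Transcript:
    turns: list = field(default_factory=list)  # (query, judge reply) pairs
    final_answer: str | None = None

def run_interactive_proof(model, judge, budget: int) -> Transcript:
    """Budget-constrained interaction loop (illustrative sketch).

    Each judge query costs one unit of budget; the model is scored only
    on the final answer it commits to. `model` and `judge` are assumed
    duck-typed objects, not interfaces from the paper.
    """
    transcript = Transcript()
    while budget > 0:
        move = model.next_move(transcript)
        if move.kind == "answer":              # model commits early
            transcript.final_answer = move.content
            return transcript
        reply = judge.respond(move.content)    # information costs budget
        transcript.turns.append((move.content, reply))
        budget -= 1
    # Budget exhausted: force a commitment with no further information.
    transcript.final_answer = model.final_answer(transcript)
    return transcript
```

Under this framing, models can be compared at matched budgets: one that reaches the correct conclusion with fewer judge queries is demonstrating exactly the active information acquisition the abstract argues for.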
Related papers
- [Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games [0.0]
Large Language Models (LLMs) demonstrate significant potential in multi-agent negotiation tasks.
This study investigates the thoroughness of a negotiation benchmark based on Scoreable Games.
Our results highlight the importance of context in model-comparative evaluations.
arXiv Detail & Related papers (2026-02-20T14:11:31Z)
- IDRBench: Interactive Deep Research Benchmark [22.089706516440902]
We introduce IDRBench, the first benchmark for systematically evaluating interactive deep research.
IDRBench combines a modular multi-agent research framework with on-demand interaction, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite.
arXiv Detail & Related papers (2026-01-10T20:29:12Z)
- Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection [71.8243083897721]
Vision-language models often hallucinate details, generating non-existent objects or inaccurate attributes that compromise output reliability.
We present a novel framework that leverages the model's self-consistency between long responses and short answers to generate preference pairs for training.
arXiv Detail & Related papers (2025-09-27T10:37:11Z)
- MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation [56.87891213797931]
We present MTR-Bench for Large Language Models' Multi-Turn Reasoning evaluation.
Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities.
MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation.
arXiv Detail & Related papers (2025-05-21T17:59:12Z)
- A Statistical Framework for Ranking LLM-Based Chatbots [57.59268154690763]
We propose a statistical framework that incorporates key advancements to address specific challenges in pairwise comparison analysis.
First, we introduce a factored tie model that enhances the ability to handle groupings of human-judged comparisons.
Second, we extend the framework to model covariance between competitors, enabling deeper insights into performance relationships.
Third, we resolve optimization challenges arising from parameter non-uniqueness by introducing novel constraints.
(A minimal sketch of a classical tie model appears after this list.)
arXiv Detail & Related papers (2024-12-24T12:54:19Z)
- Can foundation models actively gather information in interactive environments to test hypotheses? [43.42688356541211]
Foundation models excel at single-turn reasoning but struggle with multi-turn exploration in dynamic environments.
We evaluated these models on their ability to learn from experience, adapt, and gather information.
arXiv Detail & Related papers (2024-12-09T12:27:21Z)
- PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation [0.0]
We introduce a benchmark for evaluating the role-playing capabilities of language models.
We leverage different language models to simulate users in dynamic, multi-turn conversations and assess the resulting dialogues.
We evaluated more than 40 models in both English and Russian, with each model participating in 64 conversations (8 characters × 8 situations).
arXiv Detail & Related papers (2024-09-10T19:00:44Z)
- TETRIS: Towards Exploring the Robustness of Interactive Segmentation [39.1981941213761]
We propose a methodology for finding extreme user inputs via direct optimization in a white-box adversarial attack on the interactive segmentation model.
We report the results of an extensive evaluation of dozens of models.
arXiv Detail & Related papers (2024-02-09T01:36:21Z)
- JAB: Joint Adversarial Prompting and Belief Augmentation [81.39548637776365]
We introduce a joint framework in which we probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation.
This framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes.
arXiv Detail & Related papers (2023-11-16T00:35:54Z)
- Pseudointelligence: A Unifying Framework for Language Model Evaluation [14.95543156914676]
We propose a complexity-theoretic framework of model evaluation cast as a dynamic interaction between a model and a learned evaluator.
We demonstrate that this framework can be used to reason about two case studies in language model evaluation, as well as analyze existing evaluation methods.
arXiv Detail & Related papers (2023-10-18T17:48:05Z)
- Are Neural Topic Models Broken? [81.15470302729638]
We study the relationship between automated and human evaluation of topic models.
We find that neural topic models fare worse in both respects compared to an established classical method.
arXiv Detail & Related papers (2022-10-28T14:38:50Z)
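The ranking entry above (A Statistical Framework for Ranking LLM-Based Chatbots) rests on pairwise-comparison models that must handle ties between competitors. As a rough point of reference, not the paper's factored tie model, the classical Davidson extension of Bradley-Terry adds a single tie parameter `nu`; fitting strengths by minimizing a likelihood like the sketch below requires a normalization constraint, which is exactly where the parameter non-uniqueness the paper resolves shows up.

```python
import math

def davidson_probs(p_i: float, p_j: float, nu: float):
    """Davidson (1970) tie extension of Bradley-Terry: illustration only;
    the paper's factored tie model is more elaborate."""
    tie_mass = nu * math.sqrt(p_i * p_j)
    z = p_i + p_j + tie_mass
    return p_i / z, p_j / z, tie_mass / z  # P(i wins), P(j wins), P(tie)

def neg_log_likelihood(strengths: dict, nu: float, outcomes: list) -> float:
    """outcomes: (i, j, result) triples with result in {"i", "j", "tie"}.
    Strengths are identifiable only up to scale, so a constraint such as
    sum(strengths.values()) == 1 is needed before optimizing."""
    nll = 0.0
    for i, j, result in outcomes:
        w_i, w_j, tie = davidson_probs(strengths[i], strengths[j], nu)
        nll -= math.log({"i": w_i, "j": w_j, "tie": tie}[result])
    return nll
```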