Related papers: IDRBench: Interactive Deep Research Benchmark

IDRBench: Interactive Deep Research Benchmark

URL: http://arxiv.org/abs/2601.06676v1
Date: Sat, 10 Jan 2026 20:29:12 GMT
Title: IDRBench: Interactive Deep Research Benchmark
Authors: Yingchaojie Feng, Qiang Huang, Xiaoya Xie, Zhaorui Yang, Jun Yu, Wei Chen, Anthony K. H. Tung,
Abstract summary: We introduce IDRBench, the first benchmark for systematically evaluating interactive deep research.<n>IDRBench combines a modular multi-agent research framework with on-demand interaction, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite.
Score: 22.089706516440902
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deep research agents powered by Large Language Models (LLMs) can perform multi-step reasoning, web exploration, and long-form report generation. However, most existing systems operate in an autonomous manner, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, making sustained interaction essential for robust alignment. Despite its importance, interaction remains largely invisible to existing deep research benchmarks, which neither model dynamic user feedback nor quantify its costs. We introduce IDRBench, the first benchmark for systematically evaluating interactive deep research. IDRBench combines a modular multi-agent research framework with on-demand interaction, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite that jointly measures interaction benefits (quality and alignment) and costs (turns and tokens). Experiments across seven state-of-the-art LLMs show that interaction consistently improves research quality and robustness, often outweighing differences in model capacity, while revealing substantial trade-offs in interaction efficiency.

Related papers

Interactive Benchmarks [45.705288760439636]
We propose Interactive Benchmarks, a unified evaluation paradigm that assesses model's reasoning ability in an interactive process under budget constraints.<n>We instantiate this framework across two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long-horizon utilities.
arXiv Detail & Related papers (2026-03-05T02:18:26Z)
Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction [1.3511057160494195]
Leader-follower interaction is an important paradigm in human-robot interaction (HRI)<n>Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated.
arXiv Detail & Related papers (2026-02-26T18:20:26Z)
Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning [66.52010873968383]
We introduce a conversational agent that interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through reinforcement learning (RL) training.<n>The experimental results across four widely used conversational benchmarks demonstrate the effectiveness of our methods.
arXiv Detail & Related papers (2026-01-19T14:55:54Z)
SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent for translating high-level research objectives into standardized experimental configurations.<n>An Experiment Manager orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback.<n>Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials.
arXiv Detail & Related papers (2025-11-29T09:18:39Z)
Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms [55.1784306456972]
Mixture-of-Experts (MoE) architectures have emerged as a promising direction, offering efficiency and scalability by activating only a subset of parameters during inference.<n>We use an internal metric to investigate the mechanisms of MoE architecture by explicitly incorporating routing mechanisms and analyzing expert-level behaviors.<n>We uncover several findings: (1) neuron utilization decreases as models evolve, reflecting stronger generalization; (2) training exhibits a dynamic trajectory, where benchmark performance alone provides limited signal; (3) task completion emerges from collaborative contributions of multiple experts, with shared experts driving concentration; and (4) activation patterns at the neuron level provide a fine-grained proxy for data diversity.
arXiv Detail & Related papers (2025-09-28T15:13:38Z)
KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues [58.305425399644086]
Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains.<n>We introduce textbfKnowMT-Bench, the textitfirst-ever benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields.
arXiv Detail & Related papers (2025-09-26T04:32:29Z)
Test-Time Scaling Strategies for Generative Retrieval in Multimodal Conversational Recommendations [70.94563079082751]
E-commerce has exposed the limitations of traditional product retrieval systems in managing complex, multi-turn user interactions.<n>We propose a novel framework that introduces test-time scaling into conversational multimodal product retrieval.<n>Our approach builds on a generative retriever, further augmented with a test-time reranking mechanism that improves retrieval accuracy and better aligns results with evolving user intent throughout the dialogue.
arXiv Detail & Related papers (2025-08-25T15:38:56Z)
ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models [41.35497807436858]
We introduce ProactiveVideoQA, the first comprehensive benchmark to evaluate a system's ability to engage in proactive interaction.<n>We also propose PAUC, the first metric that accounts for the temporal dynamics of model responses.<n>These findings demonstrate that PAUC provides a more faithful assessment of user experience in proactive interaction scenarios.
arXiv Detail & Related papers (2025-07-12T15:11:50Z)
Efficient Quantification of Multimodal Interaction at Sample Level [12.373485315058513]
We introduce the Lightweight Sample-wise Multimodal Interaction (LSMI) estimator, rigorously grounded in pointwise information theory.<n>We first develop a redundancy estimation framework, employing an appropriate pointwise information measure to quantify this most decomposable interaction.<n>Building upon this, we propose a general interaction estimation method that employs efficient entropy estimation.
arXiv Detail & Related papers (2025-06-08T02:39:25Z)
MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation [56.87891213797931]
We present MTR-Bench for Large Language Models' Multi-Turn Reasoning evaluation.<n>Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities.<n>MTR-Bench features fully-automated framework spanning both dataset constructions and model evaluations.
arXiv Detail & Related papers (2025-05-21T17:59:12Z)
Dynamic benchmarking framework for LLM-based conversational data capture [0.0]
This paper introduces a benchmarking framework to assess large language models (LLMs)<n>It integrates generative agent simulation to evaluate performance on key dimensions: information extraction, context awareness, and adaptive engagement.<n>Results show that adaptive strategies improve data extraction accuracy, especially when handling ambiguous responses.
arXiv Detail & Related papers (2025-02-04T15:47:47Z)
Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance. This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings. Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are speaking activity, support vector machines, and meetings composed of 3-4 persons equipped with microphones and cameras, respectively.
arXiv Detail & Related papers (2022-07-20T13:37:57Z)
Interaction Pattern Disentangling for Multi-Agent Reinforcement Learning [39.4394389642761]
We introduce a novel interactiOn Pattern disenTangling (OPT) method to disentangle the entity interactions into interaction prototypes. OPT facilitates filtering the noisy interactions between irrelevant entities and thus significantly improves generalizability as well as interpretability. Experiments on single-task, multi-task and zero-shot benchmarks demonstrate that the proposed method yields results superior to the state-of-the-art counterparts.
arXiv Detail & Related papers (2022-07-08T13:42:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.