ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models
- URL: http://arxiv.org/abs/2507.09313v2
- Date: Tue, 15 Jul 2025 11:48:07 GMT
- Title: ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models
- Authors: Yueqian Wang, Xiaojun Meng, Yifan Wang, Huishuai Zhang, Dongyan Zhao
- Abstract summary: We introduce ProactiveVideoQA, the first comprehensive benchmark to evaluate a system's ability to engage in proactive interaction. We also propose PAUC, the first metric that accounts for the temporal dynamics of model responses. Benchmarking and a user study demonstrate that PAUC provides a more faithful assessment of user experience in proactive interaction scenarios.
- Score: 41.35497807436858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the growing research focus on multimodal dialogue systems, the capability for proactive interaction is gradually gaining recognition. As an alternative to conventional turn-by-turn dialogue, users increasingly expect multimodal systems to take more initiative, for example, by autonomously determining the timing of multi-turn responses in real time during video playback. To facilitate progress in this emerging area, we introduce ProactiveVideoQA, the first comprehensive benchmark to evaluate a system's ability to engage in proactive interaction. Since model responses are generated at varying timestamps, we further propose PAUC, the first metric that accounts for the temporal dynamics of model responses. This enables a more accurate evaluation of systems operating in proactive settings. Through extensive benchmarking of various baseline systems on ProactiveVideoQA and a user study of human preferences, we show that PAUC is in better agreement with human preferences than traditional evaluation metrics, which typically only consider the textual content of responses. These findings demonstrate that PAUC provides a more faithful assessment of user experience in proactive interaction scenarios. Project homepage: https://github.com/yellow-binary-tree/ProactiveVideoQA
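The abstract does not reproduce PAUC's formula, so the following is only a minimal sketch of what a time-aware, AUC-style metric can look like: responses arriving at timestamps, each with a per-response quality score, are integrated over the video timeline so that an equally good answer delivered earlier earns more area. The interface and weighting here are illustrative assumptions, not the authors' definition.

```python
from typing import List, Tuple

def time_weighted_auc(responses: List[Tuple[float, float]], video_len: float) -> float:
    """Illustrative time-aware score (NOT the paper's exact PAUC definition):
    integrate a step function of response quality over the video timeline,
    so equally good answers given earlier contribute more area.
    `responses` holds (timestamp, quality) pairs with quality in [0, 1];
    this interface is an assumption for the sketch.
    """
    # Keep only responses that fall inside the video, sorted by arrival time.
    events = sorted(r for r in responses if 0.0 <= r[0] <= video_len)
    area, prev_t, level = 0.0, 0.0, 0.0
    for t, quality in events:
        area += level * (t - prev_t)      # area under the current plateau
        level = max(level, quality)       # best quality achieved so far
        prev_t = t
    area += level * (video_len - prev_t)  # tail segment up to the video end
    return area / video_len if video_len > 0 else 0.0

# A response of quality 0.9 at t=10s scores higher than the same
# response at t=50s on a 60s video: 0.75 vs 0.15.
print(time_weighted_auc([(10.0, 0.9)], 60.0))
print(time_weighted_auc([(50.0, 0.9)], 60.0))
```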
Related papers
- HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding [79.06209664703258]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z)
- A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment [18.814181652728486]
We propose a noise-robust voice activity projection (VAP) model to enhance real-time turn-taking in dialogue robots. We conducted a field experiment in a shopping mall, comparing the VAP system with a conventional cloud-based speech recognition system. The results showed that the proposed system significantly reduced response latency, leading to a more natural conversation.
arXiv Detail & Related papers (2025-03-08T14:53:20Z)
- Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key interactive behaviors. By releasing our benchmark code, we aim to advance spoken dialogue modeling and the development of more natural and engaging spoken dialogue models (SDMs).
arXiv Detail & Related papers (2025-03-06T18:59:16Z)
- Mind the Gap! Static and Interactive Evaluations of Large Audio Models [55.87220295533817]
Large Audio Models (LAMs) are designed to power voice-native experiences. This study introduces an interactive approach to evaluating LAMs and collects 7,500 LAM interactions from 484 participants.
arXiv Detail & Related papers (2025-02-21T20:29:02Z)
- FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback [33.532239489610056]
FB-Bench is a benchmark designed to evaluate Large Language Models' responsiveness to human feedback under real-world usage scenarios in Chinese. We extensively evaluate a broad array of popular LLMs, revealing significant variations in their performance across different interaction scenarios. Our findings underscore both the strengths and limitations of current models, providing valuable insights and directions for future research.
arXiv Detail & Related papers (2024-10-12T07:40:01Z)
- ProCIS: A Benchmark for Proactive Retrieval in Conversations [21.23826888841565]
We introduce a large-scale dataset for proactive document retrieval that consists of over 2.8 million conversations.
We conduct crowdsourcing experiments to obtain high-quality and relatively complete relevance judgments.
We also collect annotations linking each document to the relevant parts of the conversation, enabling us to evaluate proactive retrieval systems.
arXiv Detail & Related papers (2024-05-10T13:11:07Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
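As a rough illustration of the mechanism this summary describes, prompt tuning with a learned context generator around a frozen backbone, here is a minimal PyTorch-style sketch; module names, dimensions, and the concatenation scheme are assumptions, not DialCLIP's actual architecture.

```python
import torch
import torch.nn as nn

class ContextPromptGenerator(nn.Module):
    """Sketch of prompt tuning around a frozen encoder: a small trainable
    module maps dialog-context features to prompt vectors that are
    prepended to the token embeddings of a frozen CLIP-like model.
    Dimensions and names are illustrative assumptions.
    """
    def __init__(self, ctx_dim: int = 512, embed_dim: int = 512, n_prompts: int = 8):
        super().__init__()
        self.n_prompts = n_prompts
        # Trainable: everything here. The CLIP backbone itself stays frozen.
        self.generator = nn.Sequential(
            nn.Linear(ctx_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, n_prompts * embed_dim),
        )

    def forward(self, ctx_feat: torch.Tensor, token_embeds: torch.Tensor) -> torch.Tensor:
        # ctx_feat: (batch, ctx_dim) pooled multi-modal context feature
        # token_embeds: (batch, seq, embed_dim) frozen CLIP token embeddings
        b, _, d = token_embeds.shape
        prompts = self.generator(ctx_feat).view(b, self.n_prompts, d)
        return torch.cat([prompts, token_embeds], dim=1)  # prepend learned prompts

gen = ContextPromptGenerator()
out = gen(torch.randn(2, 512), torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 24, 512])
```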
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- PICK: Polished & Informed Candidate Scoring for Knowledge-Grounded Dialogue Systems [59.1250765143521]
Current knowledge-grounded dialogue systems often fail to align the generated responses with human-preferred qualities.
We propose Polished & Informed Candidate Scoring (PICK), a generation re-scoring framework.
We demonstrate the effectiveness of PICK in generating responses that are more faithful while keeping them relevant to the dialogue history.
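PICK's concrete scoring functions are not given in this summary; the sketch below shows only the generic shape of a generate-then-re-score step, with toy word-overlap scorers standing in for real faithfulness and relevance models.

```python
from typing import Callable, List

def rescore_candidates(
    candidates: List[str],
    knowledge: str,
    history: List[str],
    faithfulness: Callable[[str, str], float],    # assumed scorer interfaces,
    relevance: Callable[[str, List[str]], float], # not PICK's actual models
    alpha: float = 0.5,
) -> str:
    """Generic re-scoring: combine faithfulness to the grounding knowledge
    with relevance to the dialogue history, then pick the argmax."""
    def combined(resp: str) -> float:
        return alpha * faithfulness(resp, knowledge) + (1 - alpha) * relevance(resp, history)
    return max(candidates, key=combined)

# Toy word-overlap scorers, purely for illustration.
overlap = lambda a, b: float(len(set(a.lower().split()) & set(b.lower().split())))
best = rescore_candidates(
    ["The Eiffel Tower is in Paris.", "I like towers."],
    knowledge="The Eiffel Tower is located in Paris, France.",
    history=["Where is the Eiffel Tower?"],
    faithfulness=overlap,
    relevance=lambda r, h: overlap(r, " ".join(h)),
)
print(best)  # "The Eiffel Tower is in Paris."
```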
arXiv Detail & Related papers (2023-09-19T08:27:09Z)
- Simple Baselines for Interactive Video Retrieval with Questions and Answers [33.17722358007974]
We propose several simple yet effective baselines for interactive video retrieval via question-answering.
We employ a VideoQA model to simulate user interactions and show that this enables the productive study of the interactive retrieval task.
Experiments on MSR-VTT, MSVD, and AVSD show that our framework using question-based interaction significantly improves the performance of text-based video retrieval systems.
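To make the interaction pattern concrete, here is a minimal sketch of a question-based retrieval loop in which a VideoQA model plays the user; all interfaces (retriever, question generator, simulated user, query augmentation) are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, List

def interactive_retrieval(
    query: str,
    rank: Callable[[str], List[str]],      # text-to-video retriever (assumed)
    ask: Callable[[str, List[str]], str],  # question generator (assumed)
    simulated_user: Callable[[str], str],  # VideoQA model answering about the target video
    rounds: int = 3,
) -> List[str]:
    """Sketch of question-based interactive retrieval: each round asks a
    question, a VideoQA model standing in for the user answers it, and
    the answer is appended to the query before re-ranking."""
    for _ in range(rounds):
        ranking = rank(query)
        question = ask(query, ranking)
        answer = simulated_user(question)
        query = f"{query} {answer}"  # augment the query with new evidence
    return rank(query)
```

The paper's experiments run on MSR-VTT, MSVD, and AVSD; the loop above is only the scaffolding such a study needs.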
arXiv Detail & Related papers (2023-08-21T00:32:19Z)
- Our Model Achieves Excellent Performance on MovieLens: What Does it Mean? [43.3971105361606]
We conduct a meticulous analysis of the MovieLens dataset.
User interactions differ significantly across the stages of a user's engagement with the MovieLens platform.
We discuss the discrepancy between the interaction generation mechanism that is employed by the MovieLens system and that of typical real-world recommendation scenarios.
arXiv Detail & Related papers (2023-07-19T13:44:32Z)
- TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest [17.247452803197362]
This paper presents Pinterest's ranking architecture for Homefeed.
We propose TransAct, a sequential model that extracts users' short-term preferences from their realtime activities.
We describe the results of ablation studies, the challenges we faced during productionization, and the outcome of an online A/B experiment.
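The production details are in the paper; as a rough sketch of the core idea, a transformer encoder over a user's realtime action sequence pooled into a short-term preference vector for ranking, consider the following PyTorch-style module (feature choices and sizes are illustrative assumptions).

```python
import torch
import torch.nn as nn

class RealtimeActionEncoder(nn.Module):
    """Sketch of a TransAct-style module: embed a user's recent actions
    (action type + item features), run a small transformer encoder, and
    pool into a short-term preference vector for the ranking model.
    Feature choices and sizes are illustrative assumptions."""
    def __init__(self, n_action_types: int = 16, item_dim: int = 64, d_model: int = 64):
        super().__init__()
        self.action_embed = nn.Embedding(n_action_types, d_model)
        self.item_proj = nn.Linear(item_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, action_types: torch.Tensor, item_feats: torch.Tensor) -> torch.Tensor:
        # action_types: (batch, seq) integer ids; item_feats: (batch, seq, item_dim)
        x = self.action_embed(action_types) + self.item_proj(item_feats)
        h = self.encoder(x)
        return h.mean(dim=1)  # pooled short-term preference vector

enc = RealtimeActionEncoder()
vec = enc(torch.randint(0, 16, (2, 20)), torch.randn(2, 20, 64))
print(vec.shape)  # torch.Size([2, 64])
```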
arXiv Detail & Related papers (2023-05-31T23:45:29Z)
- Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator [37.590563896382456]
We propose an interactive evaluation framework for Task-Oriented Dialogue (TOD) systems.
We first build a goal-oriented user simulator based on pre-trained models and then use the user simulator to interact with the dialogue system to generate dialogues.
Experimental results show that RL-based TOD systems trained by our proposed user simulator can achieve nearly 98% inform and success rates.
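Below is a minimal sketch of the interaction loop such an evaluation implies, with the user simulator, the TOD system, and the end-of-dialogue convention all assumed for illustration; the paper's inform/success metrics are richer than the toy check here.

```python
from typing import Callable, Dict, List

def simulate_dialogue(
    goal: Dict[str, str],
    user_sim: Callable[[Dict[str, str], List[str]], str],  # pretrained user simulator (assumed)
    tod_system: Callable[[List[str]], str],                # dialogue system under test (assumed)
    max_turns: int = 10,
) -> Dict[str, bool]:
    """Sketch of interactive TOD evaluation: a goal-conditioned user
    simulator talks to the system under test until the goal is satisfied
    or the turn budget runs out; success is judged against the goal."""
    history: List[str] = []
    for _ in range(max_turns):
        user_utt = user_sim(goal, history)
        history.append(user_utt)
        if user_utt.strip().lower() == "[done]":  # assumed end-of-dialogue token
            break
        history.append(tod_system(history))
    transcript = " ".join(history).lower()
    # Inform: every requested slot value was mentioned somewhere by the system.
    inform = all(v.lower() in transcript for v in goal.values())
    return {"inform": inform, "success": inform}  # success check simplified for the sketch
```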
arXiv Detail & Related papers (2022-10-26T07:41:32Z)
- Evaluating Interactive Summarization: an Expansion-Based Framework [97.0077722128397]
We develop an end-to-end evaluation framework for interactive summarization.
Our framework includes a procedure for collecting real user sessions and evaluation measures based on established standards.
All of our solutions are intended to be released publicly as a benchmark.
arXiv Detail & Related papers (2020-09-17T15:48:13Z)