Respond Beyond Language: A Benchmark for Video Generation in Response to Realistic User Intents
- URL: http://arxiv.org/abs/2506.01689v1
- Date: Mon, 02 Jun 2025 13:52:21 GMT
- Title: Respond Beyond Language: A Benchmark for Video Generation in Response to Realistic User Intents
- Authors: Shuting Wang, Yunqi Liu, Zixin Yang, Ning Hu, Zhicheng Dou, Chenyan Xiong
- Abstract summary: RealVideoQuest is designed to evaluate the abilities of text-to-video (T2V) models in answering real-world, visually grounded queries. It identifies 7.5K real user queries with video response intents and builds 4.5K high-quality query-video pairs. Experiments indicate that current T2V models struggle with effectively addressing real user queries.
- Score: 30.228721661677493
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Querying generative AI models, e.g., large language models (LLMs), has become a prevalent method for information acquisition. However, existing query-answer datasets primarily focus on textual responses, making it challenging to address complex user queries that require visual demonstrations or explanations for better understanding. To bridge this gap, we construct a benchmark, RealVideoQuest, designed to evaluate the abilities of text-to-video (T2V) models in answering real-world, visually grounded queries. It identifies 7.5K real user queries with video response intents from Chatbot-Arena and builds 4.5K high-quality query-video pairs through a multistage video retrieval and refinement process. We further develop a multi-angle evaluation system to assess the quality of generated video answers. Experiments indicate that current T2V models struggle with effectively addressing real user queries, pointing to key challenges and future research opportunities in multimodal AI.
Related papers
- ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts [56.75723197779384]
ARC-Hunyuan-Video is a multimodal model that processes visual, audio, and textual signals end-to-end for structured comprehension. Our model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning.
arXiv Detail & Related papers (2025-07-28T15:52:36Z) - Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models [21.966865098520277]
Video Large Language Models (Video-LLMs) are predominantly trained on questions generated directly from video content. In real-world scenarios, users often pose questions that extend beyond the informational scope of the video. We propose alignment for answerability, a framework that equips Video-LLMs with the ability to evaluate the relevance of a question based on the input video.
arXiv Detail & Related papers (2025-07-07T13:19:43Z) - MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in the QA task on our proposed AVHaystacks.
arXiv Detail & Related papers (2025-06-08T06:34:29Z) - Vidi: Large Multimodal Models for Video Understanding and Editing [33.56852569192024]
We introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing scenarios. The first release focuses on temporal retrieval, identifying the time ranges within the input videos corresponding to a given text query. We also present the VUE-TR benchmark, which introduces five key advancements.
arXiv Detail & Related papers (2025-04-22T08:04:45Z) - Lost in Time: A New Temporal Benchmark for VideoLLMs [48.71203934876828]
We show that the currently most used video-language benchmarks can be solved without requiring much temporal reasoning. We propose TVBench, a novel open-source video multiple-choice question-answering benchmark.
arXiv Detail & Related papers (2024-10-10T09:28:36Z) - Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs). We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation. We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z) - How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z) - Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models [69.59125732317972]
We propose a simple yet effective Retrieving-to-Answer (R2A) framework for VideoQA.
R2A first retrieves a set of semantically similar texts from a generic text corpus using a pre-trained multi-modal model.
With both the question and the retrieved texts, an LLM can be directly used to yield a desired answer.
arXiv Detail & Related papers (2023-06-15T20:56:20Z) - Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z) - Learning to Retrieve Videos by Asking Questions [29.046045230398708]
We propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog.
The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize the subsequent video retrieval performance.
We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems.
arXiv Detail & Related papers (2022-05-11T19:14:39Z) - Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework [19.031957183047048]
We introduce a novel dataset consisting of 28,000 videos and fill-in-the-blank tests.
We show that both a multimodal model and a strong language model fall well short of human performance.
arXiv Detail & Related papers (2021-04-09T04:00:10Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass the more relevant information forward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.