IVCR-200K: A Large-Scale Multi-turn Dialogue Benchmark for Interactive Video Corpus Retrieval
- URL: http://arxiv.org/abs/2512.01312v1
- Date: Mon, 01 Dec 2025 06:12:59 GMT
- Title: IVCR-200K: A Large-Scale Multi-turn Dialogue Benchmark for Interactive Video Corpus Retrieval
- Authors: Ning Han, Yawen Zeng, Shaohua Long, Chengqing Li, Sijie Yang, Dun Tan, Jianfeng Dong, Jingjing Chen,
- Abstract summary: The Interactive Video Corpus Retrieval (IVCR) task enables multi-turn, conversational, and realistic interactions between the user and the retrieval system. IVCR-200K is a high-quality, bilingual, multi-turn, conversational, and abstract-semantic dataset that supports video retrieval and even moment retrieval. We propose a comprehensive framework based on multi-modal large language models (MLLMs) to help users interact in several modes with more explainable solutions.
- Score: 36.33423199468626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, significant progress has been made on both video retrieval and video moment retrieval, which retrieve complete videos or specific moments, respectively, for a given text query. These advances have greatly improved user satisfaction during the search process. However, previous work has failed to establish meaningful "interaction" between the retrieval system and the user, and its one-way retrieval paradigm can no longer fully meet the personalized and dynamic needs of at least 80.8% of users. In this paper, we introduce the Interactive Video Corpus Retrieval (IVCR) task, a more realistic setting that enables multi-turn, conversational interactions between the user and the retrieval system. To facilitate research on this challenging task, we introduce IVCR-200K, a high-quality, bilingual, multi-turn, conversational dataset with abstract semantics that supports both video retrieval and moment retrieval. Furthermore, we propose a comprehensive framework based on multi-modal large language models (MLLMs) that helps users interact in several modes with more explainable solutions. Extensive experiments demonstrate the effectiveness of our dataset and framework.
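To make the interaction paradigm above concrete, the sketch below shows a minimal multi-turn retrieval loop in Python: each user turn is appended to a dialogue state, the history is collapsed into one contextual query, and the corpus is re-ranked by cosine similarity. Everything here (DialogueState, embed_text, the hash-based pseudo-embeddings) is a hypothetical placeholder for illustration, not the authors' MLLM-based framework.

```python
# Minimal sketch of a multi-turn interactive video retrieval loop.
# All names and the hash-based encoder are hypothetical placeholders.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class DialogueState:
    """Accumulates the conversation; each turn refines the query."""
    turns: list[str] = field(default_factory=list)

    def add(self, utterance: str) -> None:
        self.turns.append(utterance)

    def contextual_query(self) -> str:
        # A real system would have an MLLM rewrite the history into a single
        # self-contained query; here we simply concatenate the turns.
        return " ; ".join(self.turns)


def embed_text(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder encoder: a deterministic pseudo-embedding seeded by a hash.
    A real system would use a multimodal encoder shared with the video side."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)


def retrieve(query: str, video_embs: np.ndarray, k: int = 3) -> list[int]:
    """Rank videos by cosine similarity to the dialogue-contextualized query."""
    q = embed_text(query)
    scores = video_embs @ q  # embeddings are already L2-normalized
    return list(np.argsort(-scores)[:k])


if __name__ == "__main__":
    # Toy corpus of five "videos", each represented by a precomputed embedding.
    corpus = np.stack([embed_text(f"video_{i}") for i in range(5)])

    state = DialogueState()
    for user_turn in ["a cooking tutorial", "specifically about making ramen"]:
        state.add(user_turn)
        print(user_turn, "->", retrieve(state.contextual_query(), corpus))
```

In a full system the history-to-query step would be handled by an MLLM that can also decide whether to return whole videos or specific moments, which is roughly the interaction mode the abstract describes.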
Related papers
- MAGMaR Shared Task System Description: Video Retrieval with OmniEmbed [55.526939500742]
We use OmniEmbed, a powerful multimodal embedding model from the Tevatron 2.0 toolkit, to generate unified embeddings for text, images, audio, and video. Our submission achieved the highest score on the MAGMaR shared task leaderboard among public submissions as of May 20th, 2025.
arXiv Detail & Related papers (2025-06-11T05:40:26Z) - CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval [70.9990850395981]
We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes four modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR is trained to enhance dynamic modality selection via two key innovations (a minimal late-interaction scoring sketch appears after this list).
arXiv Detail & Related papers (2025-06-06T15:02:30Z) - Respond Beyond Language: A Benchmark for Video Generation in Response to Realistic User Intents [30.228721661677493]
RealVideoQuest is designed to evaluate the abilities of text-to-video (T2V) models in answering real-world, visually grounded queries. It identifies 7.5K real user queries with video response intents and builds 4.5K high-quality query-video pairs. Experiments indicate that current T2V models struggle with effectively addressing real user queries.
arXiv Detail & Related papers (2025-06-02T13:52:21Z) - Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs) [3.783822944546971]
Vision-language models (VLMs) excel in representation learning, but struggle with adaptive, time-sensitive video retrieval. This paper introduces a novel framework that combines vector similarity search with graph-based data structures.
arXiv Detail & Related papers (2025-03-21T01:11:14Z) - Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning [56.873534081386]
A new setting, HIREST, is presented, including video retrieval, moment retrieval, moment segmentation, and step-captioning. We propose a query-centric audio-visual cognition network to construct a reliable multi-modal representation for the three tasks, which cognizes user-preferred content and thus attains a query-centric audio-visual representation.
arXiv Detail & Related papers (2024-12-18T06:43:06Z) - MIRe: Enhancing Multimodal Queries Representation via Fusion-Free Modality Interaction for Multimodal Retrieval [26.585985828583304]
We introduce MIRe, a retrieval framework that achieves modality interaction without fusing textual features during the alignment. Our method allows the textual query to attend to visual embeddings while not feeding text-driven signals back into the visual representations. Our experiments demonstrate that our pre-training strategy significantly enhances the understanding of multimodal queries.
arXiv Detail & Related papers (2024-11-13T04:32:58Z) - MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval [57.891157692501345]
MultiVENT 2.0 is a large-scale, multilingual, event-centric video retrieval benchmark. It features a collection of more than 218,000 news videos and 3,906 queries targeting specific world events. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task.
arXiv Detail & Related papers (2024-10-15T13:56:34Z) - Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach [56.610806615527885]
A key challenge in text-video retrieval (TVR) is the information asymmetry between video and text. This paper introduces a data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content. We propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy.
arXiv Detail & Related papers (2024-08-14T01:24:09Z) - CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z) - Simple Baselines for Interactive Video Retrieval with Questions and Answers [33.17722358007974]
We propose several simple yet effective baselines for interactive video retrieval via question-answering.
We employ a VideoQA model to simulate user interactions and show that this enables the productive study of the interactive retrieval task.
Experiments on MSR-VTT, MSVD, and AVSD show that our framework using question-based interaction significantly improves the performance of text-based video retrieval systems.
arXiv Detail & Related papers (2023-08-21T00:32:19Z) - Learning to Retrieve Videos by Asking Questions [29.046045230398708]
We propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog.
The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize the subsequent video retrieval performance.
We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems.
arXiv Detail & Related papers (2022-05-11T19:14:39Z)
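Following up on the CLaMR entry above, the sketch below illustrates generic ColBERT-style late-interaction (MaxSim) scoring over several per-modality token streams. The stream layout, the max-over-streams rule, and all names here are illustrative assumptions, not the CLaMR implementation; its learned dynamic modality selection is more involved than this crude maximum.

```python
# Illustrative late-interaction (MaxSim) scoring over per-modality token streams.
# Not the CLaMR implementation; the max over streams is a crude stand-in for
# learned dynamic modality selection.
import numpy as np


def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late interaction: each query token keeps its best-matching document token;
    the per-token maxima are summed. Inputs are (n_tokens, dim), L2-normalized."""
    sim = query_tokens @ doc_tokens.T  # (n_query, n_doc) cosine similarities
    return float(sim.max(axis=1).sum())


def score_video(query_tokens: np.ndarray,
                modality_streams: dict[str, np.ndarray]) -> float:
    """Score a video by the best-scoring modality stream for this query."""
    return max(maxsim(query_tokens, tokens) for tokens in modality_streams.values())


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    def l2norm(x: np.ndarray) -> np.ndarray:
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    query = l2norm(rng.standard_normal((4, 32)))           # 4 query tokens
    video = {
        "frames": l2norm(rng.standard_normal((16, 32))),   # visual tokens
        "speech": l2norm(rng.standard_normal((24, 32))),   # ASR transcript tokens
        "ocr": l2norm(rng.standard_normal((8, 32))),       # on-screen text tokens
        "metadata": l2norm(rng.standard_normal((6, 32))),  # title/description tokens
    }
    print("late-interaction score:", round(score_video(query, video), 3))
```

The appeal of late interaction is that query and document tokens are only compared at scoring time, so each modality stream can be encoded and indexed independently.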