Fugu-MT Paper Translation (Abstract): MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence

Paper summary: MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence

arxiv url: http://arxiv.org/abs/2510.21406v1
Date: Fri, 24 Oct 2025 12:50:02 GMT
Status: Translation complete
Last updated in system: 2025-10-28 09:00:15.47073
Title: MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence
Title (reference translation): MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence
Authors: Yue Feng, Jinwei Hu, Qijia Lu, Jiawei Niu, Li Tan, Shuo Yuan, Ziyi Yan, Yizhen Jia, Qingzhi He, Shiping Ge, Ethan Q. Chen, Wentong Li, Limin Wang, Jie Qin
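Abstract summary: MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. MUVR consists of 53K videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches.
Reference score (independently computed attention measure): 38.13428814544438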
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR) to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on core video content (e.g., news events, travel locations, dance moves) which users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop 3 versions of MUVR (i.e., Base, Filter, QA). MUVR-Base/Filter evaluates retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K untrimmed videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. Extensive evaluations of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs are conducted. MUVR reveals the limitations of retrieval methods in processing untrimmed videos and multi-modal queries, as well as MLLMs in multi-video understanding and reranking. Our code and benchmark is available at https://github.com/debby-0527/MUVR.
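Abstract (reference translation): We propose the Multi-modal Untrimmed Video Retrieval task. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features. 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries and expresses fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (news, travel, dance, etc.) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on the core video content (news events, travel locations, dance moves, etc.) that users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop three versions of MUVR (Base, Filter, QA). MUVR-Base/Filter evaluates retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. We conduct extensive evaluations of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs. MUVR reveals the limitations of retrieval methods in handling untrimmed videos and multi-modal queries, as well as those of MLLMs in multi-video understanding and reranking. Our code and benchmark are available at https://github.com/debby-0527/MUVR.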

Related papers