Fugu-MT Paper Translation (Abstract): ViMU: Benchmarking Video Metaphorical Understanding

Paper abstract: ViMU: Benchmarking Video Metaphorical Understanding

arxiv url: http://arxiv.org/abs/2605.14607v1
Date: Thu, 14 May 2026 09:23:59 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.749696
Title: ViMU: Benchmarking Video Metaphorical Understanding
Title (reference translation): ViMU: A Benchmark for Video Metaphorical Understanding
Authors: Qi Li, Xinchao Wang
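Abstract summary: ViMU is a benchmark designed to evaluate the subtext understanding capabilities of frontier models on video. It assesses whether video understanding models can go beyond literal perception to infer implicit meaning. All questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.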
Reference score (independently computed attention metric): 58.432996881401415
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: one is the content directly presented, while the other is the subtext beneath it, namely the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some forms of such video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate the subtext understanding capabilities of frontier models in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence and answering both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.
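Abstract (reference translation): Once a new medium emerges, it is used for more than transmitting overt content alone. The information it carries typically operates on two levels: one is the content presented directly, and the other is the subtext beneath it, namely the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technology became widely adopted, video has served not only as a powerful tool for recording and communicating visual information but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. The true meaning of many videos therefore does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some video subtext is humorous, while other subtext carries irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, and temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark to systematically evaluate the subtext understanding capabilities of frontier models on video. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence and answering both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.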


Related papers