Fugu-MT Paper Translation (Abstract): V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs

Paper overview: V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs

arxiv url: http://arxiv.org/abs/2509.25773v1
Date: Tue, 30 Sep 2025 04:33:52 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.4291
Title: V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs
Title (reference translation): V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs
Authors: Zhengpeng Shi, Hengli Li, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui, Wei Bi, Songchun Zhu, Bo Zhao, Zilong Zheng
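Abstract summary: v-HUB is a visual-centric video humor understanding benchmark. Each video clip is paired with rich annotations, including captions, descriptions, and explanations. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process audio.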
Reference score (independently computed attention score): 72.59885036868499
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel visual-centric video humor understanding benchmark. v-HUB comprises a curated collection of minimally verbal short videos, sourced from classic silent films and online resources, and reflecting real-world scenarios where humor can be appreciated purely through visual cues. Each video clip is paired with rich annotations, including captions, descriptions, and explanations, supporting evaluation tasks like caption matching and humor explanation. To broaden its applicability, we further construct an open-ended video QA task, making it readily integrable into existing video understanding benchmarks. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. For example, all models exhibit a marked performance drop on caption matching when moving from text-based to video-based evaluation (without audio). Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the informativeness of sound and the promise of integrating richer modalities for complex video understanding tasks.
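Abstract (reference translation): AI models that can understand humor hold real-world promise, for example by enhancing engagement in human-machine interaction. To measure and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a new visual-centric video humor understanding benchmark. v-HUB consists of a curated collection of minimally verbal short videos, sourced from classic silent films and online resources, that reflect real-world scenarios in which humor can be appreciated purely through visual cues. Each video clip is paired with rich annotations, including captions, descriptions, and explanations, which support evaluation tasks such as caption matching and humor explanation. To broaden its applicability, we also construct an open-ended video QA task so that the benchmark can be readily integrated into existing video understanding benchmarks. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process audio, covering both open-source and proprietary domains. The experimental results reveal the difficulties MLLMs face in understanding humor from visual cues alone. For example, all models show a marked performance drop on caption matching when moving from text-based to video-based evaluation (without audio). We also show that incorporating audio helps video humor understanding, highlighting the informativeness of sound and the promise of integrating richer modalities into complex video understanding tasks.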

Related papers