Fugu-MT Paper Translation (Abstract): "I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?

Paper overview: "I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?

arxiv url: http://arxiv.org/abs/2604.05930v1
Date: Tue, 07 Apr 2026 14:31:32 GMT
Status: Translation complete
Last updated in system: 2026-04-08 17:42:09.879647
Title: "I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?
Title (reference translation): Can Large Vision-Language Models Understand Multimodal Puns?
Authors: Naen Xu, Jiayi Sheng, Changjiang Li, Chunyi Zhou, Yuyuan Li, Tianyu Du, Jun Wang, Zhihui Fu, Jinbao Li, Shouling Ji
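Abstract summary: Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. Vision-Language Models are widely used for multimodal understanding and generation, but their ability to understand puns has not been systematically studied. We introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation shows that most models struggle to distinguish genuine puns from these distractors.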
Reference score (independently calculated attention level): 52.182269580349605
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.
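Abstract (reference translation): Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements ground the literal sense and simultaneously evoke the figurative meaning. Vision-Language Models (VLMs) are widely used for multimodal understanding and generation, but because rigorous benchmarks are scarce, their ability to understand puns has not been systematically studied. We therefore first propose a multimodal pun generation pipeline. Next, we introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation shows that most models struggle to distinguish genuine puns from these distractors. We further propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. This work provides useful insights for developing future VLMs that master the subtleties of human-like humor through cross-modal reasoning.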
