FuguReport

MetaphorVU: Towards Metaphorical Video Understanding

Authors: Zhuoqun Li, Boxi Cao, Guiping Jiang, Fangrui Lv, Ruotong Pan, Jianan Wang, Xiangyu Wu, Hongyu Lin, Yaojie Lu, Yong Du, Ruyin Jia, Liyan, Tingting Gao, Han Li, Xianpei Han, Le Sun
Affiliations: Chinese Academy of Sciences / Tsinghua University / Kuaishou Technology
Categories: Method / Knowledge Graph / Metaphor knowledge graph construction; Task / Video Understanding / Metaphorical video comprehension; Method / Inference Optimization / Extended inference framework for metaphor mapping
License: CC BY 4.0

Abstract Overview

This paper introduces MetaphorVU-Bench, a benchmark for metaphorical video understanding built around a systematic taxonomy of 8 video-metaphor types. The benchmark contains 860 manually validated real-world short videos collected through multi-stage filtering and annotation, with evaluation focused on free-form interpretations of which visual elements convey which implicit meanings. Experiments across a range of multimodal large language models show that current systems remain well below human performance on this task. The authors argue that the main bottleneck is defective cross-domain mapping from visual elements to underlying concepts, and they propose MetaphorBoost to address this through inference-time augmentation with a metaphorical knowledge graph.

Novelty

The work appears to be the first systematic benchmark specifically dedicated to metaphorical video understanding, rather than metaphor in text, images, or a narrow video domain such as advertising. It also introduces a metaphor-oriented knowledge graph and an inference-time framework that uses it to support cross-domain metaphor mapping during video interpretation.

Results

On MetaphorVU-Bench, the strongest baseline models score around 63.7-63.8 on average, compared with 83.4 for the sampled human upper bound, indicating a substantial gap. Error analysis attributes most failures to missing, superficial, or improper cross-domain mapping rather than basic recognition errors. MetaphorBoost yields consistent gains across tested backbones, improving Gemini-3-Pro from 63.8 to 66.1, Qwen3-VL-8B-Thinking from 52.0 to 55.9, and Qwen2.5-VL-7B-Instruct from 33.8 to 37.9.
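
The reported gains are easier to compare as absolute improvements and as a share of the remaining gap to the human score. The short script below is only an illustrative calculation over the numbers quoted above; the function name `summarize` and the "share of gap closed" metric are assumptions made for this sketch, not quantities reported in the paper.

```python
# Average benchmark scores quoted in the Results paragraph.
HUMAN = 83.4  # sampled human upper bound
SCORES = {
    # model: (baseline score, score with MetaphorBoost)
    "Gemini-3-Pro": (63.8, 66.1),
    "Qwen3-VL-8B-Thinking": (52.0, 55.9),
    "Qwen2.5-VL-7B-Instruct": (33.8, 37.9),
}


def summarize(base, boosted, human=HUMAN):
    """Return (gain, gap before, gap after, % of gap closed), rounded to 0.1."""
    gain = round(boosted - base, 1)
    gap_before = round(human - base, 1)
    gap_after = round(human - boosted, 1)
    # Illustrative metric (an assumption of this sketch, not from the paper).
    closed_pct = round(100 * gain / gap_before, 1)
    return gain, gap_before, gap_after, closed_pct


summary = {name: summarize(*pair) for name, pair in SCORES.items()}

for name, (gain, gap_before, gap_after, closed_pct) in summary.items():
    print(f"{name}: +{gain} points, gap to human {gap_before} -> {gap_after} "
          f"({closed_pct}% of the gap closed)")
```

The absolute gains work out to +2.3, +3.9, and +4.1 points respectively, so every boosted model still sits more than 17 points below the human score.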

Key Points

  1. MetaphorVU-Bench organizes metaphorical video understanding into 8 taxonomy categories and includes 860 rigorously filtered and annotated real-world videos.
  2. Current MLLMs lag clearly behind human performance on metaphorical video interpretation (roughly 63.8 for the strongest baselines versus 83.4 for humans), with the main weakness traced to cross-domain mapping rather than visual recognition alone.
  3. The proposed MetaphorBoost method uses a 54,687-node, 200,268-edge metaphorical knowledge graph for inference-time augmentation and delivers consistent improvements over multiple base models.
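
To make the idea of inference-time augmentation with a metaphor graph concrete, here is a minimal sketch. It is not the authors' MetaphorBoost implementation: the toy edges, the relation label, and the helpers `build_index` and `augment_prompt` are all hypothetical, and the summary above does not describe how the real system retrieves or injects graph knowledge.

```python
from collections import defaultdict

# Toy metaphor graph. Each edge links a visual (source-domain) element to an
# abstract (target-domain) concept. All entries are invented for illustration.
EDGES = [
    ("wilting flower", "symbolizes", "fading youth"),
    ("wilting flower", "symbolizes", "neglect"),
    ("caged bird", "symbolizes", "lost freedom"),
]


def build_index(edges):
    """Index edges by source element for direct lookup."""
    index = defaultdict(list)
    for source, relation, target in edges:
        index[source].append((relation, target))
    return index


def augment_prompt(question, detected_elements, index, max_hints=3):
    """Append candidate metaphor mappings for detected elements to a prompt.

    `detected_elements` stands in for the output of a hypothetical upstream
    visual recognition step; at most `max_hints` mappings are kept per element.
    """
    hints = []
    for element in detected_elements:
        for relation, target in index.get(element, [])[:max_hints]:
            hints.append(f"- {element} {relation} {target}")
    if not hints:
        return question  # nothing known about these elements
    return question + "\n\nCandidate metaphor mappings:\n" + "\n".join(hints)


index = build_index(EDGES)
prompt = augment_prompt(
    "What implicit meaning does this video convey?",
    ["caged bird", "rain"],
    index,
)
print(prompt)
```

In this sketch the graph only supplies candidate mappings; the model is still expected to decide which of them, if any, fits the video.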
