MetaphorVU: Towards Metaphorical Video Understanding
Abstract Overview
This paper introduces MetaphorVU-Bench, a benchmark for metaphorical video understanding built around a systematic taxonomy of 8 video-metaphor types. The benchmark contains 860 manually validated real-world short videos collected through multi-stage filtering and annotation, with evaluation focused on free-form interpretations of which visual elements convey which implicit meanings. Experiments across a range of multimodal large language models show that current systems remain well below human performance on this task. The authors argue that the main bottleneck is defective cross-domain mapping from visual elements to underlying concepts, and they propose MetaphorBoost to address this through inference-time augmentation with a metaphorical knowledge graph.
Novelty
The work appears to be the first systematic benchmark specifically dedicated to metaphorical video understanding, rather than metaphor in text, images, or a narrow video domain such as advertising. It also introduces a metaphor-oriented knowledge graph and an inference-time framework that uses it to support cross-domain metaphor mapping during video interpretation.
Results
On MetaphorVU-Bench, the strongest baseline models average 63.7 to 63.8, compared with 83.4 for the sampled human upper bound, a gap of roughly 20 points. Error analysis attributes most failures to missing, superficial, or improper cross-domain mapping rather than to basic recognition errors. MetaphorBoost yields consistent gains across the tested backbones, improving Gemini-3-Pro from 63.8 to 66.1, Qwen3-VL-8B-Thinking from 52.0 to 55.9, and Qwen2.5-VL-7B-Instruct from 33.8 to 37.9.
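The reported numbers are easier to compare once the gains and remaining gaps are worked out. The short script below does only that arithmetic on the figures quoted above; the names `SCORES`, `HUMAN`, and `summarize` are illustrative and do not come from the paper or its code release.

```python
# Average scores copied from the Results paragraph (higher is better).
HUMAN = 83.4  # sampled human upper bound
SCORES = {
    # model: (baseline score, score with MetaphorBoost)
    "Gemini-3-Pro": (63.8, 66.1),
    "Qwen3-VL-8B-Thinking": (52.0, 55.9),
    "Qwen2.5-VL-7B-Instruct": (33.8, 37.9),
}


def summarize(scores, human):
    """Return absolute gain and remaining gap to human for each model."""
    rows = {}
    for model, (base, boosted) in scores.items():
        rows[model] = {
            "gain": round(boosted - base, 1),          # points added by MetaphorBoost
            "gap_to_human": round(human - boosted, 1),  # points still missing
        }
    return rows


summary = summarize(SCORES, HUMAN)
for model, row in summary.items():
    print(f"{model}: +{row['gain']} points, {row['gap_to_human']} below human")
```

The absolute gains are 2.3, 3.9, and 4.1 points respectively, so the weaker backbones improve more, while even the boosted Gemini-3-Pro remains 17.3 points below the human score.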
Key Points
- MetaphorVU-Bench organizes metaphorical video understanding into 8 taxonomy categories and includes 860 rigorously filtered and annotated real-world videos.
- Current MLLMs lag clearly behind human performance on metaphorical video interpretation, with the main weakness traced to cross-domain mapping rather than visual recognition alone.
- The proposed MetaphorBoost method uses a 54,687-node, 200,268-edge metaphorical knowledge graph for inference-time augmentation and delivers consistent improvements over multiple base models.
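To make the idea of inference-time augmentation with a knowledge graph concrete, here is a minimal sketch under stated assumptions. It is not the MetaphorBoost implementation: the edge format, the `symbolizes` relation, the toy entries, and the functions `build_index`, `retrieve`, and `augment_prompt` are all hypothetical, chosen only to illustrate how retrieved source-to-target mappings might be appended to a model prompt.

```python
from collections import defaultdict

# Toy (source element, relation, target concept) triples, invented for illustration.
# The real graph is reported to have 54,687 nodes and 200,268 edges.
EDGES = [
    ("wilting flower", "symbolizes", "fading youth"),
    ("wilting flower", "symbolizes", "neglect"),
    ("hourglass", "symbolizes", "limited time"),
    ("chain", "symbolizes", "constraint"),
]


def build_index(edges):
    """Group outgoing edges by source element for fast lookup."""
    index = defaultdict(list)
    for source, relation, target in edges:
        index[source].append((relation, target))
    return index


def retrieve(index, elements, k=2):
    """Return up to k candidate mappings per recognized visual element."""
    hints = []
    for element in elements:
        for relation, target in index.get(element, [])[:k]:
            hints.append(f"{element} {relation} {target}")
    return hints


def augment_prompt(question, hints):
    """Append retrieved mappings to the question as optional hints."""
    if not hints:
        return question
    lines = "\n".join(f"- {h}" for h in hints)
    return f"{question}\n\nCandidate metaphorical mappings (may not apply):\n{lines}"


index = build_index(EDGES)
hints = retrieve(index, ["hourglass", "umbrella"])  # "umbrella" has no entry
prompt = augment_prompt("What does this video imply?", hints)
print(prompt)
```

In this sketch the base model is left unchanged and only its input is enriched, which is what makes the approach applicable across different backbones. At the reported scale, the graph averages roughly 3.7 edges per node (200,268 / 54,687).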
References
- arXiv: https://arxiv.org/abs/2605.25461v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2605.25461v1
- Hugging Face Papers: https://huggingface.co/papers/2605.25461
- GitHub: https://github.com/icip-cas/MetaphorVU
- Hugging Face: https://huggingface.co/datasets/lzq2021/MetaphorVU-Bench