V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs
- URL: http://arxiv.org/abs/2509.25773v1
- Date: Tue, 30 Sep 2025 04:33:52 GMT
- Title: V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs
- Authors: Zhengpeng Shi, Hengli Li, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui, Wei Bi, Song-Chun Zhu, Bo Zhao, Zilong Zheng
- Abstract summary: v-HUB is a visual-centric video humor understanding benchmark. Each video clip is paired with rich annotations, including captions, descriptions, and explanations. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process audio.
- Score: 72.59885036868499
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel visual-centric video humor understanding benchmark. v-HUB comprises a curated collection of minimally verbal short videos, sourced from classic silent films and online resources, and reflecting real-world scenarios where humor can be appreciated purely through visual cues. Each video clip is paired with rich annotations, including captions, descriptions, and explanations, supporting evaluation tasks like caption matching and humor explanation. To broaden its applicability, we further construct an open-ended video QA task, making it readily integrable into existing video understanding benchmarks. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. For example, all models exhibit a marked performance drop on caption matching when moving from text-based to video-based evaluation (without audio). Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the informativeness of sound and the promise of integrating richer modalities for complex video understanding tasks.
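The caption-matching task described above lends itself to a simple scripted harness. Below is a minimal sketch, assuming each benchmark item carries a gold caption plus distractors and that `query_mllm` wraps whichever model is under test; these names are illustrative and not from the paper's release.

```python
# Hypothetical harness for v-HUB-style caption matching: the model sees
# either the clip (video-based) or a textual description (text-based)
# and must pick the gold caption out of shuffled candidates.
import random

def build_prompt(candidates):
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(candidates))
    return f"Which caption matches this humorous clip? Answer with one letter.\n{options}"

def caption_matching_accuracy(items, query_mllm, use_video=True):
    """items: dicts with 'video_path', 'description', 'caption', 'distractors'."""
    correct = 0
    for item in items:
        candidates = [item["caption"], *item["distractors"]]
        random.shuffle(candidates)
        gold = chr(65 + candidates.index(item["caption"]))
        context = item["video_path"] if use_video else item["description"]
        answer = query_mllm(context, build_prompt(candidates))  # e.g. "B"
        correct += answer.strip().upper().startswith(gold)
    return correct / len(items)
```

Running this once with `use_video=True` and once with `use_video=False` yields the kind of text-to-video comparison the abstract reports.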
Related papers
- UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models [35.952441992916235]
We introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. We design unified visual-language guided alignment to flexibly handle video understanding across global, pixel, and temporal scales within a single model. We construct UFVideo-Bench, consisting of three distinct collaborative tasks across these scales, which demonstrates UFVideo's flexibility and advantages over GPT-4o.
arXiv Detail & Related papers (2025-12-12T07:17:42Z)
- OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs [72.425061028374]
We introduce OmniVideoBench, a benchmark dedicated to assessing synergistic audio-visual understanding. OmniVideoBench comprises 1,000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.
arXiv Detail & Related papers (2025-10-12T16:34:00Z)
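OmniVideoBench's pairing of QA items with step-by-step reasoning traces suggests a record layout along the following lines. This is a guessed schema for illustration only; the field names are not the released format.

```python
# Guessed data shape for OmniVideoBench-style items; each QA pair
# carries a trace of modality-grounded reasoning steps.
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    modality: str   # "audio", "visual", or "audio-visual" (assumed labels)
    evidence: str   # the cue this step grounds on, e.g. a timestamped event
    inference: str  # what the step concludes from that cue

@dataclass
class AVQAPair:
    video_id: str
    question: str
    answer: str
    trace: list[ReasoningStep] = field(default_factory=list)
```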
- MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models [56.49314029765706]
We introduce MESH, a benchmark designed to systematically evaluate hallucinations in LVMs. MESH uses a question-answering framework with binary and multiple-choice formats that incorporate target and trap instances. We demonstrate that MESH offers an effective and comprehensive approach to identifying hallucinations in videos.
arXiv Detail & Related papers (2025-09-10T12:34:07Z)
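MESH's pairing of target instances (truly present in the video) with trap instances (plausible but absent) implies a natural scoring rule for the binary format: credit the model when it accepts targets and rejects traps. The sketch below is one way to compute that under assumed field names; MESH's official metrics may differ.

```python
# Illustrative scorer for binary target/trap QA in the spirit of MESH.
def binary_hallucination_score(model_answers, instances):
    """model_answers: {instance_id: raw answer string};
    instances: [{'id': ..., 'question': ..., 'is_target': bool}, ...]"""
    hits = 0
    for inst in instances:
        said_yes = model_answers[inst["id"]].strip().lower().startswith("yes")
        hits += said_yes == inst["is_target"]  # traps must be rejected
    return hits / len(instances)
```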
- SiLVR: A Simple Language-based Video Reasoning Framework [71.77141065418238]
We present SiLVR, a Simple Language-based Video Reasoning framework. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks.
arXiv Detail & Related papers (2025-05-30T17:59:19Z)
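SiLVR's two-stage recipe is easy to picture in code: verbalize the video first, then reason over the text. A minimal sketch, assuming placeholder components `caption_clips`, `transcribe_audio`, and `reasoning_llm` that are not part of the paper's release:

```python
# Two-stage, SiLVR-style inference: multisensory inputs are verbalized,
# then a text-only reasoning LLM answers the question.
def silvr_style_answer(video_path, question,
                       caption_clips, transcribe_audio, reasoning_llm):
    # Stage 1: language-based representation of the video.
    clip_captions = caption_clips(video_path)   # e.g. ["[0-4s] a man slips", ...]
    transcript = transcribe_audio(video_path)   # ASR output; may be empty
    context = "\n".join(clip_captions) + f"\nTranscript: {transcript}"
    # Stage 2: pure-text reasoning over the verbalized video.
    prompt = f"Video description:\n{context}\n\nQuestion: {question}"
    return reasoning_llm(prompt)
```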
- CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval [24.203328970223527]
We present CaReBench, a testing benchmark for fine-grained video captioning and retrieval. Uniquely, it provides manually separated spatial and temporal annotations for each video. Based on this design, we introduce two evaluation metrics, ReBias and CapST, specifically tailored for video retrieval and video captioning tasks.
arXiv Detail & Related papers (2024-12-31T15:53:50Z)
- VidComposition: Can MLLMs Analyze Compositions in Compiled Videos? [51.15549963453873]
VidComposition is a benchmark to evaluate the video composition understanding capabilities of Multimodal Large Language Models (MLLMs). It includes 982 videos with 1,706 multiple-choice questions covering compositional aspects such as camera movement, angle, shot size, narrative structure, and character actions and emotions. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities.
arXiv Detail & Related papers (2024-11-17T06:23:46Z)
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs). We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework based on synthetic video generation. We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
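VideoNIAH's construction idea, splicing a synthetic "needle" into an unrelated "haystack" video and then asking about it, can be sketched at the frame level. The helper below is illustrative only and does not reproduce the framework's actual generation pipeline.

```python
# Schematic needle insertion: splice needle frames into a haystack video
# at a random offset and record where they landed, so QA about the
# needle's content or position can be generated afterwards.
import random

def insert_needle(haystack_frames, needle_frames, seed=0):
    """Returns (spliced_frames, relative_position_in_[0,1])."""
    rng = random.Random(seed)
    pos = rng.randrange(len(haystack_frames) + 1)
    spliced = haystack_frames[:pos] + needle_frames + haystack_frames[pos:]
    return spliced, pos / (len(haystack_frames) + 1)
```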
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset on persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
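Because verbalization reduces the video to text, the persuasion-strategy task the authors release can then be posed as zero-shot text classification. The label names and prompt below are invented for illustration and are not taken from the dataset.

```python
# Hypothetical zero-shot classification over a verbalized video "story".
CANDIDATE_STRATEGIES = ["scarcity", "authority", "social proof",
                        "emotional appeal"]  # invented label set

def classify_persuasion(story, text_llm):
    prompt = (
        f"Story generated from a video:\n{story}\n\n"
        "Which persuasion strategy does this video use? "
        f"Answer with one of: {', '.join(CANDIDATE_STRATEGIES)}."
    )
    return text_llm(prompt)  # zero-shot: no task-specific fine-tuning
```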
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.