EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs
- URL: http://arxiv.org/abs/2511.18448v1
- Date: Sun, 23 Nov 2025 13:39:01 GMT
- Title: EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs
- Authors: Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xiangyang Ji
- Abstract summary: EventBench is a benchmark that offers eight diverse task metrics together with a large-scale event stream dataset. We evaluate state-of-the-art closed-source models such as GPT-5 and Gemini-2.5 Pro, leading open-source models including Qwen2.5-VL and InternVL3, and event-based MLLMs such as EventGPT.
- Score: 53.41154446399572
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) have made significant advancements in event-based vision, yet the comprehensive evaluation of their capabilities within a unified benchmark remains largely unexplored. In this work, we introduce EventBench, a benchmark that offers eight diverse task metrics together with a large-scale event stream dataset. EventBench differs from existing event-based benchmarks in four key aspects: (1) openness in accessibility, releasing all raw event streams and task instructions across eight evaluation metrics; (2) diversity in task coverage, spanning understanding, recognition, and spatial reasoning tasks for comprehensive capability assessment; (3) integration in spatial dimensions, pioneering the design of 3D spatial reasoning tasks for event-based MLLMs; and (4) scale in data volume, with an accompanying training set of over one million event-text pairs supporting large-scale training and evaluation. Using EventBench, we evaluate state-of-the-art closed-source models such as GPT-5 and Gemini-2.5 Pro, leading open-source models including Qwen2.5-VL and InternVL3, and event-based MLLMs such as EventGPT that directly process raw event streams. Extensive evaluation reveals that while current event-based MLLMs demonstrate strong performance in event stream understanding, they continue to struggle with fine-grained recognition and spatial reasoning.
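The abstract describes evaluating models across eight task metrics. As a purely illustrative sketch (the function, task names, and the use of plain accuracy are assumptions; the actual benchmark defines its own eight task-specific metrics), a per-task evaluation loop over model predictions might look like:

```python
# Hypothetical per-task scoring loop for an EventBench-style evaluation.
# Task names and the accuracy metric here are illustrative assumptions,
# not the benchmark's actual metric definitions.

def per_task_accuracy(results):
    """results: iterable of (task, prediction, ground_truth) triples."""
    totals, correct = {}, {}
    for task, pred, gt in results:
        totals[task] = totals.get(task, 0) + 1
        correct[task] = correct.get(task, 0) + (pred == gt)
    return {task: correct[task] / totals[task] for task in totals}

results = [
    ("understanding", "a", "a"),
    ("understanding", "b", "c"),
    ("recognition", "cat", "cat"),
    ("spatial_reasoning", "left", "left"),
]
scores = per_task_accuracy(results)
print(scores)  # {'understanding': 0.5, 'recognition': 1.0, 'spatial_reasoning': 1.0}
```

Reporting one score per task, rather than a single pooled number, is what lets the paper separate strong event-stream understanding from weaker fine-grained recognition and spatial reasoning.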
Related papers
- SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z) - LLM-EvRep: Learning an LLM-Compatible Event Representation Using a Self-Supervised Framework [11.30784253260618]
Large language models (LLMs) have exhibited remarkable zero-shot capabilities across diverse domains. We propose LLM-EvGen, an event representation generator that produces the event representation LLM-EvRep. Comprehensive experiments were conducted on three datasets: N-ImageNet, N-Caltech101, and N-MNIST.
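LLM-EvRep is a learned representation, but the general idea of mapping raw events into a frame-like tensor can be sketched with a much simpler, standard baseline: a two-channel polarity histogram. This is an illustrative assumption, not the paper's method; field names and shapes are hypothetical.

```python
import numpy as np

# A common frame-like event representation (NOT the paper's LLM-EvRep):
# accumulate raw events into a 2-channel per-polarity count image that an
# image-based model can consume. Event fields (x, y, polarity) are assumed.

def event_histogram(events, height, width):
    """events: iterable of (x, y, polarity) with polarity in {0, 1}."""
    hist = np.zeros((2, height, width), dtype=np.float32)
    for x, y, p in events:
        hist[int(p), int(y), int(x)] += 1.0
    return hist

events = [(0, 0, 1), (0, 0, 1), (1, 1, 0)]
h = event_histogram(events, height=2, width=2)
print(h[1, 0, 0])  # 2.0: two positive-polarity events at pixel (0, 0)
```

Learned generators such as LLM-EvGen replace this fixed accumulation with a trained network, but the input/output contract (sparse events in, dense tensor out) is the same.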
arXiv Detail & Related papers (2025-02-20T05:18:36Z) - From Objects to Events: Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning [71.41062111470414]
Current object detectors excel at entity localization and classification, yet exhibit inherent limitations in event recognition capabilities. We present a novel framework that expands the capability of standard object detectors beyond mere object recognition to complex event understanding. Our key innovation lies in bridging the semantic gap between object detection and event understanding without requiring expensive task-specific training.
arXiv Detail & Related papers (2025-02-09T10:30:54Z) - EventVL: Understand Event Streams via Multimodal Large Language Model [29.23525787969373]
We propose EventVL, the first generative event-based MLLM framework for explicit semantic understanding. Specifically, to bridge the data gap in connecting the semantics of different modalities, we first annotate a large event-image/video-text dataset. To further promote a compact semantic space, Dynamic Semantic Alignment is introduced to improve and complete the sparse semantic spaces of events.
arXiv Detail & Related papers (2025-01-23T14:37:21Z) - Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z) - Towards Event-oriented Long Video Understanding [101.48089908037888]
Event-Bench is an event-oriented long video understanding benchmark built on existing datasets and human annotations.
VIM is a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions.
arXiv Detail & Related papers (2024-06-20T09:14:19Z) - EvEval: A Comprehensive Evaluation of Event Semantics for Large Language Models [31.704144542866636]
Events serve as fundamental units of occurrence within various contexts.
Recent studies have begun leveraging large language models (LLMs) to address event semantic processing.
We propose an overarching framework for event semantic processing, encompassing understanding, reasoning, and prediction.
arXiv Detail & Related papers (2023-05-24T15:55:40Z) - PILED: An Identify-and-Localize Framework for Few-Shot Event Detection [79.66042333016478]
In our study, we employ cloze prompts to elicit event-related knowledge from pretrained language models.
We minimize the number of type-specific parameters, enabling our model to quickly adapt to event detection tasks for new types.
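The cloze-prompt idea mentioned above can be illustrated with a toy template: a sentence is wrapped in a prompt whose mask slot a pretrained masked language model would fill with an event-type word. The template text and function name below are hypothetical, not PILED's actual prompts.

```python
# Hypothetical cloze-prompt template in the spirit of prompt-based event
# detection: a masked LM would fill the mask slot with an event-type word
# (e.g. "movement", "attack"). The wording here is an illustrative assumption.

def make_cloze_prompt(sentence, mask_token="[MASK]"):
    return f"{sentence} This sentence describes a {mask_token} event."

prompt = make_cloze_prompt("The troops entered the city at dawn.")
print(prompt)
# The troops entered the city at dawn. This sentence describes a [MASK] event.
```

Because only the template (and a small per-type verbalizer) changes between event types, few type-specific parameters are needed, which is what enables fast adaptation to new types.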
arXiv Detail & Related papers (2022-02-15T18:01:39Z) - Learning Constraints and Descriptive Segmentation for Subevent Detection [74.48201657623218]
We propose an approach to learning and enforcing constraints that capture dependencies between subevent detection and EventSeg prediction.
We adopt Rectifier Networks for constraint learning and then convert the learned constraints to a regularization term in the loss function of the neural model.
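Converting a learned constraint into a loss regularizer can be sketched as follows. This is an illustrative simplification, not the paper's exact formulation: a single learned linear constraint w·z + b <= 0 over some feature vector z becomes a ReLU penalty added to the task loss, so violations are penalized during training. All variable names and values are assumptions.

```python
import numpy as np

# Illustrative sketch of constraint-to-regularizer conversion (assumed
# formulation, not the paper's exact one): a learned linear constraint
# w·z + b <= 0 becomes the penalty ReLU(w·z + b), zero when satisfied.

def constraint_penalty(z, w, b):
    """ReLU(w.z + b): zero when the constraint holds, positive otherwise."""
    return max(0.0, float(np.dot(w, z) + b))

def total_loss(task_loss, z, w, b, lam=0.1):
    """Task loss plus weighted constraint-violation penalty."""
    return task_loss + lam * constraint_penalty(z, w, b)

w, b = np.array([0.5, 0.5]), 0.2
z_ok = np.array([1.0, -2.0])   # w.z + b = -0.3 -> satisfied, no penalty
z_bad = np.array([3.0, 1.0])   # w.z + b =  2.2 -> violated, penalized
print(total_loss(0.8, z_ok, w, b))   # 0.8
print(total_loss(0.8, z_bad, w, b))  # 1.02
```

The regularization weight `lam` trades off fitting the task against respecting the learned constraint, mirroring how the converted constraints enter the neural model's loss.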
arXiv Detail & Related papers (2021-09-13T20:50:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.