OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding
- URL: http://arxiv.org/abs/2506.18372v1
- Date: Mon, 23 Jun 2025 07:57:38 GMT
- Title: OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding
- Authors: Hieu Nguyen, Phuc-Tan Nguyen, Thien-Phuc Tran, Minh-Quang Nguyen, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
- Abstract summary: OpenEvents V1 is a large-scale benchmark dataset aimed at advancing event-centric vision-language understanding. The dataset contains over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian.
- Score: 15.044907078726803
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce OpenEvents V1, a large-scale benchmark dataset aimed at advancing event-centric vision-language understanding. Unlike conventional image captioning and retrieval datasets that emphasize surface-level descriptions, OpenEvents V1 focuses on contextual and temporal grounding through two primary tasks: (1) generating rich, event-aware image captions and (2) retrieving event-relevant images based on narrative-style textual queries. The dataset contains over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian, spanning diverse domains and time periods. We provide extensive baseline results and standardized evaluation protocols for both tasks. OpenEvents V1 establishes a robust foundation for developing multimodal models capable of deep reasoning over complex real-world events. The dataset is available at https://ltnghia.github.io/eventa/openevents-v1
Related papers
- EventVL: Understand Event Streams via Multimodal Large Language Model [18.57504605615107]
We propose EventVL, the first generative event-based MLLM framework for explicit semantic understanding. Specifically, to bridge the data gap in connecting the semantics of different modalities, we first annotate a large event-image/video-text dataset. To further promote a compact semantic space, Dynamic Semantic Alignment is introduced to improve and complete sparse semantic spaces of events.
arXiv Detail & Related papers (2025-01-23T14:37:21Z)
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
- CEIA: CLIP-Based Event-Image Alignment for Open-World Event-Based Understanding [52.67839570524888]
We present CEIA, an effective framework for open-world event-based understanding.
We leverage the rich event-image datasets to learn an event embedding space aligned with the image space of CLIP.
CEIA offers two distinct advantages. First, it allows us to take full advantage of the existing event-image datasets to make up for the shortage of large-scale event-text datasets.
arXiv Detail & Related papers (2024-07-09T07:26:15Z)
- GenEARL: A Training-Free Generative Framework for Multimodal Event Argument Role Labeling [89.07386210297373]
GenEARL is a training-free generative framework that harnesses the power of modern generative models to understand event task descriptions.
We show that GenEARL outperforms the contrastive pretraining (CLIP) baseline by 9.4% and 14.2% accuracy for zero-shot EARL on the M2E2 and SwiG datasets, respectively.
arXiv Detail & Related papers (2024-04-07T00:28:13Z)
- GET: Group Event Transformer for Event-Based Vision [82.312736707534]
Event cameras are a type of novel neuromorphic sensor that has been gaining increasing attention.
We propose a novel Group-based vision Transformer backbone for Event-based vision, called Group Event Transformer (GET).
GET decouples temporal-polarity information from spatial information throughout the feature extraction process.
arXiv Detail & Related papers (2023-10-04T08:02:33Z)
- EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding [7.797154022794006]
EventBind is a novel framework that unleashes the potential of vision-language models (VLMs) for event-based recognition.
We first introduce a novel event encoder that subtly models the temporal information from events.
We then design a text encoder that generates content prompts and utilizes hybrid text prompts to enhance EventBind's generalization ability.
arXiv Detail & Related papers (2023-08-06T15:05:42Z)
- Title2Event: Benchmarking Open Event Extraction with a Large-scale Chinese Title Dataset [19.634367718707857]
We present Title2Event, a large-scale sentence-level dataset benchmarking Open Event Extraction without restricting event types.
Title2Event contains more than 42,000 news titles in 34 topics collected from Chinese web pages.
To the best of our knowledge, it is currently the largest manually-annotated Chinese dataset for open event extraction.
arXiv Detail & Related papers (2022-11-02T04:39:36Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities [43.048896440009784]
We propose the task of extracting event hierarchies from multimodal (video and text) data.
This reveals the structure of events and is critical to understanding them.
We show the limitations of state-of-the-art unimodal and multimodal baselines on this task.
arXiv Detail & Related papers (2022-06-14T23:24:15Z)
- CLIP-Event: Connecting Text and Images with Event Structures [123.31452120399827]
We propose a contrastive learning framework to enforce vision-language pretraining models to comprehend events and associated argument (participant) roles.
We take advantage of text information extraction technologies to obtain event structural knowledge.
Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction.
arXiv Detail & Related papers (2022-01-13T17:03:57Z)