Related papers: Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities

URL: http://arxiv.org/abs/2206.07207v3
Date: Wed, 20 Dec 2023 03:22:02 GMT
Title: Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities
Authors: Hammad A. Ayyubi, Christopher Thomas, Lovish Chum, Rahul Lokesh, Long Chen, Yulei Niu, Xudong Lin, Xuande Feng, Jaywon Koo, Sounak Ray and Shih-Fu Chang
Abstract summary: We propose the task of extracting event hierarchies from multimodal (video and text) data. This reveals the structure of events and is critical to understanding them. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task.
Score: 43.048896440009784
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Events describe happenings in our world that are of importance. Naturally, understanding events mentioned in multimedia content and how they are related forms an important way of comprehending our world. Existing literature can infer if events across textual and visual (video) domains are identical (via grounding) and thus, on the same semantic level. However, grounding fails to capture the intricate cross-event relations that exist due to the same events being referred to on many semantic levels. For example, in Figure 1, the abstract event of "war" manifests at a lower semantic level through subevents "tanks firing" (in video) and airplane "shot" (in text), leading to a hierarchical, multimodal relationship between the events. In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in different modalities at different semantic levels. This reveals the structure of events and is critical to understanding them. To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset. Unlike prior video-language datasets, MultiHiEve is composed of news video-article pairs, which makes it rich in event hierarchies. We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model, leveraging only unannotated video-article pairs from MultiHiEve. We perform a thorough evaluation of our proposed method which demonstrates improved performance on this task and highlight opportunities for future research.

Related papers

EventVL: Understand Event Streams via Multimodal Large Language Model [18.57504605615107]
We propose EventVL, the first generative event-based MLLM framework for explicit semantic understanding. Specifically, to bridge the data gap for connecting different modalities semantics, we first annotate a large event-image/video-text dataset. To further promote a compact semantic space, Dynamic Semantic Alignment is introduced to improve and complete sparse semantic spaces of events.
arXiv Detail & Related papers (2025-01-23T14:37:21Z)
Enhancing Event Extraction from Short Stories through Contextualized Prompts [2.7670701972493568]
This paper presents Vrittanta-EN, a collection of 1000 English short stories annotated for real events. Our objective is to clarify the intricate idea of events in the context of short stories. We present fresh guidelines for annotating event mentions and their categories, organized into seven distinct classes.
arXiv Detail & Related papers (2024-12-14T08:28:52Z)
Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task. We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities. Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
Generating Event-oriented Attribution for Movies via Two-Stage Prefix-Enhanced Multimodal LLM [47.786978666537436]
We propose a Two-Stage Prefix-Enhanced MLLM (TSPE) approach for event attribution in movie videos. In the local stage, we introduce an interaction-aware prefix that guides the model to focus on the relevant multimodal information within a single clip. In the global stage, we strengthen the connections between associated events using an inferential knowledge graph.
arXiv Detail & Related papers (2024-09-14T08:30:59Z)
SPOT! Revisiting Video-Language Models for Event Understanding [31.49859545456809]
We introduce SPOT Prober to benchmark existing video-language models' capacity to distinguish event-level discrepancies. We evaluate the existing video-language models with these positive and negative captions and find they fail to distinguish most of the manipulated events. Based on our findings, we propose to plug in these manipulated event captions as hard negative samples and find them effective in enhancing models for event understanding.
arXiv Detail & Related papers (2023-11-21T18:43:07Z)
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events. Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments. Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization [8.530561069113716]
We propose a novel video-level semantic consistency guidance network for the AVE localization task. It consists of two components: a cross-modal event representation extractor and an intra-modal semantic consistency enhancer. We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-10-11T08:15:57Z)
Unifying Event Detection and Captioning as Sequence Generation via Pre-Training [53.613265415703815]
We propose a unified pre-training and fine-tuning framework to enhance the inter-task association between event detection and captioning. Our model outperforms the state-of-the-art methods, and can be further boosted when pre-trained on extra large-scale video-text data.
arXiv Detail & Related papers (2022-07-18T14:18:13Z)
Joint Multimedia Event Extraction from Video and Article [51.159034070824056]
We propose the first approach to jointly extract events from video and text articles. First, we propose the first self-supervised multimodal event coreference model. Second, we introduce the first multimodal transformer which extracts structured event information jointly from both videos and text documents.
arXiv Detail & Related papers (2021-09-27T03:22:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.