Joint Multimedia Event Extraction from Video and Article
- URL: http://arxiv.org/abs/2109.12776v1
- Date: Mon, 27 Sep 2021 03:22:12 GMT
- Title: Joint Multimedia Event Extraction from Video and Article
- Authors: Brian Chen, Xudong Lin, Christopher Thomas, Manling Li, Shoya Yoshida,
Lovish Chum, Heng Ji, and Shih-Fu Chang
- Abstract summary: We propose the first approach to jointly extract events from video and text articles.
First, we propose the first self-supervised multimodal event coreference model.
Second, we introduce the first multimodal transformer which extracts structured event information jointly from both videos and text documents.
- Score: 51.159034070824056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual and textual modalities contribute complementary information about
events described in multimedia documents. Videos contain rich dynamics and
detailed unfoldings of events, while text describes more high-level and
abstract concepts. However, existing event extraction methods either do not
handle video or solely target video while ignoring other modalities. In
contrast, we propose the first approach to jointly extract events from video
and text articles. We introduce the new task of Video MultiMedia Event
Extraction (Video M2E2) and propose two novel components to build the first
system towards this task. First, we propose the first self-supervised
multimodal event coreference model that can determine coreference between video
events and text events without any manually annotated pairs. Second, we
introduce the first multimodal transformer which extracts structured event
information jointly from both videos and text documents. We also construct and
will publicly release a new benchmark consisting of 860 video-article pairs with
extensive annotations for evaluating methods on this task. Our experimental
results demonstrate the effectiveness of our proposed method on this new
benchmark: we achieve absolute F-score gains of 6.0% on multimodal event
coreference resolution and 5.8% on multimedia event extraction.
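
To make the coreference component concrete, the sketch below shows one plausible way to score coreference between a video event segment and a text event mention without manually annotated pairs: both modalities are projected into a shared embedding space and trained with an InfoNCE-style contrastive loss over naturally co-occurring video-article pairs. The class name, feature dimensions, and loss choice are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch (not the paper's released code): a minimal self-supervised
# multimodal coreference scorer. Segments and sentences from the same document
# act as positives; everything else in the batch acts as negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalCorefScorer(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, joint_dim=256):
        super().__init__()
        # Project each modality into a shared space where cosine similarity
        # can be read as a coreference score.
        self.video_proj = nn.Linear(video_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.temperature = 0.07

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)  # (B, D)
        t = F.normalize(self.text_proj(text_feats), dim=-1)    # (B, D)
        return v @ t.T / self.temperature                      # (B, B) similarities

    def contrastive_loss(self, video_feats, text_feats):
        # Self-supervision from document co-occurrence: the i-th video event
        # is paired with the i-th text event, no coreference labels needed.
        logits = self.forward(video_feats, text_feats)
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2


if __name__ == "__main__":
    scorer = MultimodalCorefScorer()
    video_feats = torch.randn(4, 2048)  # e.g. pooled visual features per video event
    text_feats = torch.randn(4, 768)    # e.g. pooled encoder features per text event
    loss = scorer.contrastive_loss(video_feats, text_feats)
    scores = scorer(video_feats, text_feats).diag()  # per-pair coreference scores
    print(loss.item(), scores.tolist())
```

At inference time, the diagonal (or any candidate pair's similarity) can be thresholded to decide whether a video event and a text event corefer; the threshold and feature extractors are design choices left open here.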
Related papers
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
- Towards Event-oriented Long Video Understanding [101.48089908037888]
Event-Bench is an event-oriented long video understanding benchmark built on existing datasets and human annotations.
VIM is a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions.
arXiv Detail & Related papers (2024-06-20T09:14:19Z)
- Event-aware Video Corpus Moment Retrieval [79.48249428428802]
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos.
Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos.
We propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval.
arXiv Detail & Related papers (2024-02-21T06:55:20Z)
- Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Experimental results on an existing multi-modal video summarization dataset show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z)
- Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities [43.048896440009784]
We propose the task of extracting event hierarchies from multimodal (video and text) data.
This reveals the structure of events and is critical to understanding them.
We show the limitations of state-of-the-art unimodal and multimodal baselines on this task.
arXiv Detail & Related papers (2022-06-14T23:24:15Z)
- Towards Diverse Paragraph Captioning for Untrimmed Videos [40.205433926432434]
Existing approaches mainly solve the problem in two steps: event detection and then event captioning.
We propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos.
arXiv Detail & Related papers (2021-05-30T09:28:43Z)
- GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization [18.543372365239673]
The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator.
Results show that the proposed model is effective, with a +5.88% increase in accuracy and a +4.06% increase in F1-score compared with the state-of-the-art method.
arXiv Detail & Related papers (2021-04-26T10:50:37Z)
- Cross-media Structured Common Space for Multimedia Event Extraction [82.36301617438268]
We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents.
We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information into a common embedding space.
By utilizing images, we extract 21.4% more event mentions than traditional text-only methods.
arXiv Detail & Related papers (2020-05-05T20:21:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.