Cross-media Structured Common Space for Multimedia Event Extraction
- URL: http://arxiv.org/abs/2005.02472v1
- Date: Tue, 5 May 2020 20:21:53 GMT
- Title: Cross-media Structured Common Space for Multimedia Event Extraction
- Authors: Manling Li, Alireza Zareian, Qi Zeng, Spencer Whitehead, Di Lu,
Heng Ji, Shih-Fu Chang
- Abstract summary: We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents.
We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information into a common embedding space.
By utilizing images, we extract 21.4% more event mentions than traditional text-only methods.
- Score: 82.36301617438268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to
extract events and their arguments from multimedia documents. We develop the
first benchmark and collect a dataset of 245 multimedia news articles with
extensively annotated events and arguments. We propose a novel method, Weakly
Aligned Structured Embedding (WASE), that encodes structured representations of
semantic information from textual and visual data into a common embedding
space. The structures are aligned across modalities by employing a weakly
supervised training strategy, which enables exploiting available resources
without explicit cross-media annotation. Compared to uni-modal state-of-the-art
methods, our approach achieves 4.0% and 9.8% absolute F-score gains on text
event argument role labeling and visual event extraction. Compared to
state-of-the-art multimedia unstructured representations, we achieve 8.3% and
5.0% absolute F-score gains on multimedia event extraction and argument role
labeling, respectively. By utilizing images, we extract 21.4% more event
mentions than traditional text-only methods.
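To make the core idea above concrete, the following is an illustrative sketch (not the authors' implementation) of weakly supervised cross-media alignment: structured textual and visual features are projected into a common embedding space and pulled together for document-level (caption, image) pairs via a contrastive loss, with no fine-grained cross-media annotation. The encoder internals, dimensions, and names such as CommonSpaceProjector and weak_alignment_loss are assumptions; in the paper the features come from structured encoders over textual and visual semantic information, which the sketch abstracts into pooled tensors.

```python
# Minimal sketch of weakly supervised common-space alignment (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceProjector(nn.Module):
    """Projects pre-computed text and image features into a shared space."""
    def __init__(self, text_dim, image_dim, common_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, common_dim)
        self.image_proj = nn.Linear(image_dim, common_dim)

    def forward(self, text_feats, image_feats):
        # L2-normalized embeddings in the common space
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.image_proj(image_feats), dim=-1)
        return t, v

def weak_alignment_loss(t, v, temperature=0.07):
    """InfoNCE-style loss over a batch of weakly aligned (caption, image) pairs:
    the i-th caption is treated as aligned with the i-th image only at the
    document level, without explicit cross-media annotation."""
    logits = t @ v.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Usage with random stand-in features for a batch of 8 weakly paired documents
projector = CommonSpaceProjector(text_dim=768, image_dim=2048)
text_feats = torch.randn(8, 768)     # e.g., pooled structured sentence features
image_feats = torch.randn(8, 2048)   # e.g., pooled object/role features per image
t, v = projector(text_feats, image_feats)
loss = weak_alignment_loss(t, v)
loss.backward()
```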
Related papers
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate the promise of event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
- MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling [4.160176518973659]
We introduce a unified template filling model that connects the textual and visual modalities via textual prompts.
Our system surpasses the current SOTA on textual EAE by +7% F1, and performs generally better than the second-best systems for multimedia EAE.
arXiv Detail & Related papers (2024-06-18T09:14:17Z)
- Multimodal Chaptering for Long-Form TV Newscast Video [0.0]
Our method integrates both audio and visual cues through a two-stage process involving frozen neural networks and a trained LSTM network.
Our proposed model has been evaluated on a diverse dataset of over 500 TV newscast videos, averaging 41 minutes each, gathered from TF1, a French TV channel.
Experimental results demonstrate that this fusion strategy achieves state-of-the-art performance, yielding a precision of 82% at an IoU threshold of 90%.
arXiv Detail & Related papers (2024-03-20T08:39:41Z)
- Training Multimedia Event Extraction With Generated Images and Captions [6.291564630983316]
We propose Cross-modality Augmented Multimedia Event Learning (CAMEL).
We start with two labeled unimodal datasets, one textual and one visual, and generate the missing modality using off-the-shelf image generators such as Stable Diffusion and image captioners such as BLIP; an illustrative code sketch of this augmentation step appears after the related-papers list below.
In order to learn robust features that are effective across domains, we devise an iterative and gradual training strategy.
arXiv Detail & Related papers (2023-06-15T09:01:33Z)
- Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment [80.18786847090522]
We propose a Semantics-Consistent Cross-domain Summarization model based on optimal transport alignment with visual and textual segmentation.
We evaluate our method on three recent multimodal datasets and demonstrate its effectiveness in producing high-quality multimodal summaries.
arXiv Detail & Related papers (2022-10-10T14:27:10Z)
- M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval [34.343617836027725]
We propose a multi-level multi-modal hybrid fusion network to explore comprehensive interactions between text queries and the content of each modality in videos.
Our framework supports two training strategies: an ensemble manner and an end-to-end manner.
arXiv Detail & Related papers (2022-08-16T10:51:37Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework for visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end to end, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- CLIP-Event: Connecting Text and Images with Event Structures [123.31452120399827]
We propose a contrastive learning framework that enforces event understanding in vision-language pretraining models.
We take advantage of text information extraction technologies to obtain event structural knowledge.
Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction.
arXiv Detail & Related papers (2022-01-13T17:03:57Z)
- Joint Multimedia Event Extraction from Video and Article [51.159034070824056]
We propose the first approach to jointly extract events from video and text articles.
First, we propose the first self-supervised multimodal event coreference model.
Second, we introduce the first multimodal transformer which extracts structured event information jointly from both videos and text documents.
arXiv Detail & Related papers (2021-09-27T03:22:12Z)
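As referenced in the CAMEL entry above, the following is an illustrative sketch of the missing-modality augmentation step, assuming the off-the-shelf tools named in that abstract: Stable Diffusion via the diffusers library for text-to-image generation, and BLIP via transformers for image captioning. The checkpoint identifiers, example text, and file path are placeholders, and this is not the CAMEL authors' code.

```python
# Illustrative sketch of CAMEL-style missing-modality generation (not the
# authors' code): synthesize an image for a text-only event mention with
# Stable Diffusion, and a caption for an unlabeled image with BLIP.
# Checkpoint names, the example sentence, and the file path are placeholders.
import torch
from diffusers import StableDiffusionPipeline
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text -> image: give a text-only training example a synthetic visual counterpart
sd_pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)
text_mention = "Protesters clash with police outside the parliament building."
generated_image = sd_pipe(text_mention).images[0]   # PIL.Image

# Image -> text: give an image-only training example a synthetic caption
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)
image_only_example = Image.open("news_photo.jpg")   # placeholder path
inputs = blip_processor(images=image_only_example, return_tensors="pt").to(device)
caption_ids = blip_model.generate(**inputs, max_new_tokens=30)
generated_caption = blip_processor.decode(caption_ids[0], skip_special_tokens=True)

# The paired (text, generated_image) and (image, generated_caption) examples can
# then feed a multimedia event extraction model trained iteratively, as described.
```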