MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling
- URL: http://arxiv.org/abs/2406.12420v1
- Date: Tue, 18 Jun 2024 09:14:17 GMT
- Title: MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling
- Authors: Philipp Seeberger, Dominik Wagner, Korbinian Riedhammer
- Abstract summary: We introduce a unified template filling model that connects the textual and visual modalities via textual prompts.
Our system surpasses the current SOTA on textual EAE by +7% F1, and performs generally better than the second-best systems for multimedia EAE.
- Score: 4.160176518973659
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advancement of multimedia technologies, news documents and user-generated content are often represented as multiple modalities, making Multimedia Event Extraction (MEE) an increasingly important challenge. However, recent MEE methods employ weak alignment strategies and data augmentation with simple classification models, which ignore the capabilities of natural language-formulated event templates for the challenging Event Argument Extraction (EAE) task. In this work, we focus on EAE and address this issue by introducing a unified template filling model that connects the textual and visual modalities via textual prompts. This approach enables the exploitation of cross-ontology transfer and the incorporation of event-specific semantics. Experiments on the M2E2 benchmark demonstrate the effectiveness of our approach. Our system surpasses the current SOTA on textual EAE by +7% F1, and performs generally better than the second-best systems for multimedia EAE.
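The abstract only sketches the template-filling idea. As a toy illustration (not the paper's actual model), the sketch below shows what filling a natural-language event template with candidate argument spans could look like; the event type, template wording, role lexicon, and `score_fn` stand-in for the cross-modal encoder are all assumptions for demonstration.

```python
# Illustrative sketch of unified template filling for Event Argument
# Extraction (EAE): an event type is described by a natural-language
# template whose role slots are filled with candidate spans. The same
# prompt format could serve textual and visual inputs.

EVENT_TEMPLATES = {
    "Conflict.Attack": "<attacker> attacked <target> using <instrument>",
}

def build_prompt(event_type: str) -> str:
    """Turn an event template into a textual prompt with role placeholders."""
    return EVENT_TEMPLATES[event_type]

def fill_template(event_type, candidates, score_fn):
    """Assign each role slot the highest-scoring candidate span.

    `score_fn(role, candidate)` stands in for the model that scores a
    (prompt slot, span) pair in a real system.
    """
    template = build_prompt(event_type)
    roles = [tok[1:-1] for tok in template.split() if tok.startswith("<")]
    return {role: max(candidates, key=lambda c: score_fn(role, c))
            for role in roles}

# Toy scorer: pretend lexical overlap with a role lexicon is the model score.
LEXICON = {"attacker": {"soldiers"}, "target": {"convoy"}, "instrument": {"rockets"}}
score = lambda role, cand: 1.0 if cand in LEXICON[role] else 0.0

print(fill_template("Conflict.Attack", ["soldiers", "convoy", "rockets"], score))
```

In the real setting the scorer would be a learned encoder over text and image regions; the template mechanism itself is what enables cross-ontology transfer, since any event type reduces to the same fill-the-slots formulation.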
Related papers
- EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts [48.214475133206385]
EMMA is a novel image generation model accepting multi-modal prompts built upon the state-of-the-art text-to-image (T2I) diffusion model, ELLA.
By freezing all parameters in the original T2I diffusion model and only adjusting some additional layers, we reveal an interesting finding that the pre-trained T2I diffusion model can secretly accept multi-modal prompts.
arXiv Detail & Related papers (2024-06-13T14:26:43Z) - Meta-Task Prompting Elicits Embeddings from Large Language Models [54.757445048329735]
We introduce a new unsupervised text embedding method, Meta-Task Prompting with Explicit One-Word Limitation.
We generate high-quality sentence embeddings from Large Language Models without the need for model fine-tuning.
Our findings suggest a new scaling law, offering a versatile and resource-efficient approach for embedding generation across diverse scenarios.
arXiv Detail & Related papers (2024-02-28T16:35:52Z) - Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation [9.817888267356716]
Multimodal Emotion Recognition in Conversation (ERC) faces two problems.
Deep emotion cue extraction is performed on modalities with strong representation ability, while feature filters are designed as multimodal prompt information for modalities with weak representation ability.
MPT embeds multimodal fusion information into each attention layer of the Transformer.
arXiv Detail & Related papers (2023-10-04T13:54:46Z) - mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration and disentangles different modality modules to deal with modality entanglement.
It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z) - MHMS: Multimodal Hierarchical Multimedia Summarization [80.18786847090522]
We propose a multimodal hierarchical multimedia summarization (MHMS) framework by interacting visual and language domains.
Our method contains video and textual segmentation and summarization modules.
It formulates a cross-domain alignment objective with optimal transport distance to generate the representative visual and textual summaries.
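To make the optimal-transport alignment objective concrete, here is a minimal Sinkhorn sketch: entropic OT between video-segment and sentence embeddings under a cosine cost. The dimensions, uniform marginals, and cosine cost are illustrative assumptions, not MHMS's exact formulation.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iter=200):
    """Approximate entropic-OT plan between uniform marginals via Sinkhorn."""
    n, m = cost.shape
    K = np.exp(-cost / eps)                 # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)      # transport plan

def cosine_cost(X, Y):
    """Cost = 1 - cosine similarity between row vectors of X and Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 16))   # 4 video-segment embeddings (toy)
text = rng.normal(size=(3, 16))    # 3 sentence embeddings (toy)
plan = sinkhorn(cosine_cost(video, text))
print(plan.round(3))
```

The resulting plan is a soft matching between segments across domains; an alignment loss would sum `plan * cost` and backpropagate through the embeddings.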
arXiv Detail & Related papers (2022-04-07T21:00:40Z) - MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation [32.15124603618625]
We propose a new model based on multimodal fused graph convolutional network, MMGCN, in this work.
MMGCN can not only make use of multimodal dependencies effectively, but also leverage speaker information to model inter-speaker and intra-speaker dependency.
We evaluate our proposed model on two public benchmark datasets, IEMOCAP and MELD, and the results prove the effectiveness of MMGCN.
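The graph-convolution step at the heart of such a model can be sketched in a few lines. The adjacency (speaker links), feature sizes, and single-layer setup below are toy assumptions, not MMGCN's actual architecture.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN propagation step: ReLU(D^-1/2 (A+I) D^-1/2 X W).

    Nodes could be per-modality utterance features; edges encode
    inter-/intra-speaker dependencies.
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # symmetric normalization
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W)

rng = np.random.default_rng(1)
# 4 utterance nodes, edges between same-speaker / adjacent utterances (toy).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(4, 8))   # 8-dim node features
W = rng.normal(size=(8, 8))   # learnable weight (random here)
H = gcn_layer(A, X, W)
print(H.shape)
```

Stacking several such layers lets each utterance aggregate emotion cues from connected utterances across speakers and modalities.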
arXiv Detail & Related papers (2021-07-14T15:37:02Z) - VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles [63.32111010686954]
We propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO).
The main challenge in this task is to jointly model the temporal dependency of the video with the semantic meaning of the article.
We propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and a multimodal generator.
arXiv Detail & Related papers (2020-10-12T02:19:16Z) - Visual Semantic Multimedia Event Model for Complex Event Detection in Video Streams [5.53329677986653]
Middleware systems such as complex event processing (CEP) mine patterns from data streams and send notifications to users in a timely fashion.
We present a visual event specification method to enable complex structured event processing by creating a structured knowledge representation from low-level media streams.
arXiv Detail & Related papers (2020-09-30T09:22:23Z) - Cross-media Structured Common Space for Multimedia Event Extraction [82.36301617438268]
We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents.
We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information into a common embedding space.
By utilizing images, we extract 21.4% more event mentions than traditional text-only methods.
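The common embedding space idea can be illustrated with a small max-margin sketch: project text and image features into one space and require weakly aligned pairs to outscore mismatched ones. The projection matrices, dimensions, and exact loss form below are illustrative assumptions, not WASE's actual architecture.

```python
import numpy as np

def project(x, W):
    """Project a feature vector into the common space and L2-normalize it."""
    z = x @ W
    return z / np.linalg.norm(z)

def triplet_loss(text_vec, pos_img, neg_img, Wt, Wi, margin=0.2):
    """Margin loss: the aligned (text, image) pair should outscore a mismatch."""
    t = project(text_vec, Wt)
    p = project(pos_img, Wi)
    n = project(neg_img, Wi)
    return max(0.0, margin - float(t @ p) + float(t @ n))

rng = np.random.default_rng(0)
Wt = rng.normal(size=(12, 8))    # text-side projection (toy)
Wi = rng.normal(size=(16, 8))    # image-side projection (toy)
text_vec = rng.normal(size=12)
pos_img = rng.normal(size=16)    # weakly aligned image features
neg_img = rng.normal(size=16)    # mismatched image features
print(triplet_loss(text_vec, pos_img, neg_img, Wt, Wi))
```

Training the projections with such a loss over weakly aligned caption-image pairs is what lets structured text and image representations be compared directly in one space.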
arXiv Detail & Related papers (2020-05-05T20:21:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.