Training Multimedia Event Extraction With Generated Images and Captions
- URL: http://arxiv.org/abs/2306.08966v2
- Date: Fri, 11 Aug 2023 04:55:40 GMT
- Title: Training Multimedia Event Extraction With Generated Images and Captions
- Authors: Zilin Du, Yunxin Li, Xu Guo, Yidan Sun, Boyang Li
- Abstract summary: We propose Cross-modality Augmented Multimedia Event Learning (CAMEL).
We start with two labeled unimodal datasets in text and image respectively, and generate the missing modality using off-the-shelf image generators like Stable Diffusion and image captioners like BLIP.
In order to learn robust features that are effective across domains, we devise an iterative and gradual training strategy.
- Score: 6.291564630983316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contemporary news reporting increasingly features multimedia content,
motivating research on multimedia event extraction. However, the task lacks
annotated multimodal training data and artificially generated training data
suffer from distribution shift from real-world data. In this paper, we propose
Cross-modality Augmented Multimedia Event Learning (CAMEL), which successfully
utilizes artificially generated multimodal training data and achieves
state-of-the-art performance. We start with two labeled unimodal datasets in
text and image respectively, and generate the missing modality using
off-the-shelf image generators like Stable Diffusion and image captioners like
BLIP. After that, we train the network on the resultant multimodal datasets. In
order to learn robust features that are effective across domains, we devise an
iterative and gradual training strategy. Extensive experiments show that
CAMEL surpasses state-of-the-art (SOTA) baselines on the M2E2 benchmark. On
multimedia events in particular, we outperform the prior SOTA by 4.2% F1 on
event mention identification and by 9.8% F1 on argument identification, which
indicates that CAMEL learns synergistic representations from the two
modalities. Our work demonstrates a recipe to unleash the power of synthetic
training data in structured prediction.
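As a rough illustration of the cross-modality augmentation described in the abstract, the sketch below pairs each unimodal example with a generated counterpart: labeled images are captioned with BLIP, and labeled sentences are rendered into synthetic images with Stable Diffusion. This is a minimal sketch assuming off-the-shelf Hugging Face checkpoints; the checkpoint names, helper functions, and example sentence are illustrative and not taken from the CAMEL codebase.

```python
# Hypothetical sketch of cross-modality data augmentation:
# text-only examples get a synthetic image (Stable Diffusion),
# image-only examples get a synthetic caption (BLIP).
# Event labels of the original unimodal example are simply carried over.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Text -> image: fill in the missing visual modality.
sd = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
).to(device)

def text_to_image(sentence: str) -> Image.Image:
    return sd(sentence, num_inference_steps=30).images[0]

# Image -> text: fill in the missing textual modality.
blip_processor = BlipProcessor.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

def image_to_caption(image: Image.Image) -> str:
    inputs = blip_processor(images=image, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_processor.decode(out[0], skip_special_tokens=True)

# Example (illustrative sentence): the text-only event mention becomes
# a (text, synthetic image) pair for multimodal training.
sentence = "Protesters clashed with police outside the parliament building."
paired_image = text_to_image(sentence)
paired_caption = image_to_caption(paired_image)  # or caption a real labeled image
```

The resulting (text, image) pairs can then be used to train the multimodal event extractor; the iterative, gradual training strategy from the paper is not shown here.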
Related papers
- A Self-Learning Multimodal Approach for Fake News Detection [35.98977478616019]
We introduce a self-learning multimodal model for fake news classification.
The model leverages contrastive learning, a robust method for feature extraction that operates without requiring labeled data.
Our experimental results on a public dataset demonstrate that the proposed model outperforms several state-of-the-art classification approaches.
arXiv Detail & Related papers (2024-12-08T07:41:44Z) - T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus.
We find that the proposed scheme can boost the performance of long video understanding without training on long video samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision,
Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
arXiv Detail & Related papers (2023-12-28T17:57:06Z) - Training on Synthetic Data Beats Real Data in Multimodal Relation
Extraction [8.038421100401132]
In this paper, we consider a novel problem setting, where only unimodal data, either text or image, are available during training.
We aim to train a multimodal relation extraction model on synthetic data that performs well on real multimodal test data.
Our best model trained on completely synthetic images outperforms prior state-of-the-art models trained on real multimodal data by a margin of 3.76% in F1.
arXiv Detail & Related papers (2023-12-05T08:11:34Z) - Diffusion Model is an Effective Planner and Data Synthesizer for
Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z) - Vision Learners Meet Web Image-Text Pairs [32.36188289972377]
In this work, we consider self-supervised pre-training on noisy web sourced image-text paired data.
We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training.
We present a new visual representation pre-training method, MUlti-modal Generator (MUG), that learns from scalable web sourced image-text data.
arXiv Detail & Related papers (2023-01-17T18:53:24Z) - Multi-scale Transformer Network with Edge-aware Pre-training for
Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require a large number of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z) - Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of learning robust feature representations that generalize well across multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets: Kinetics-400, Kinetics-700, Moments-in-Time, ActivityNet and Something-Something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z) - Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z) - Leveraging Uni-Modal Self-Supervised Learning for Multimodal
Audio-Visual Speech Recognition [23.239078852797817]
We leverage uni-modal self-supervised learning to promote multimodal audio-visual speech recognition (AVSR).
In particular, we first train audio and visual encoders on a large-scale uni-modal dataset, then we integrate components of both encoders into a larger multimodal framework.
Our model is experimentally validated on both word-level and sentence-level AVSR tasks.
arXiv Detail & Related papers (2022-02-24T15:12:17Z)