Training Multimedia Event Extraction With Generated Images and Captions
- URL: http://arxiv.org/abs/2306.08966v2
- Date: Fri, 11 Aug 2023 04:55:40 GMT
- Title: Training Multimedia Event Extraction With Generated Images and Captions
- Authors: Zilin Du, Yunxin Li, Xu Guo, Yidan Sun, Boyang Li
- Abstract summary: We propose Cross-modality Augmented Multimedia Event Learning (CAMEL)
We start with two labeled unimodal datasets in text and image respectively, and generate the missing modality using off-the-shelf image generators like Stable Diffusion and image captioners like BLIP.
In order to learn robust features that are effective across domains, we devise an iterative and gradual training strategy.
- Score: 6.291564630983316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contemporary news reporting increasingly features multimedia content,
motivating research on multimedia event extraction. However, the task lacks
annotated multimodal training data and artificially generated training data
suffer from distribution shift from real-world data. In this paper, we propose
Cross-modality Augmented Multimedia Event Learning (CAMEL), which successfully
utilizes artificially generated multimodal training data and achieves
state-of-the-art performance. We start with two labeled unimodal datasets in
text and image respectively, and generate the missing modality using
off-the-shelf image generators like Stable Diffusion and image captioners like
BLIP. After that, we train the network on the resultant multimodal datasets. In
order to learn robust features that are effective across domains, we devise an
iterative and gradual training strategy. Substantial experiments show that
CAMEL surpasses state-of-the-art (SOTA) baselines on the M2E2 benchmark. On
multimedia events in particular, we outperform the prior SOTA by 4.2% F1 on
event mention identification and by 9.8% F1 on argument identification, which
indicates that CAMEL learns synergistic representations from the two
modalities. Our work demonstrates a recipe to unleash the power of synthetic
training data in structured prediction.
Related papers
- MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens.
MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision,
Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
arXiv Detail & Related papers (2023-12-28T17:57:06Z) - Training on Synthetic Data Beats Real Data in Multimodal Relation
Extraction [8.038421100401132]
In this paper, we consider a novel problem setting, where only unimodal data, either text or image, are available during training.
We aim to train a multimodal relation from synthetic data that perform well on real multimodal test data.
Our best model trained on completely synthetic images outperforms prior state-of-the-art models trained on real multimodal data by a margin of 3.76% in F1.
arXiv Detail & Related papers (2023-12-05T08:11:34Z) - Diffusion Model is an Effective Planner and Data Synthesizer for
Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (textscMTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find textscMTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z) - Vision Learners Meet Web Image-Text Pairs [32.36188289972377]
In this work, we consider self-supervised pre-training on noisy web sourced image-text paired data.
We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text constrastive training.
We present a new visual representation pre-training method, MUlti-modal Generator(MUG), that learns from scalable web sourced image-text data.
arXiv Detail & Related papers (2023-01-17T18:53:24Z) - Multi-scale Transformer Network with Edge-aware Pre-training for
Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require a large number of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z) - Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of robust feature representations, aiming to generalize well on multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets, Kinetics-400, Kinetics-700, Moments-in-Time, Activitynet and Something-something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z) - Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE)
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z) - Leveraging Uni-Modal Self-Supervised Learning for Multimodal
Audio-Visual Speech Recognition [23.239078852797817]
We leverage uni-modal self-supervised learning to promote the multimodal audio-visual speech recognition (AVSR)
In particular, we first train audio and visual encoders on a large-scale uni-modal dataset, then we integrate components of both encoders into a larger multimodal framework.
Our model is experimentally validated on both word-level and sentence-level AVSR tasks.
arXiv Detail & Related papers (2022-02-24T15:12:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.