Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction
- URL: http://arxiv.org/abs/2312.03025v1
- Date: Tue, 5 Dec 2023 08:11:34 GMT
- Title: Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction
- Authors: Zilin Du, Haoxin Li, Xu Guo, Boyang Li
- Abstract summary: In this paper, we consider a novel problem setting, where only unimodal data, either text or image, are available during training.
We aim to train a multimodal relation extraction model on synthetic data that performs well on real multimodal test data.
Our best model trained on completely synthetic images outperforms prior state-of-the-art models trained on real multimodal data by a margin of 3.76% in F1.
- Score: 8.038421100401132
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of multimodal relation extraction has attracted significant research
attention, but progress is constrained by the scarcity of available training
data. One natural thought is to extend existing datasets with cross-modal
generative models. In this paper, we consider a novel problem setting, where
only unimodal data, either text or image, are available during training. We aim
to train a multimodal classifier from synthetic data that performs well on real
multimodal test data. However, training with synthetic data suffers from two
obstacles: lack of data diversity and label information loss. To alleviate the
issues, we propose Mutual Information-aware Multimodal Iterated Relational dAta
GEneration (MI2RAGE), which applies Chained Cross-modal Generation (CCG) to
promote diversity in the generated data and exploits a teacher network to
select valuable training samples with high mutual information with the
ground-truth labels. Comparing our method to direct training on synthetic data,
we observed a significant improvement of 24.06% F1 with synthetic text and
26.42% F1 with synthetic images. Notably, our best model trained on completely
synthetic images outperforms prior state-of-the-art models trained on real
multimodal data by a margin of 3.76% in F1. Our codebase will be made available
upon acceptance.
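The abstract describes MI2RAGE only at a high level: Chained Cross-modal Generation (CCG) grows a diverse synthetic pool, and a teacher network keeps samples carrying high mutual information with their labels. The Python sketch below is a rough illustration of that loop, not the authors' implementation; `text_to_image`, `image_to_text`, `teacher`, the threshold `tau`, and the chain depth are all hypothetical placeholders, and the teacher's label score is assumed here as a proxy for mutual information.

```python
# Hypothetical sketch of the MI2RAGE loop described in the abstract.
# CCG alternates text -> image -> text -> ... to diversify synthetic
# data; a teacher network then keeps only samples whose score (a proxy
# for mutual information with the label) clears a threshold.

def ccg_expand(texts, labels, chain_depth, text_to_image, image_to_text):
    """Chain cross-modal generators to grow a diverse synthetic pool."""
    pool = []
    for text, label in zip(texts, labels):
        t = text
        for _ in range(chain_depth):
            img = text_to_image(t)        # e.g. a diffusion model
            pool.append((t, img, label))  # synthetic (text, image) pair
            t = image_to_text(img)        # e.g. an image captioner
    return pool

def mi_filter(pool, teacher, tau):
    """Keep samples the teacher judges informative about their label.

    teacher(text, img, label) is assumed to return a score, e.g. the
    teacher's log-probability of the correct label, standing in for
    mutual information with the ground truth.
    """
    return [(t, img, y) for (t, img, y) in pool if teacher(t, img, y) >= tau]

def mi2rage(texts, labels, text_to_image, image_to_text, teacher,
            tau=0.0, chain_depth=3):
    pool = ccg_expand(texts, labels, chain_depth, text_to_image, image_to_text)
    return mi_filter(pool, teacher, tau)
```

The chaining step is what the abstract credits for data diversity: each pass through the opposite modality perturbs the sample, so later links in the chain drift further from the seed text than a single text-to-image pass would.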
Related papers
- Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data? [8.775988650381397]
Training medical vision-language pre-training models requires paired, high-quality image-text data.
Recent advancements in Large Language Models have made it possible to generate large-scale synthetic image-text pairs.
We propose an automated pipeline to build a diverse, high-quality synthetic dataset.
arXiv Detail & Related papers (2024-10-17T13:11:07Z) - Multimodal Misinformation Detection by Learning from Synthetic Data with Multimodal LLMs [13.684959490938269]
We propose learning from synthetic data for detecting real-world multimodal misinformation through two model-agnostic data selection methods.
Experiments show that our method enhances the performance of a small MLLM on real-world fact-checking datasets.
arXiv Detail & Related papers (2024-09-29T11:01:14Z) - SAU: A Dual-Branch Network to Enhance Long-Tailed Recognition via Generative Models [9.340077455871736]
Long-tailed distributions in image recognition pose a considerable challenge due to the severe imbalance between a few dominant head classes and the many under-represented tail classes.
Recently, large generative models have been used to create synthetic data for image classification.
We propose the use of synthetic data as a complement to long-tailed datasets to eliminate the impact of data imbalance.
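As a toy illustration of complementing a long-tailed dataset with synthetic data, one can top every minority class up to the size of the largest class. The sketch below is not SAU's dual-branch method; `generate` stands in for any class-conditional generative model and is purely hypothetical.

```python
# Hypothetical sketch: rebalance a long-tailed dataset by topping up
# each minority class with synthetic samples from a class-conditional
# generator. generate(cls) is an illustrative placeholder.
from collections import Counter

def rebalance(samples, labels, generate):
    counts = Counter(labels)
    target = max(counts.values())          # size of the largest class
    for cls, n in counts.items():
        for _ in range(target - n):        # fill the gap for tail classes
            samples.append(generate(cls))  # synthetic sample for cls
            labels.append(cls)
    return samples, labels
```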
arXiv Detail & Related papers (2024-08-29T05:33:59Z) - MDM: Advancing Multi-Domain Distribution Matching for Automatic Modulation Recognition Dataset Synthesis [35.07663680944459]
Deep learning technology has been successfully introduced into Automatic Modulation Recognition (AMR) tasks.
The success of deep learning is largely attributable to training on large-scale datasets.
To reduce this dependence on large quantities of data, some researchers have proposed data distillation.
arXiv Detail & Related papers (2024-08-05T14:16:54Z) - UnitedHuman: Harnessing Multi-Source Data for High-Resolution Human
Generation [59.77275587857252]
A holistic human dataset inevitably contains insufficient, low-resolution information on local body parts.
We propose to use multi-source datasets with images of various resolutions to jointly learn a high-resolution human generative model.
arXiv Detail & Related papers (2023-09-25T17:58:46Z) - Image Captions are Natural Prompts for Text-to-Image Models [70.30915140413383]
We analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts.
We propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data.
Our method significantly improves the performance of models trained on synthetic training data.
arXiv Detail & Related papers (2023-07-17T14:38:11Z) - Training Multimedia Event Extraction With Generated Images and Captions [6.291564630983316]
We propose Cross-modality Augmented Multimedia Event Learning (CAMEL).
We start with two labeled unimodal datasets in text and image respectively, and generate the missing modality using off-the-shelf image generators like Stable Diffusion and image captioners like BLIP.
In order to learn robust features that are effective across domains, we devise an iterative and gradual training strategy.
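For a concrete sense of the off-the-shelf generation step this entry describes, here is a minimal sketch using the Hugging Face diffusers and transformers libraries; the checkpoints are common public ones chosen for illustration and are not necessarily the ones CAMEL used.

```python
# Sketch: synthesize the missing modality for unimodal training data,
# in the spirit of the CAMEL entry above. Checkpoints are illustrative.
import torch
from diffusers import StableDiffusionPipeline
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text -> image: pair a text-only example with a generated image.
sd = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5").to(device)
image = sd("a protester holding a sign at a rally").images[0]

# Image -> text: pair an image-only example with a generated caption.
processor = BlipProcessor.from_pretrained(
    "Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)
inputs = processor(images=image, return_tensors="pt").to(device)
caption = processor.decode(blip.generate(**inputs)[0],
                           skip_special_tokens=True)
```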
arXiv Detail & Related papers (2023-06-15T09:01:33Z) - Multi-scale Transformer Network with Edge-aware Pre-training for
Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require a large number of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z) - FedDM: Iterative Distribution Matching for Communication-Efficient
Federated Learning [87.08902493524556]
Federated learning (FL) has recently attracted increasing attention from academia and industry.
We propose FedDM to build the global training objective from multiple local surrogate functions.
In detail, we construct synthetic sets of data on each client to locally match the loss landscape from original data.
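As a rough, simplified illustration of the synthetic-set idea in this entry: below, a small synthetic set is optimized so its mean feature matches the client's real data, a first-moment stand-in for FedDM's actual loss-landscape matching; all names and hyperparameters are assumptions.

```python
# Hypothetical sketch: learn a small synthetic set on one client by
# matching feature statistics of the real local data. FedDM's real
# objective matches the local loss landscape; mean matching is a
# deliberately simplified stand-in.
import torch

def condense(real_features, num_synth, steps=500, lr=0.1):
    """real_features: (N, d) tensor of the client's (featurized) data."""
    d = real_features.shape[1]
    synth = torch.randn(num_synth, d, requires_grad=True)
    opt = torch.optim.SGD([synth], lr=lr)
    target = real_features.mean(dim=0).detach()
    for _ in range(steps):
        loss = (synth.mean(dim=0) - target).pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return synth.detach()  # sent to the server instead of raw data
```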
arXiv Detail & Related papers (2022-07-20T04:55:18Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person
Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained increasing attention due to its widespread applications in video surveillance.
Unfortunately, mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)