Training on Synthetic Data Beats Real Data in Multimodal Relation
Extraction
- URL: http://arxiv.org/abs/2312.03025v1
- Date: Tue, 5 Dec 2023 08:11:34 GMT
- Title: Training on Synthetic Data Beats Real Data in Multimodal Relation
Extraction
- Authors: Zilin Du, Haoxin Li, Xu Guo, Boyang Li
- Abstract summary: In this paper, we consider a novel problem setting, where only unimodal data, either text or image, are available during training.
We aim to train a multimodal classifier on synthetic data that performs well on real multimodal test data.
Our best model trained on completely synthetic images outperforms prior state-of-the-art models trained on real multimodal data by a margin of 3.76% in F1.
- Score: 8.038421100401132
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of multimodal relation extraction has attracted significant research
attention, but progress is constrained by the scarcity of available training
data. One natural thought is to extend existing datasets with cross-modal
generative models. In this paper, we consider a novel problem setting, where
only unimodal data, either text or image, are available during training. We aim
to train a multimodal classifier from synthetic data that performs well on real
multimodal test data. However, training with synthetic data suffers from two
obstacles: lack of data diversity and label information loss. To alleviate the
issues, we propose Mutual Information-aware Multimodal Iterated Relational dAta
GEneration (MI2RAGE), which applies Chained Cross-modal Generation (CCG) to
promote diversity in the generated data and exploits a teacher network to
select valuable training samples with high mutual information with the
ground-truth labels. Comparing our method to direct training on synthetic data,
we observed a significant improvement of 24.06% F1 with synthetic text and
26.42% F1 with synthetic images. Notably, our best model trained on completely
synthetic images outperforms prior state-of-the-art models trained on real
multimodal data by a margin of 3.76% in F1. Our codebase will be made available
upon acceptance.
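The paper's code is not yet released, but the two components named in the abstract can be pictured with a minimal, hypothetical sketch: Chained Cross-modal Generation alternates a text-to-image and an image-to-text generator so each hop drifts further from the seed, and the teacher network keeps only samples on which it assigns high probability to the gold label, a common proxy for high mutual information with the label. The callables `text_to_image`, `image_to_text`, and `teacher` are placeholders, not the authors' implementation.

```python
import heapq

def chained_crossmodal_generation(seed_text, text_to_image, image_to_text, depth=3):
    """Alternate text -> image -> text -> ... to diversify synthetic pairs.

    `text_to_image` and `image_to_text` are placeholder generators (e.g. a
    diffusion model and a captioning model); each hop in the chain drifts
    further from the seed, which is where the diversity comes from.
    """
    samples, text = [], seed_text
    for _ in range(depth):
        image = text_to_image(text)    # synthesize an image from the text
        samples.append((text, image))  # one synthetic multimodal pair
        text = image_to_text(image)    # re-caption to start the next hop
    return samples

def select_by_teacher(samples, labels, teacher, keep_ratio=0.5):
    """Keep the samples whose teacher probability on the gold label is highest.

    Teacher confidence p(y | x) stands in here for the mutual information
    between a sample and its ground-truth label.
    """
    scored = [(teacher(x)[y], x, y) for x, y in zip(samples, labels)]
    k = max(1, int(len(scored) * keep_ratio))
    return [(x, y) for _, x, y in heapq.nlargest(k, scored, key=lambda t: t[0])]
```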
Related papers
- mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data [71.352883755806]
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space.
However, the scarcity of labeled multimodal data often hinders embedding performance.
Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck.
arXiv Detail & Related papers (2025-02-12T15:03:33Z)
- FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering [21.545569307511183]
We propose a novel methodology for creating a high-quality dataset that enables training models for multimodal multihop question answering.
Our approach consists of a 5-stage pipeline that involves acquiring relevant multimodal documents from Wikipedia, synthetically generating high-level questions and answers, and validating them against rigorous criteria to ensure data quality.
Our results demonstrate that, with an equal sample size, models trained on our synthesized data outperform those trained on human-collected data by an average of 1.9 points in exact match (EM).
arXiv Detail & Related papers (2024-12-09T22:35:44Z)
- Multimodal Misinformation Detection by Learning from Synthetic Data with Multimodal LLMs [13.684959490938269]
We propose learning from synthetic data for detecting real-world multimodal misinformation through two model-agnostic data selection methods.
Experiments show that our method enhances the performance of a small MLLM on real-world fact-checking datasets.
arXiv Detail & Related papers (2024-09-29T11:01:14Z)
- MDM: Advancing Multi-Domain Distribution Matching for Automatic Modulation Recognition Dataset Synthesis [35.07663680944459]
Deep learning technology has been successfully introduced into Automatic Modulation Recognition (AMR) tasks.
This success is largely attributable to training on large-scale datasets.
To reduce the need for such large quantities of data, some researchers have proposed data distillation.
arXiv Detail & Related papers (2024-08-05T14:16:54Z)
- Image Captions are Natural Prompts for Text-to-Image Models [70.30915140413383]
We analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts.
We propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data.
Our method significantly improves the performance of models trained on synthetic training data.
arXiv Detail & Related papers (2023-07-17T14:38:11Z)
- Training Multimedia Event Extraction With Generated Images and Captions [6.291564630983316]
We propose Cross-modality Augmented Multimedia Event Learning (CAMEL)
We start with two labeled unimodal datasets in text and image respectively, and generate the missing modality using off-the-shelf image generators like Stable Diffusion and image captioners like BLIP.
In order to learn robust features that are effective across domains, we devise an iterative and gradual training strategy.
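As a concrete (but assumed) rendering of the generation step above, the missing modality can be filled in with off-the-shelf checkpoints from the `diffusers` and `transformers` libraries; the model IDs below are illustrative choices, not necessarily the ones used in CAMEL.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text -> image: synthesize the missing image for a text-only example.
sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

def generate_image(text: str) -> Image.Image:
    return sd(text).images[0]

# Image -> text: caption the missing text for an image-only example.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

def generate_caption(image: Image.Image) -> str:
    inputs = processor(images=image, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)
```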
arXiv Detail & Related papers (2023-06-15T09:01:33Z)
- Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require large amounts of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z)
- FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning [87.08902493524556]
Federated learning (FL) has recently attracted increasing attention from academia and industry.
We propose FedDM to build the global training objective from multiple local surrogate functions.
Specifically, we construct synthetic sets of data on each client to locally match the loss landscape of the original data.
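For intuition about the per-client construction, the sketch below distills a client's real batch into a small synthetic set by gradient matching, a standard dataset-distillation recipe used here as a stand-in for FedDM's distribution matching; `model` is assumed to be the shared classifier.

```python
import torch
import torch.nn.functional as F

def distill_client_set(model, real_x, real_y, n_syn=32, steps=200, lr=0.1):
    """Learn a small synthetic set whose gradients on the shared model
    approximate those of the client's real data (a gradient-matching
    stand-in, not FedDM's released code)."""
    syn_x = torch.randn(n_syn, *real_x.shape[1:], requires_grad=True)
    syn_y = real_y[torch.randint(len(real_y), (n_syn,))]  # reuse real labels
    opt = torch.optim.SGD([syn_x], lr=lr)

    params = [p for p in model.parameters() if p.requires_grad]
    real_grads = torch.autograd.grad(
        F.cross_entropy(model(real_x), real_y), params)

    for _ in range(steps):
        syn_grads = torch.autograd.grad(
            F.cross_entropy(model(syn_x), syn_y), params, create_graph=True)
        # Match synthetic-data gradients to the fixed real-data gradients.
        loss = sum(F.mse_loss(s, r) for s, r in zip(syn_grads, real_grads))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return syn_x.detach(), syn_y
```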
arXiv Detail & Related papers (2022-07-20T04:55:18Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person
Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
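For intuition only, pairwise fusion in this spirit can be sketched as cross-attention between the two modalities of each bimodal pair; this is an assumed simplification, not the BBFN authors' architecture.

```python
import torch
import torch.nn as nn

class BimodalFusion(nn.Module):
    """Cross-attention fusion of one modality pair (assumed simplification)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Modality `a` queries modality `b`; residual + norm keeps `a`'s view.
        fused, _ = self.attn(query=a, key=b, value=b)
        return self.norm(a + fused)

class PairwiseFusionModel(nn.Module):
    """Fuses two bimodal pairs, e.g. (text, audio) and (text, vision)."""
    def __init__(self, dim: int = 256, n_classes: int = 3):
        super().__init__()
        self.fuse_ta = BimodalFusion(dim)
        self.fuse_tv = BimodalFusion(dim)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, text, audio, vision):
        ta = self.fuse_ta(text, audio).mean(dim=1)   # pool over sequence length
        tv = self.fuse_tv(text, vision).mean(dim=1)
        return self.head(torch.cat([ta, tv], dim=-1))
```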
arXiv Detail & Related papers (2021-07-28T23:33:42Z)