FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering
- URL: http://arxiv.org/abs/2412.07030v3
- Date: Sat, 01 Feb 2025 10:26:39 GMT
- Title: FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering
- Authors: Amirhossein Abaskohi, Spandana Gella, Giuseppe Carenini, Issam H. Laradji,
- Abstract summary: We propose a novel methodology for creating a high-quality dataset that enables training models for multimodal multihop question answering.
Our approach consists of a 5-stage pipeline that involves acquiring relevant multimodal documents from Wikipedia, synthetically generating high-level questions and answers, and validating them through rigorous criteria to ensure quality data.
Our results demonstrate that, with an equal sample size, models trained on our synthesized data outperform those trained on human-collected data by 1.9 in exact match (EM) on average.
- Score: 21.545569307511183
- License:
- Abstract: Multimodal multihop question answering is a complex task that requires reasoning over multiple sources of information, such as images and text, to answer questions. While there has been significant progress in visual question answering, the multihop setting remains unexplored due to the lack of high-quality datasets. Current methods focus on single-hop question answering or a single modality, which makes them unsuitable for real-world scenarios such as analyzing multimodal educational materials, summarizing lengthy academic articles, or interpreting scientific studies that combine charts, images, and text. To address this gap, we propose a novel methodology, introducing the first framework for creating a high-quality dataset that enables training models for multimodal multihop question answering. Our approach consists of a 5-stage pipeline that involves acquiring relevant multimodal documents from Wikipedia, synthetically generating high-level questions and answers, and validating them through rigorous criteria to ensure quality data. We evaluate our methodology by training models on our synthesized dataset and testing on two benchmarks, our results demonstrate that, with an equal sample size, models trained on our synthesized data outperform those trained on human-collected data by 1.9 in exact match (EM) on average. We believe our data synthesis method will serve as a strong foundation for training and evaluating multimodal multihop question answering models.
Related papers
- mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data [71.352883755806]
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space.
However, the limited labeled multimodal data often hinders embedding performance.
Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck.
arXiv Detail & Related papers (2025-02-12T15:03:33Z) - What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Long language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios.
Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement.
We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent.
Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z) - Training on Synthetic Data Beats Real Data in Multimodal Relation
Extraction [8.038421100401132]
In this paper, we consider a novel problem setting, where only unimodal data, either text or image, are available during training.
We aim to train a multimodal relation from synthetic data that perform well on real multimodal test data.
Our best model trained on completely synthetic images outperforms prior state-of-the-art models trained on real multimodal data by a margin of 3.76% in F1.
arXiv Detail & Related papers (2023-12-05T08:11:34Z) - Multimodal Graph Learning for Generative Tasks [89.44810441463652]
Multimodal learning combines multiple data modalities, broadening the types and complexity of data our models can utilize.
We propose Multimodal Graph Learning (MMGL), a framework for capturing information from multiple multimodal neighbors with relational structures among them.
arXiv Detail & Related papers (2023-10-11T13:25:03Z) - Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset
and Comprehensive Framework [51.44863255495668]
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence.
We present Multi-Modal Reasoning(COCO-MMR) dataset, a novel dataset that encompasses an extensive collection of open-ended questions.
We propose innovative techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders.
arXiv Detail & Related papers (2023-07-24T08:58:25Z) - Read, Look or Listen? What's Needed for Solving a Multimodal Dataset [7.0430001782867]
We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it.
We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality.
We analyze the MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, but also with auditory speaker identification.
arXiv Detail & Related papers (2023-07-06T08:02:45Z) - Diffusion Model is an Effective Planner and Data Synthesizer for
Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (textscMTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find textscMTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z) - How Well Do Multi-hop Reading Comprehension Models Understand Date
Information? [31.243088887839257]
The ability of multi-hop models to perform step-by-step reasoning when finding an answer to a comparison question remains unclear.
It is also unclear how questions about the internal reasoning process are useful for training and evaluating question-answering (QA) systems.
arXiv Detail & Related papers (2022-10-11T07:24:07Z) - Multimodal Image Synthesis and Editing: The Generative AI Era [131.9569600472503]
multimodal image synthesis and editing has become a hot research topic in recent years.
We comprehensively contextualize the advance of the recent multimodal image synthesis and editing.
We describe benchmark datasets and evaluation metrics as well as corresponding experimental results.
arXiv Detail & Related papers (2021-12-27T10:00:16Z) - SRQA: Synthetic Reader for Factoid Question Answering [21.28441702154528]
We introduce a new model called SRQA, which means Synthetic Reader for Factoid Question Answering.
This model enhances the question answering system in the multi-document scenario from three aspects.
We perform SRQA on the WebQA dataset, and experiments show that our model outperforms the state-of-the-art models.
arXiv Detail & Related papers (2020-09-02T13:16:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.