EXMODD: An EXplanatory Multimodal Open-Domain Dialogue dataset
- URL: http://arxiv.org/abs/2310.10967v1
- Date: Tue, 17 Oct 2023 03:28:29 GMT
- Title: EXMODD: An EXplanatory Multimodal Open-Domain Dialogue dataset
- Authors: Hang Yin, Pinren Lu, Ziang Li, Bin Sun, Kan Li
- Abstract summary: We propose a Multimodal Data Construction Framework (MDCF) to alleviate the significant human and resource expenditure in data collection.
MDCF automatically provides explanations for a given image and its corresponding dialogue, offering a certain degree of interpretability.
Experiments indicate a positive correlation between the model's ability to generate accurate understandings and high-quality responses.
- Score: 20.445453185198186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The need for high-quality data has been a key issue hindering research on dialogue tasks. Recent studies try to build datasets through manual annotation, web crawling, and large pre-trained models. However, manually created data is expensive, and data collected from the internet often includes generic responses, meaningless statements, and toxic dialogues. Automatic data generation through large models is a cost-effective method, but for open-domain multimodal dialogue tasks there are still three drawbacks: 1) there is currently no open-source large model that can accept multimodal input; 2) the content generated by the model lacks interpretability; 3) the generated data is usually difficult to quality-control and requires extensive resources to collect. To alleviate the significant human and resource expenditure in data collection, we propose a Multimodal Data Construction Framework (MDCF). MDCF designs proper prompts to spur the large-scale pre-trained language model to generate well-formed and satisfactory content. Additionally, MDCF automatically provides an explanation for a given image and its corresponding dialogue, which offers a certain degree of interpretability and facilitates manual follow-up quality inspection. Based on this, we release an Explanatory Multimodal Open-Domain dialogue dataset (EXMODD). Experiments indicate a positive correlation between the model's ability to generate accurate understandings and high-quality responses. Our code and data can be found at https://github.com/poplpr/EXMODD.
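The abstract only sketches how MDCF prompts a pre-trained language model and keeps the generated explanation alongside each dialogue for later quality inspection. The snippet below is a minimal illustrative sketch of that generation step, not the released implementation; the prompt wording, the JSON output format, and the helper names (build_prompt, call_llm, generate_example) are assumptions made here for illustration (see the repository above for the actual code).

```python
# Minimal sketch of an MDCF-style generation step. Hypothetical names and
# prompt format; the real prompts and model calls used by EXMODD may differ.
import json
from typing import List

def build_prompt(image_description: str, dialogue: List[str]) -> str:
    """Compose a prompt asking the model for both an explanation and a response."""
    history = "\n".join(f"Speaker {i % 2 + 1}: {turn}" for i, turn in enumerate(dialogue))
    return (
        "You are given an image and a dialogue about it.\n"
        f"Image description: {image_description}\n"
        f"Dialogue so far:\n{history}\n\n"
        "First explain how the image relates to the dialogue, then write the next reply.\n"
        'Answer as JSON: {"explanation": "...", "response": "..."}'
    )

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a large pre-trained language model (e.g. a chat API)."""
    raise NotImplementedError("Wire this to the LLM of your choice.")

def generate_example(image_description: str, dialogue: List[str]) -> dict:
    raw = call_llm(build_prompt(image_description, dialogue))
    record = json.loads(raw)  # {"explanation": ..., "response": ...}
    record["image_description"] = image_description
    record["dialogue"] = dialogue
    return record  # the explanation is kept for manual follow-up quality inspection
```

Retaining the explanation with each generated response is what makes the manual follow-up quality inspection described in the abstract practical.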
Related papers
- What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios.
Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data to improve long-context capability.
We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent.
Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data.
arXiv Detail & Related papers (2024-09-03T13:30:00Z) - Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags [28.368960723666458]
Multimodal Large Language Models (MLLMs) struggle with critical problems when required to provide a precise and detailed response to a visual instruction.
Prior remedies show effectiveness in mitigating these issues, but at the expensive cost of collecting a vast amount of new data.
We propose to enhance the mapping with retrieval-augmented tag tokens, which contain rich object-aware information.
arXiv Detail & Related papers (2024-06-16T08:20:12Z) - Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning [0.0]
Existing document understanding models tend to generate answers with a single word or phrase directly.
We use Multi-modal Large Language Models (MLLMs) to generate step-wise question-and-answer pairs for document images.
We then use the generated high-quality data to train a humanized document understanding and reasoning model, dubbed DocAssistant.
arXiv Detail & Related papers (2024-02-26T01:17:50Z) - WanJuan: A Comprehensive Multimodal Dataset for Advancing English and
Chinese Large Models [69.96148259273065]
"Wan Juan" is a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources.
It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale.
arXiv Detail & Related papers (2023-08-21T14:40:48Z) - Read, Look or Listen? What's Needed for Solving a Multimodal Dataset [7.0430001782867]
We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it.
We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality.
We analyze the MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, but also with auditory speaker identification.
arXiv Detail & Related papers (2023-07-06T08:02:45Z) - Diffusion Model is an Effective Planner and Data Synthesizer for
Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z) - q2d: Turning Questions into Dialogs to Teach Models How to Search [11.421839177607147]
We propose q2d: an automatic data generation pipeline that generates information-seeking dialogs from questions.
Unlike previous approaches that relied on human-written dialogs with search queries, our method automatically generates query-based grounded dialogs with better control and scale.
arXiv Detail & Related papers (2023-04-27T16:39:15Z) - AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z) - MuRAG: Multimodal Retrieval-Augmented Generator for Open Question
Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG).
MuRAG accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
arXiv Detail & Related papers (2022-10-06T13:58:03Z) - A Model-Agnostic Data Manipulation Method for Persona-based Dialogue
Generation [107.82729587882397]
It is expensive to scale up current persona-based dialogue datasets.
Each data sample in this task is more complex to learn with than conventional dialogue data.
We propose a data manipulation method that is model-agnostic and can be paired with any persona-based dialogue generation model.
arXiv Detail & Related papers (2022-04-21T03:49:54Z)