SPRING: Situated Conversation Agent Pretrained with Multimodal Questions
from Incremental Layout Graph
- URL: http://arxiv.org/abs/2301.01949v1
- Date: Thu, 5 Jan 2023 08:03:47 GMT
- Title: SPRING: Situated Conversation Agent Pretrained with Multimodal Questions
from Incremental Layout Graph
- Authors: Yuxing Long, Binyuan Hui, Fulong Ye, Yanyang Li, Zhuoxin Han, Caixia
Yuan, Yongbin Li, Xiaojie Wang
- Abstract summary: We propose a Situated Conversation Agent Pretrained with Multimodal Questions from INcremental Layout Graph (SPRING).
All QA pairs utilized during pretraining are generated from novel Incremental Layout Graphs (ILG).
Experimental results verify SPRING's effectiveness, showing that it significantly outperforms state-of-the-art approaches on both SIMMC 1.0 and SIMMC 2.0 datasets.
- Score: 16.275155481031348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing multimodal conversation agents have shown impressive abilities to
locate absolute positions or retrieve attributes in simple scenarios, but they
fail to perform well when complex relative positions and information alignments
are involved, which poses a bottleneck in response quality. In this paper, we
propose a Situated Conversation Agent Pretrained with Multimodal Questions from
INcremental Layout Graph (SPRING), which can reason over multi-hop spatial
relations and connect them with visual attributes in crowded situated
scenarios. Specifically, we design two types of Multimodal Question
Answering (MQA) tasks to pretrain the agent. All QA pairs utilized during
pretraining are generated from novel Incremental Layout Graphs (ILG). QA pair
difficulty labels automatically annotated by ILG are used to promote MQA-based
Curriculum Learning. Experimental results verify SPRING's effectiveness,
showing that it significantly outperforms state-of-the-art approaches on both
SIMMC 1.0 and SIMMC 2.0 datasets.
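As a concrete illustration of how ILG-annotated difficulty labels could drive MQA-based Curriculum Learning, the sketch below orders generated QA pairs into easy-to-hard pretraining stages. This is a minimal, hypothetical sketch: the `QAPair` structure, the hop-count difficulty field, and the `curriculum_batches` helper are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: ordering ILG-generated QA pairs for curriculum learning.
# QAPair, its fields, and curriculum_batches are illustrative names only;
# they are not taken from the SPRING codebase.
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    question: str    # multimodal question rendered from the layout graph
    answer: str      # answer read off the graph's attributes/relations
    difficulty: int  # e.g. number of spatial hops needed to answer


def curriculum_batches(pairs: List[QAPair], stages: int = 3) -> List[List[QAPair]]:
    """Split QA pairs into easy-to-hard pretraining stages by difficulty label."""
    ordered = sorted(pairs, key=lambda p: p.difficulty)
    size = max(1, len(ordered) // stages)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


if __name__ == "__main__":
    toy = [
        QAPair("Which jacket is left of the red sofa?", "the blue one", 2),
        QAPair("What colour is the hat?", "green", 1),
        QAPair("Which shirt is between the lamp and the mirror?", "the striped one", 3),
    ]
    for stage, batch in enumerate(curriculum_batches(toy), start=1):
        print(f"stage {stage}: difficulties {[p.difficulty for p in batch]}")
```

Sorting by hop count reflects the curriculum intuition of moving from single-hop attribute questions to multi-hop relational ones; the actual difficulty annotation and staging used in SPRING may differ.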
Related papers
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [102.31558123570437]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs).
We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z) - CT2C-QA: Multimodal Question Answering over Chinese Text, Table and Chart [26.54501344351476]
We present CT2C-QA, a pioneering Chinese reasoning-based QA dataset that includes an extensive collection of text, tables, and charts.
Our dataset simulates real webpages and serves as a challenging test of a model's ability to analyze and reason over multimodal data.
arXiv Detail & Related papers (2024-10-28T18:13:14Z) - Sweeping Heterogeneity with Smart MoPs: Mixture of Prompts for LLM Task
Adaptation [45.90925587972781]
Large Language Models (LLMs) have the ability to solve a variety of tasks, such as text summarization and answering mathematical questions.
Due to high computational costs, the current trend is to use prompt instruction tuning to better adjust monolithic, pretrained LLMs for new -- but often individual -- downstream tasks.
Mixture of Prompts (MoPs) can simultaneously mitigate prompt training "interference" in multi-task, multi-source scenarios.
arXiv Detail & Related papers (2023-10-04T14:11:12Z) - Blind Image Quality Assessment via Vision-Language Correspondence: A
Multitask Learning Perspective [93.56647950778357]
Blind image quality assessment (BIQA) predicts the human perception of image quality without any reference information.
We develop a general and automated multitask learning scheme for BIQA to exploit auxiliary knowledge from other tasks.
arXiv Detail & Related papers (2023-03-27T07:58:09Z) - UniKGQA: Unified Retrieval and Reasoning for Solving Multi-hop Question
Answering Over Knowledge Graph [89.98762327725112]
Multi-hop Question Answering over Knowledge Graph (KGQA) aims to find the answer entities that are multiple hops away from the topic entities mentioned in a natural language question.
We propose UniKGQA, a novel approach for multi-hop KGQA task, by unifying retrieval and reasoning in both model architecture and parameter learning.
arXiv Detail & Related papers (2022-12-02T04:08:09Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z) - MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media
Knowledge Extraction and Grounding [131.8797942031366]
We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
arXiv Detail & Related papers (2021-12-20T18:23:30Z) - MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided
Multimodal Attention for Textbook Question Answering [7.367945534481411]
We propose a novel model named MoCA, which incorporates multi-stage domain pretraining and multimodal cross attention for the Textbook Question Answering task.
The experimental results show the superiority of our model, which outperforms the state-of-the-art methods by 2.21% and 2.43% on the validation and test splits, respectively.
arXiv Detail & Related papers (2021-12-06T07:58:53Z) - Generating Diverse and Consistent QA pairs from Contexts with
Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)