MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided
Multimodal Attention for Textbook Question Answering
- URL: http://arxiv.org/abs/2112.02839v1
- Date: Mon, 6 Dec 2021 07:58:53 GMT
- Title: MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided
Multimodal Attention for Textbook Question Answering
- Authors: Fangzhi Xu, Qika Lin, Jun Liu, Lingling Zhang, Tianzhe Zhao, Qi Chai,
Yudai Pan
- Abstract summary: We propose a novel model named MoCA, which incorporates multi-stage domain pretraining and multimodal cross attention for the Textbook Question Answering task.
The experimental results show the superiority of our model, which outperforms state-of-the-art methods by 2.21% and 2.43% on the validation and test splits, respectively.
- Score: 7.367945534481411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Textbook Question Answering (TQA) is a complex multimodal task to infer
answers given large context descriptions and abundant diagrams. Compared with
Visual Question Answering (VQA), TQA contains a large number of uncommon
terminologies and diverse diagram inputs, which pose new challenges to the
representation capability of language models for domain-specific spans and
push multimodal fusion to a more complex level. To tackle the above
issues, we propose a novel model named MoCA, which incorporates multi-stage
domain pretraining and multimodal cross attention for the TQA task. First, we
introduce a multi-stage domain pretraining module that performs unsupervised
post-pretraining with a span-mask strategy followed by supervised pre-finetuning.
For domain post-pretraining in particular, we propose a heuristic generation
algorithm to make use of the terminology corpus. Second, to fully exploit the
rich inputs of context and diagrams, we propose cross-guided multimodal
attention to update the features of text, question diagram and instructional
diagram based on a progressive strategy. Further, a dual gating mechanism is
adopted to improve the model ensemble. The experimental results show the
superiority of our model, which outperforms state-of-the-art methods by
2.21% and 2.43% on the validation and test splits, respectively.
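The abstract describes two mechanisms that lend themselves to a concrete illustration: a cross-guided attention step that progressively updates text features against the question diagram and the instructional diagram, and a dual gating mechanism that fuses the resulting streams. The PyTorch sketch below is only a minimal reading of that description; the feature dimensions, the update order, and the gating form are assumptions, not the authors' implementation.

```python
# Minimal sketch of cross-guided attention followed by a dual gate.
# Illustrative only: dimensions, update order, and gating are assumed.
import torch
import torch.nn as nn

class CrossGuidedAttention(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.text_to_qd = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_id = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text, q_diag, i_diag):
        # Progressively update text features: attend first to the question
        # diagram, then to the instructional diagram.
        text_qd, _ = self.text_to_qd(text, q_diag, q_diag)
        text_id, _ = self.text_to_id(text_qd, i_diag, i_diag)
        # Dual gate: learn how much of each attended stream to keep.
        g = torch.sigmoid(self.gate(torch.cat([text_qd, text_id], dim=-1)))
        return g * text_qd + (1 - g) * text_id

# Toy usage: 64 text tokens and 36 diagram regions per sample, dim 768.
model = CrossGuidedAttention()
fused = model(torch.randn(2, 64, 768), torch.randn(2, 36, 768), torch.randn(2, 36, 768))
print(fused.shape)  # torch.Size([2, 64, 768])
```

In the paper the gating additionally serves the model ensemble; here it only balances the two attended streams.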
Related papers
- Enhancing Textbook Question Answering Task with Large Language Models
and Retrieval Augmented Generation [3.948068081583197]
This paper proposes a methodology that handles the out-of-domain scenario in Textbook Question Answering (TQA).
Through supervised fine-tuning of the LLM Llama-2 and the incorporation of RAG, our architecture outperforms the baseline, achieving a 4.12% accuracy improvement on the validation set and 9.84% on the test set for non-diagram multiple-choice questions.
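A minimal sketch of the retrieval-augmented prompting idea follows; the lexical-overlap retriever, the prompt template, and the toy lesson text are illustrative assumptions, while the actual system fine-tunes Llama-2 and uses its own retrieval stack.

```python
# Hedged sketch: retrieve the most relevant lesson paragraphs, then build a
# multiple-choice prompt around them. Retriever and template are assumptions.
def retrieve(question: str, paragraphs: list[str], k: int = 2) -> list[str]:
    q_tokens = set(question.lower().split())
    ranked = sorted(paragraphs,
                    key=lambda p: len(q_tokens & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(question: str, options: list[str], paragraphs: list[str]) -> str:
    context = "\n".join(retrieve(question, paragraphs))
    choices = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return f"Context:\n{context}\n\nQuestion: {question}\n{choices}\nAnswer:"

lesson = ["The mantle lies between the crust and the outer core.",
          "Photosynthesis converts light energy into chemical energy."]
print(build_prompt("Which layer lies beneath the crust?",
                   ["mantle", "outer core", "inner core"], lesson))
```

The resulting prompt string would be passed to the fine-tuned generator, which is omitted here.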
arXiv Detail & Related papers (2024-02-05T11:58:56Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities and can even solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework that understands the causal nexus of object semantics in images without relying on bounding boxes.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image
and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration while disentangling modality-specific modules to address modality entanglement.
It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
- Multimodal Inverse Cloze Task for Knowledge-based Visual Question
Answering [4.114444605090133]
We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities.
KVQAE is a recently introduced task that consists of answering questions about named entities grounded in a visual context using a Knowledge Base.
Our method is applicable to different neural network architectures and leads to a 9% relative-MRR and 15% relative-F1 gain for retrieval and reading comprehension.
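A minimal sketch of the inverse-cloze pairing with an in-batch contrastive loss is given below; the encoders that produce the embeddings (and the multimodal fusion of passage text with images) are omitted, so this only illustrates the training signal, not the paper's exact objective.

```python
# Hedged sketch of the inverse-cloze idea: a sentence sampled from a passage
# is the pseudo-query, the remaining passage is the positive document.
import torch
import torch.nn.functional as F

def ict_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temp: float = 0.05):
    # Row i of each tensor is a positive (sentence, remaining-passage) pair;
    # every other row in the batch serves as an in-batch negative.
    logits = F.normalize(query_emb, dim=-1) @ F.normalize(doc_emb, dim=-1).T / temp
    targets = torch.arange(logits.size(0))
    return F.cross_entropy(logits, targets)

loss = ict_loss(torch.randn(8, 256), torch.randn(8, 256))
```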
arXiv Detail & Related papers (2023-01-11T09:16:34Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified
Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show that the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
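A small sketch of how weakly-supervised triplets might be formed from image-text pairs, paired with a standard margin loss; the pairing rule and the loss are assumptions for illustration and do not reproduce the paper's triplet tasks.

```python
# Hedged sketch: each item's own caption is the positive, a caption from
# another item is the weak negative.
import random
import torch.nn.functional as F

def make_triplets(pairs):
    # pairs: list of (image_id, caption) tuples.
    triplets = []
    for i, (image_id, caption) in enumerate(pairs):
        negative = random.choice([c for j, (_, c) in enumerate(pairs) if j != i])
        triplets.append((image_id, caption, negative))
    return triplets

def triplet_loss(anchor, positive, negative, margin: float = 0.2):
    # anchor/positive/negative: (batch, dim) embeddings from the two encoders.
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

print(make_triplets([("img1", "red floral dress"), ("img2", "denim jacket")]))
```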
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
- Semantic Sentence Composition Reasoning for Multi-Hop Question Answering [1.773120658816994]
We present a semantic sentence composition reasoning approach for a multi-hop question answering task.
With the combination of factual sentences and multi-stage semantic retrieval, our approach can provide more comprehensive contextual information for model training and reasoning.
Experimental results demonstrate that our model can incorporate existing pre-trained language models and outperforms the existing SOTA method on the QASC task by about 9%.
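A toy sketch of multi-stage retrieval for multi-hop questions follows: after each hop the query is expanded with the retrieved fact so the next hop can reach sentences connected only through it. The overlap scorer and the two-hop setting are simplifications, not the paper's retriever.

```python
# Hedged sketch of multi-stage (iterative) sentence retrieval.
def multi_hop_retrieve(question: str, facts: list[str], hops: int = 2) -> list[str]:
    query_tokens = set(question.lower().split())
    chain, pool = [], list(facts)
    for _ in range(hops):
        best = max(pool, key=lambda f: len(query_tokens & set(f.lower().split())))
        chain.append(best)
        pool.remove(best)
        query_tokens |= set(best.lower().split())  # compose the next-hop query
    return chain

facts = ["Differential heating of air produces wind.",
         "Wind turbines convert wind into electricity.",
         "Coal is a fossil fuel."]
print(multi_hop_retrieve("What converts differential heating into electricity?", facts))
```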
arXiv Detail & Related papers (2022-03-01T00:35:51Z)
- Unifying Architectures, Tasks, and Modalities Through a Simple
Sequence-to-Sequence Learning Framework [83.82026345508334]
We propose OFA, a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (e.g., image generation, visual grounding, image captioning, image classification, text generation, etc.).
OFA achieves new state-of-the-arts on a series of multimodal tasks, including image captioning (COCO test CIDEr: 149.6), text-to-image generation (COCO test FID: 10.5), VQA (test-std encoder acc.: 80.02), SNLI-VE (test acc.: 90.
arXiv Detail & Related papers (2022-02-07T10:38:21Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to achieve learning better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- Document Modeling with Graph Attention Networks for Multi-grained
Machine Reading Comprehension [127.3341842928421]
Natural Questions is a new challenging machine reading comprehension benchmark.
It has answers at two granularities: a long answer (typically a paragraph) and a short answer (one or more entities inside the long answer).
Existing methods treat these two sub-tasks individually during training while ignoring their dependencies.
We present a novel multi-grained machine reading comprehension framework that models documents according to their hierarchical structure.
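As a rough illustration of the hierarchical document modeling described above, the sketch below builds a document/paragraph/sentence/token hierarchy of the kind a graph-attention reader could operate on; node features and the attention layers themselves are omitted, and this structure is an assumption rather than the paper's graph definition.

```python
# Hedged sketch: a hierarchical document graph tying the long-answer
# (paragraph) and short-answer (token span) granularities together.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                 # "doc" | "para" | "sent" | "tok"
    text: str
    children: list = field(default_factory=list)

def build_graph(paragraphs: list[list[str]]) -> Node:
    doc = Node("doc", "")
    for para in paragraphs:
        p = Node("para", " ".join(para))
        for sent in para:
            p.children.append(Node("sent", sent, [Node("tok", t) for t in sent.split()]))
        doc.children.append(p)
    return doc

doc = build_graph([["The mantle is hot.", "It flows slowly."],
                   ["The crust is thin."]])
print(len(doc.children), len(doc.children[0].children))  # 2 paragraphs, 2 sentences
```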
arXiv Detail & Related papers (2020-05-12T14:20:09Z)