Jaeger: A Concatenation-Based Multi-Transformer VQA Model
- URL: http://arxiv.org/abs/2310.07091v2
- Date: Thu, 19 Oct 2023 04:03:08 GMT
- Title: Jaeger: A Concatenation-Based Multi-Transformer VQA Model
- Authors: Jieting Long, Zewei Shi, Penghao Jiang, Yidong Gan
- Abstract summary: Document-based Visual Question Answering poses a challenging task at the intersection of linguistic sense disambiguation and fine-grained multimodal retrieval.
We propose Jaeger, a concatenation-based multi-transformer VQA model.
Our approach has the potential to amplify the performance of these pre-trained models through concatenation.
- Score: 0.13654846342364307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document-based Visual Question Answering poses a challenging task at the
intersection of linguistic sense disambiguation and fine-grained multimodal retrieval. Although
there has been encouraging progress in document-based question answering due to
the utilization of large language and open-world prior models [1], several
challenges persist, including prolonged response times, extended inference
durations, and imprecision in matching. In order to overcome these challenges,
we propose Jaeger, a concatenation-based multi-transformer VQA model. To derive
question features, we leverage the exceptional capabilities of RoBERTa-large [2]
and GPT2-XL [3] as feature extractors. Subsequently, we
subject the outputs from both models to a concatenation process. This operation
allows the model to consider information from diverse sources concurrently,
strengthening its representational capability. By leveraging pre-trained models
for feature extraction, our approach has the potential to amplify the
performance of these models through concatenation. After concatenation, we
apply dimensionality reduction to the output features, reducing the model's
computational cost and inference time. Empirical results demonstrate
that our proposed model achieves competitive performance on Task C of the
PDF-VQA Dataset.
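The pipeline described in the abstract reduces to three steps: extract question features with two pre-trained language models, concatenate them, and project the result to a lower dimension. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the mean pooling and the 512-dimensional projection are illustrative assumptions, since the abstract does not specify either choice.

```python
# Minimal sketch of the concatenation pipeline (not the authors' code).
# Assumptions: mean-pooled last hidden states as question features and a
# 512-dimensional output projection.
import torch
from transformers import AutoModel, AutoTokenizer

roberta_tok = AutoTokenizer.from_pretrained("roberta-large")
roberta = AutoModel.from_pretrained("roberta-large")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2-xl")
gpt2 = AutoModel.from_pretrained("gpt2-xl")

question = "What is the total shown in the table on page 3?"

with torch.no_grad():
    # Each extractor produces its own view of the question.
    r_hidden = roberta(**roberta_tok(question, return_tensors="pt")).last_hidden_state
    g_hidden = gpt2(**gpt2_tok(question, return_tensors="pt")).last_hidden_state
    r_feat = r_hidden.mean(dim=1)  # (1, 1024)
    g_feat = g_hidden.mean(dim=1)  # (1, 1600)

# Concatenation lets downstream layers see both sources at once ...
fused = torch.cat([r_feat, g_feat], dim=-1)  # (1, 2624)

# ... and a learned projection reduces dimensionality to cut compute
# and inference time for the rest of the model.
proj = torch.nn.Linear(fused.shape[-1], 512)
question_feature = proj(fused)  # (1, 512)
```

Concatenation preserves both representations intact, while the learned projection keeps the cost of the downstream layers independent of how many extractors are fused.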
Related papers
- LaSagnA: Language-based Segmentation Assistant for Complex Queries [39.620806493454616]
Large Language Models for Vision (vLLMs) generate detailed perceptual outcomes, including bounding boxes and masks.
In this study, we acknowledge that the main cause of these problems is the insufficient complexity of training queries.
We present three novel strategies to effectively handle the challenges arising from the direct integration of the proposed format.
arXiv Detail & Related papers (2024-04-12T14:40:45Z)
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
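As a rough sketch of the general pattern described above, inserting trainable cross-modal layers into frozen pre-trained encoders, one might write the following. This is an assumed generic adapter, not the actual DG-SCT module, and all dimensions are illustrative.

```python
# Assumed generic adapter (not the DG-SCT module): a trainable cross-modal
# attention layer placed around a frozen pre-trained encoder's features.
import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # One modality attends to the other; the residual connection keeps
        # the frozen backbone's features intact.
        attended, _ = self.attn(x, other, other)
        return self.norm(x + attended)

# Usage: freeze the backbone encoders and train only the adapters.
dim = 768
audio = torch.randn(2, 50, dim)    # (batch, audio tokens, feature dim)
visual = torch.randn(2, 196, dim)  # (batch, visual tokens, feature dim)
adapter = CrossModalAdapter(dim)
audio_guided = adapter(audio, visual)  # (2, 50, 768)
```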
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
- Adapting Pre-trained Generative Models for Extractive Question Answering [4.993041970406846]
We introduce a novel approach that uses the power of pre-trained generative models to address extractive QA tasks.
We demonstrate the superior performance of our proposed approach compared to existing state-of-the-art models.
arXiv Detail & Related papers (2023-11-06T09:01:02Z)
- Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering [49.85790367128085]
We pre-train a generic multi-document model with a novel cross-document question answering pre-training objective.
This novel multi-document QA formulation directs the model to better recover cross-text informational relations.
Unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation and long text generation.
arXiv Detail & Related papers (2023-05-24T17:48:40Z)
- Chain-of-Skills: A Configurable Model for Open-domain Question Answering [79.8644260578301]
The retrieval model is an indispensable component for real-world knowledge-intensive tasks.
Recent work focuses on customized methods, limiting the model transferability and scalability.
We propose a modular retriever where individual modules correspond to key skills that can be reused across datasets.
arXiv Detail & Related papers (2023-05-04T20:19:39Z)
- A Lightweight Constrained Generation Alternative for Query-focused Summarization [8.264410236351111]
Query-focused summarization (QFS) aims to provide a summary of a document that satisfies the information need of a given query.
We propose leveraging a recently developed constrained generation model, NeuroLogic Decoding (NLD), as an alternative to current QFS approaches.
We demonstrate the efficacy of this approach on two public QFS collections, achieving near parity with the state-of-the-art model at substantially reduced complexity.
arXiv Detail & Related papers (2023-04-23T18:43:48Z)
- Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks [54.306234256074255]
We identify the issue of tokenization inconsistency, which is commonly neglected when training generative models.
When the input and output are tokenized inconsistently, the extractive nature of these tasks is compromised.
We show that, with consistent tokenization, the model performs better on both in-domain and out-of-domain datasets (a minimal illustration follows below).
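As an assumed example of the inconsistency the paper targets (not the paper's code): with a BPE vocabulary such as GPT-2's, an extractive answer tokenized on its own need not match the tokens it occupies inside the context.

```python
# Assumed illustration of tokenization inconsistency with a BPE tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

context = "She ate an apple."
answer = "apple"  # an extractive answer span from the context

ids_alone = tok(answer, add_special_tokens=False).input_ids         # "apple"
ids_in_ctx = tok(" " + answer, add_special_tokens=False).input_ids  # " apple"

# For GPT-2's BPE these two ID sequences differ (the in-context form merges
# the leading space), so training on the standalone target no longer asks
# the model to copy the exact tokens it saw in the input.
print(ids_alone == ids_in_ctx)  # False
```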
arXiv Detail & Related papers (2022-12-19T23:33:21Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- Enhancing Multi-modal and Multi-hop Question Answering via Structured Knowledge and Unified Retrieval-Generation [33.56304858796142]
Multi-modal multi-hop question answering involves answering a question by reasoning over multiple input sources from different modalities.
Existing methods often retrieve evidence separately and then use a language model to generate an answer based on the retrieved evidence.
We propose a Structured Knowledge and Unified Retrieval-Generation (SKURG) approach to address these issues.
arXiv Detail & Related papers (2022-12-16T18:12:04Z)
- Long-Span Dependencies in Transformer-based Summarization Systems [38.672160430296536]
Transformer-based models have achieved state-of-the-art results in a wide range of natural language processing (NLP) tasks including document summarization.
One issue with these transformer-based models is that they do not scale well in terms of memory and compute requirements as the input length grows.
In this work, we exploit large pre-trained transformer-based models and address long-span dependencies in abstractive summarization.
arXiv Detail & Related papers (2021-05-08T23:53:03Z)
- Query Resolution for Conversational Search with Limited Supervision [63.131221660019776]
We propose QuReTeC (Query Resolution by Term Classification), a neural query resolution model based on bidirectional transformers.
We show that QuReTeC outperforms state-of-the-art models, and furthermore, that our distant supervision method can be used to substantially reduce the amount of human-curated data required to train QuReTeC.
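A hedged sketch of the term-classification idea (in the spirit of QuReTeC, not its released code): a bidirectional transformer tags which history terms are relevant, and those terms expand the current-turn query. The model name and labels below are illustrative, QuReTeC itself restricts classification to history terms, and a trained checkpoint would be needed for meaningful output.

```python
# Sketch of query resolution by term classification (assumed, simplified).
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# Label 1 = term is relevant to the current turn, 0 = not relevant.
# A freshly initialized head is random; load a trained checkpoint in practice.
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

history = "who wrote dracula"
current = "when was it published"
inputs = tok(history, current, return_tensors="pt")

with torch.no_grad():
    labels = model(**inputs).logits.argmax(dim=-1)[0]

tokens = tok.convert_ids_to_tokens(inputs.input_ids[0])
picked = [t for t, l in zip(tokens, labels)
          if l == 1 and t not in tok.all_special_tokens]
# Resolved query = current turn plus the selected history terms,
# e.g. ideally recovering "dracula" here.
resolved = current + " " + " ".join(picked)
```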
arXiv Detail & Related papers (2020-05-24T11:37:22Z)