BARTPhoBEiT: Pre-trained Sequence-to-Sequence and Image Transformers
Models for Vietnamese Visual Question Answering
- URL: http://arxiv.org/abs/2307.15335v1
- Date: Fri, 28 Jul 2023 06:23:32 GMT
- Title: BARTPhoBEiT: Pre-trained Sequence-to-Sequence and Image Transformers
Models for Vietnamese Visual Question Answering
- Authors: Khiem Vinh Tran and Kiet Van Nguyen and Ngan Luu Thuy Nguyen
- Abstract summary: Visual Question Answering (VQA) is an intricate and demanding task that integrates natural language processing (NLP) and computer vision (CV).
We introduce a transformer-based Vietnamese model named BARTPhoBEiT.
This model combines pre-trained Sequence-to-Sequence models with Bidirectional Encoder Representations from Image Transformers for Vietnamese, and is evaluated on Vietnamese VQA datasets.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Visual Question Answering (VQA) is an intricate and demanding task that
integrates natural language processing (NLP) and computer vision (CV),
capturing the interest of researchers. The English language, renowned for its
wealth of resources, has witnessed notable advancements in both datasets and
models designed for VQA. However, there is a lack of models that target
specific countries such as Vietnam. To address this limitation, we introduce a
transformer-based Vietnamese model named BARTPhoBEiT. This model combines
pre-trained Sequence-to-Sequence models with Bidirectional Encoder
Representations from Image Transformers for Vietnamese, and is evaluated on
Vietnamese VQA datasets.
Experimental results demonstrate that our proposed model outperforms the strong
baseline and improves the state-of-the-art in six metrics: Accuracy, Precision,
Recall, F1-score, WUPS 0.0, and WUPS 0.9.
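The WUPS metrics weight answer matches by Wu-Palmer similarity over WordNet, with low-similarity matches penalized below a threshold (0.0 or 0.9). Below is a minimal sketch for single-word answers, assuming NLTK's WordNet corpus is installed and using the common 0.1 down-weighting; it is an illustration of the metric, not the paper's evaluation code.

```python
# Minimal sketch of the WUPS metric for single-word answers.
# Assumes NLTK with the WordNet corpus installed (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def wup(a: str, b: str) -> float:
    """Best Wu-Palmer similarity over all synset pairs of two words."""
    syn_a, syn_b = wn.synsets(a), wn.synsets(b)
    scores = [sa.wup_similarity(sb) or 0.0 for sa in syn_a for sb in syn_b]
    return max(scores, default=1.0 if a == b else 0.0)

def wups(predictions, references, threshold: float) -> float:
    """WUPS@threshold: similarities below the threshold are down-weighted by 0.1."""
    total = 0.0
    for pred, ref in zip(predictions, references):
        s = wup(pred.lower(), ref.lower())
        if s < threshold:
            s *= 0.1
        total += s
    return 100.0 * total / len(predictions)

preds = ["dog", "red", "two"]
refs = ["puppy", "red", "three"]
print(wups(preds, refs, 0.0), wups(preds, refs, 0.9))
```

WUPS 0.0 accepts any WordNet-related answer, while WUPS 0.9 is close to exact-match accuracy, which is why both are commonly reported together.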
Related papers
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution [82.38677987249348]
We present the Qwen2-VL Series, which redefines the conventional predetermined-resolution approach in visual processing.
Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens.
The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos.
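As a rough illustration of how a dynamic-resolution scheme maps image size to a visual token budget, the sketch below counts patch tokens for a few resolutions. The 14-pixel patch size and 2x2 token merge are assumptions for this sketch, not necessarily Qwen2-VL's exact configuration.

```python
import math

# Rough illustration: the number of visual tokens grows with image resolution.
# patch_size=14 and a 2x2 spatial merge are assumptions for this sketch.
def visual_token_count(height: int, width: int, patch_size: int = 14, merge: int = 2) -> int:
    patches_h = math.ceil(height / patch_size)
    patches_w = math.ceil(width / patch_size)
    return math.ceil(patches_h / merge) * math.ceil(patches_w / merge)

for h, w in [(224, 224), (448, 672), (1024, 768)]:
    print(f"{h}x{w}: {visual_token_count(h, w)} tokens")
```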
arXiv Detail & Related papers (2024-09-18T17:59:32Z) - Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese [0.0]
Vintern-1B is a reliable multimodal large language model (MLLM) for Vietnamese language tasks.
The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs.
Vintern-1B is small enough to fit into various on-device applications easily.
arXiv Detail & Related papers (2024-08-22T15:15:51Z) - Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration [0.40964539027092917]
This study aims to bridge the gap by conducting experiments on the Vietnamese Visual Question Answering dataset.
We have developed a model that enhances image representation capabilities, thereby improving overall performance in the ViVQA system.
Our experimental findings demonstrate that our model surpasses competing baselines, achieving promising performance.
arXiv Detail & Related papers (2024-07-30T22:32:50Z) - ViHateT5: Enhancing Hate Speech Detection in Vietnamese With A Unified Text-to-Text Transformer Model [0.0]
We introduce ViHateT5, a T5-based model pre-trained on our proposed large-scale domain-specific dataset named VOZ-HSD.
By harnessing the power of a text-to-text architecture, ViHateT5 can tackle multiple tasks using a unified model and achieve state-of-the-art performance across all standard HSD benchmarks in Vietnamese.
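The unified text-to-text setup can be pictured as task-prefixed input/target pairs handled by one model; the prefixes, example text, and labels below are hypothetical placeholders rather than ViHateT5's actual prompt format.

```python
# Hypothetical illustration of casting several hate-speech tasks as
# text-to-text pairs for a single T5-style model. Task prefixes and
# labels are placeholders, not ViHateT5's actual format.
def to_text2text(task: str, text: str, label: str) -> dict:
    return {"input": f"{task}: {text}", "target": label}

examples = [
    to_text2text("hate-speech-detection", "binh luan vi du", "HATE"),
    to_text2text("toxic-span-detection", "binh luan vi du", "vi du"),
    to_text2text("hate-target-classification", "binh luan vi du", "INDIVIDUAL"),
]
for ex in examples:
    print(ex)
```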
arXiv Detail & Related papers (2024-05-23T03:31:50Z) - Zero-shot Translation of Attention Patterns in VQA Models to Natural Language [65.94419474119162]
ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
Our framework does not require any training and allows the drop-in replacement of different guiding sources.
arXiv Detail & Related papers (2023-11-08T22:18:53Z) - Generative Pre-trained Transformer for Vietnamese Community-based COVID-19 Question Answering [0.0]
Generative Pre-trained Transformer (GPT) has been effectively employed as a decoder within state-of-the-art (SOTA) question answering systems.
This paper presents an implementation of GPT-2 for community-based question answering specifically focused on COVID-19 related queries in Vietnamese.
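A decoder-only QA system of this kind conditions generation on the question text. The minimal Hugging Face sketch below uses the generic "gpt2" checkpoint as a stand-in (a Vietnamese GPT-2 would be needed in practice) and an illustrative prompt format, not the paper's setup.

```python
# Minimal sketch of decoder-only question answering with GPT-2.
# "gpt2" is a stand-in checkpoint and the prompt format is illustrative.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Question: What are common COVID-19 symptoms?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```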
arXiv Detail & Related papers (2023-10-23T06:14:07Z) - Visual Question Generation in Bengali [0.0]
We develop a novel transformer-based encoder-decoder architecture that generates questions in Bengali when given an image.
We establish the first state of the art models for Visual Question Generation task in Bengali.
Our results show that our image-cat model achieves a BLEU-1 score of 33.12 and a BLEU-3 score of 7.56.
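BLEU-1 and BLEU-3 measure unigram and up-to-trigram overlap between generated and reference questions. A minimal NLTK sketch with dummy tokens (not the paper's evaluation pipeline):

```python
# Minimal sketch of corpus-level BLEU-1 and BLEU-3 with NLTK (dummy data).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["what", "is", "the", "boy", "holding"]]]  # one reference list per hypothesis
hypotheses = [["what", "is", "the", "boy", "doing"]]

smooth = SmoothingFunction().method1
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0), smoothing_function=smooth)
bleu3 = corpus_bleu(references, hypotheses, weights=(1/3, 1/3, 1/3, 0), smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-3: {bleu3:.3f}")
```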
arXiv Detail & Related papers (2023-10-12T10:26:26Z) - An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z) - Vector-quantized Image Modeling with Improved VQGAN [93.8443646643864]
We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256x256 resolution, we achieve an Inception Score (IS) of 175.1 and a Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
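At the core of vector-quantized image modeling is a nearest-codebook lookup that turns continuous patch features into discrete image tokens, which the Transformer then predicts autoregressively. The NumPy sketch below uses random data and illustrative sizes, not VQGAN's actual configuration.

```python
import numpy as np

# Minimal sketch of the vector-quantization step: map each continuous patch
# feature to the index of its nearest codebook entry (the discrete image token).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 256))     # 1024 codes, 256-dim embeddings (illustrative)
features = rng.normal(size=(16 * 16, 256))  # a 16x16 grid of patch features

# Squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2.
dists = (
    (features ** 2).sum(axis=1, keepdims=True)
    - 2.0 * features @ codebook.T
    + (codebook ** 2).sum(axis=1)
)
token_ids = dists.argmin(axis=1)   # shape (256,): the image token sequence
quantized = codebook[token_ids]    # nearest-code embeddings fed to the decoder

print(token_ids[:10], quantized.shape)
```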
arXiv Detail & Related papers (2021-10-09T18:36:00Z) - Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering [59.20766562530209]
VQA models still tend to capture superficial linguistic correlations in the training set.
Recent VQA works introduce an auxiliary question-only model to regularize the training of targeted VQA models.
We propose a novel model-agnostic Counterfactual Samples Synthesizing and Training (CSST) strategy.
arXiv Detail & Related papers (2021-10-03T14:31:46Z) - Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
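The question-side counterfactuals in CSS can be pictured as masking the words a model relies on most, so that the original answer no longer applies. The toy sketch below uses made-up critical-word indices, whereas CSS derives them from model attributions.

```python
# Toy sketch of question-side counterfactual synthesis in the spirit of CSS:
# mask the critical words of a question so the original answer no longer applies.
# The critical indices here are made up for illustration.
MASK = "[MASK]"

def mask_critical_words(question: str, critical_indices: set[int]) -> str:
    tokens = question.split()
    masked = [MASK if i in critical_indices else tok for i, tok in enumerate(tokens)]
    return " ".join(masked)

question = "what color is the umbrella"
counterfactual = mask_critical_words(question, critical_indices={1, 4})
print(counterfactual)  # "what [MASK] is the [MASK]"
```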
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.