ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese
- URL: http://arxiv.org/abs/2310.18046v1
- Date: Fri, 27 Oct 2023 10:44:50 GMT
- Title: ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese
- Authors: Khiem Vinh Tran, Hao Phu Phan, Kiet Van Nguyen, Ngan Luu Thuy Nguyen
- Abstract summary: We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on questions.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In recent years, Visual Question Answering (VQA) has gained significant
attention for its diverse applications, including intelligent car assistance,
aiding visually impaired individuals, and document image information retrieval
using natural language queries. VQA requires effective integration of
information from questions and images to generate accurate answers. Neural
models for VQA have made remarkable progress on large-scale datasets, with a
primary focus on resource-rich languages like English. To address this gap, we
introduce the ViCLEVR dataset, a pioneering collection for evaluating various
visual reasoning capabilities in Vietnamese while mitigating biases. The
dataset comprises over 26,000 images and 30,000 question-answer pairs (QAs),
each question annotated to specify the type of reasoning involved. Leveraging
this dataset, we conduct a comprehensive analysis of contemporary visual
reasoning systems, offering valuable insights into their strengths and
limitations. Furthermore, we present PhoVIT, a comprehensive multimodal fusion
model that identifies objects in images based on questions. The architecture
effectively employs transformers to enable simultaneous reasoning over textual
and visual data, merging both modalities at an early model stage. The
experimental findings demonstrate that our proposed model achieves
state-of-the-art performance across four evaluation metrics. The accompanying
code and dataset have been made publicly accessible at
https://github.com/kvt0012/ViCLEVR. This release is intended to stimulate
advances within the research community, fostering the development of
multimodal fusion algorithms specifically tailored to the nuances of
low-resource languages such as Vietnamese.
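The early-fusion design described in the abstract (projecting question tokens and image features into one shared sequence and reasoning over them jointly with a transformer encoder) can be illustrated with a short, hypothetical PyTorch sketch. The class name, dimensions, modality-type embeddings, and the mean-pooled answer-classification head below are assumptions made for illustration only; they are not taken from the PhoVIT implementation, which is available in the linked repository.

```python
# Minimal early-fusion sketch; illustrative only, not the actual PhoVIT code.
# Assumptions: the question is already tokenized into ids, the image is given
# as pre-extracted region/patch features, and answers come from a fixed
# vocabulary. Positional encodings are omitted for brevity.
import torch
import torch.nn as nn


class EarlyFusionVQA(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4,
                 text_vocab=10_000, n_answers=1_000, img_feat_dim=2048):
        super().__init__()
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.img_proj = nn.Linear(img_feat_dim, d_model)  # project image features
        self.type_embed = nn.Embedding(2, d_model)        # 0 = text, 1 = image
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.answer_head = nn.Linear(d_model, n_answers)

    def forward(self, question_ids, image_feats):
        # question_ids: (B, Lq) token ids; image_feats: (B, Lv, img_feat_dim)
        txt = self.text_embed(question_ids) + self.type_embed(
            torch.zeros_like(question_ids))
        img = self.img_proj(image_feats) + self.type_embed(
            torch.ones(image_feats.shape[:2], dtype=torch.long,
                       device=image_feats.device))
        fused = torch.cat([txt, img], dim=1)   # early fusion: one joint sequence
        encoded = self.encoder(fused)          # self-attention across both modalities
        return self.answer_head(encoded.mean(dim=1))  # answer logits


# Toy usage with random inputs: batch of 2, 12-token questions, 36 image regions.
model = EarlyFusionVQA()
logits = model(torch.randint(0, 10_000, (2, 12)), torch.randn(2, 36, 2048))
print(logits.shape)  # torch.Size([2, 1000])
```

In this early-fusion setup the transformer attends across text and image tokens from the first layer onward, in contrast to late-fusion designs that encode each modality separately and merge them only at the answer head.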
Related papers
- Knowledge-Aware Reasoning over Multimodal Semi-structured Tables
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data.
We introduce MMTabQA, a new dataset designed for this purpose.
Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
(arXiv, 2024-08-25)
- Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration
This study aims to bridge the gap in Vietnamese visual question answering research by conducting experiments on the Vietnamese Visual Question Answering (ViVQA) dataset.
We have developed a model that enhances image representation capabilities, thereby improving overall performance in the ViVQA system.
Our experimental findings demonstrate that our model surpasses competing baselines, achieving promising performance.
(arXiv, 2024-07-30)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
(arXiv, 2024-07-06)
- Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models
Knowledge-Based Visual Question Answering (KBVQA) extends standard VQA by supplying external knowledge alongside images to answer questions.
Our main contribution involves enhancing questions by incorporating relevant external knowledge extracted from knowledge graphs, using a dynamic triple extraction method.
Our model, enriched with knowledge, demonstrates an average improvement of 4.75% in Exact Match Score over the state-of-the-art on three different KBVQA datasets.
(arXiv, 2024-06-14)
- Towards Vision-Language Geo-Foundation Model: A Survey
Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks.
This paper thoroughly reviews Vision-Language Geo-Foundation Models (VLGFMs), summarizing and analyzing recent developments in the field.
(arXiv, 2024-06-13)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
(arXiv, 2023-10-17)
- JourneyDB: A Benchmark for Generative Image Understanding
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
(arXiv, 2023-07-03)
- OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese
We introduce the OpenViVQA dataset, the first large-scale dataset for visual question answering in Vietnamese.
The dataset consists of 11,000+ images associated with 37,000+ question-answer pairs (QAs).
Our proposed methods achieve results that are competitive with SOTA models such as SAAA, MCAN, LoRRA, and M4C.
(arXiv, 2023-05-07)
- Towards Complex Document Understanding By Discrete Reasoning
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account information from multiple modalities, including text, layout, and visual content, to address different types of questions.
(arXiv, 2022-07-25)
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
(arXiv, 2022-06-03)
This list is automatically generated from the titles and abstracts of the papers in this site.