OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual
Question Answering in Vietnamese
- URL: http://arxiv.org/abs/2305.04183v1
- Date: Sun, 7 May 2023 03:59:31 GMT
- Title: OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual
Question Answering in Vietnamese
- Authors: Nghia Hieu Nguyen, Duong T.D. Vo, Kiet Van Nguyen, Ngan Luu-Thuy
Nguyen
- Abstract summary: We introduce the OpenViVQA dataset, the first large-scale dataset for visual question answering in Vietnamese.
The dataset consists of 11,000+ images associated with 37,000+ question-answer pairs (QAs).
Our proposed methods achieve results that are competitive with SOTA models such as SAAA, MCAN, LORA, and M4C.
- Score: 2.7528170226206443
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In recent years, visual question answering (VQA) has attracted
attention from the research community because of its promising applications
(such as virtual assistants in intelligent cars, assistive devices for blind
people, or information retrieval from document images using natural language
queries) and its inherent challenges. The VQA task requires methods that can
fuse information from questions and images to produce appropriate answers.
Neural VQA models have achieved tremendous progress on large-scale datasets,
which are mostly built for resource-rich languages such as English. However,
the available datasets narrow the VQA task down to answer selection or answer
classification. We argue that this form of VQA falls far short of human
ability and removes the challenge of the answering aspect of the task, since
answers are merely selected rather than generated. In this paper, we
introduce the OpenViVQA (Open-domain Vietnamese Visual Question Answering)
dataset, the first large-scale dataset for VQA with open-ended answers in
Vietnamese, consisting of 11,000+ images associated with 37,000+
question-answer pairs (QAs). Moreover, we propose FST, QuMLAG, and MLPAG,
which fuse information from images and answers and then use the fused
features to construct answers iteratively, as humans do (a generic
fuse-then-generate sketch follows this abstract). Our proposed methods
achieve results competitive with SOTA models such as SAAA, MCAN, LORA, and
M4C. The dataset is released to encourage the research community to develop
more generalized algorithms, including transformers, for low-resource
languages such as Vietnamese.
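The paper's FST, QuMLAG, and MLPAG architectures are not detailed in this abstract, so the sketch below only illustrates the generic fuse-then-generate pattern it describes: cross-attending question tokens to image region features, then decoding an open-ended answer token by token. The class name, module choices, and dimensions are all assumptions, not the paper's actual design.

```python
# Minimal fuse-then-generate VQA sketch (illustrative only; NOT the
# paper's FST/QuMLAG/MLPAG). Module names and dimensions are assumed.
import torch
import torch.nn as nn

class FuseAndGenerate(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(2048, d_model)  # project detector region features
        self.fusion = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, question_ids, region_feats, answer_prefix_ids):
        q = self.token_emb(question_ids)         # (B, Lq, D) question tokens
        v = self.img_proj(region_feats)          # (B, R, D)  image regions
        # Fusion step: question tokens attend to image regions.
        fused, _ = self.fusion(q, v, v)          # (B, Lq, D)
        ctx = fused.mean(dim=1, keepdim=True)    # pooled multimodal context
        # Generation step: decode the open-ended answer conditioned on
        # the fused context and the answer tokens produced so far.
        a = self.token_emb(answer_prefix_ids) + ctx
        h, _ = self.decoder(a)
        return self.out(h)                       # next-token logits

model = FuseAndGenerate()
q = torch.randint(0, 10000, (1, 12))   # toy question token ids
v = torch.randn(1, 36, 2048)           # toy region features
ans = torch.tensor([[1]])              # <bos>
for _ in range(5):                     # greedy, token-by-token decoding
    logits = model(q, v, ans)
    next_id = logits[:, -1].argmax(-1, keepdim=True)
    ans = torch.cat([ans, next_id], dim=1)
```

The greedy loop is what distinguishes this open-ended formulation from answer classification: each generated token is fed back as context for the next, rather than picking one answer from a fixed set.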
Related papers
- Language Guided Visual Question Answering: Elevate Your Multimodal
Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately (a toy prompt-construction sketch follows this entry).
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
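LG-VQA's actual prompting pipeline is not reproduced here; the toy function below only shows what "language guidance" can look like in practice, folding a caption, scene-graph triples, and rationales into a knowledge-enriched prompt. The function name and all strings are invented.

```python
# Toy "language guidance" prompt builder: enrich a VQA question with a
# caption, scene-graph triples, and rationales before querying a language
# model. All guidance text is an invented placeholder.
def build_lg_prompt(question, caption, triples, rationales):
    graph = "; ".join(f"{s} {p} {o}" for s, p, o in triples)
    hints = " ".join(rationales)
    return (f"Caption: {caption}\nScene graph: {graph}\n"
            f"Hints: {hints}\nQuestion: {question}\nAnswer:")

prompt = build_lg_prompt(
    question="What is the man about to do?",
    caption="A man stands beside a bicycle at a crosswalk.",
    triples=[("man", "next to", "bicycle"), ("man", "facing", "crosswalk")],
    rationales=["People near bicycles at crosswalks often intend to cross."],
)
print(prompt)
```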
- ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model
for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on the question.
arXiv Detail & Related papers (2023-10-27T10:44:50Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations of either the image or the question (a generic perturbation sketch follows this entry).
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
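The summary above does not specify UNK-VQA's concrete perturbation recipes, so the sketch below uses two generic stand-ins: blurring the image so the queried detail disappears, and swapping a content word in the question so it no longer matches the image. Function names and word lists are invented.

```python
# Generic image/question perturbations in the spirit of UNK-VQA
# (the paper's actual recipes may differ). Requires Pillow.
import random
from PIL import Image, ImageFilter

def perturb_image(path, radius=8):
    """Blur the image heavily so the queried detail is no longer visible."""
    return Image.open(path).filter(ImageFilter.GaussianBlur(radius))

def perturb_question(question, distractors=("giraffe", "umbrella", "violin")):
    """Swap the last content word for an unrelated noun, making the
    question unanswerable from the image."""
    words = question.rstrip("?").split()
    words[-1] = random.choice(distractors)
    return " ".join(words) + "?"

print(perturb_question("What color is the bus?"))
# e.g. "What color is the violin?" -- a model should abstain here.
```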
- NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for
Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and manually design question templates (an invented-template sketch follows this entry).
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
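As a minimal illustration of pairing detection annotations with hand-written question templates: the actual NuScenes-QA templates and annotation schema are not given in the summary above, so both are invented here.

```python
# Template-based QA generation from detection annotations, in the spirit
# of NuScenes-QA. Annotations and templates are invented toy examples.
annotations = [  # toy 3D detection results for one scene
    {"category": "car", "status": "moving"},
    {"category": "car", "status": "parked"},
    {"category": "pedestrian", "status": "moving"},
]

templates = [
    ("How many {cat}s are there?",
     lambda cat: sum(a["category"] == cat for a in annotations)),
    ("Are there any moving {cat}s?",
     lambda cat: "yes" if any(a["category"] == cat and a["status"] == "moving"
                              for a in annotations) else "no"),
]

# Instantiate every template for every detected category.
for cat in {a["category"] for a in annotations}:
    for text, answer_fn in templates:
        print(text.format(cat=cat), "->", answer_fn(cat))
```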
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account information from multiple modalities, including text, layout, and visual content, to address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z)
- Medical Visual Question Answering: A Survey [55.53205317089564]
Medical Visual Question Answering (VQA) is a combination of medical artificial intelligence and popular VQA challenges.
Given a medical image and a clinically relevant question in natural language, the medical VQA system is expected to predict a plausible and convincing answer.
arXiv Detail & Related papers (2021-11-19T05:55:15Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- A survey on VQA_Datasets and Approaches [0.0]
Visual question answering (VQA) is a task that combines the techniques of computer vision and natural language processing.
This paper will review and analyze existing datasets, metrics, and models proposed for the VQA task.
arXiv Detail & Related papers (2021-05-02T08:50:30Z)
- Generating Natural Questions from Images for Multimodal Assistants [4.930442416763205]
We present an approach for generating diverse and meaningful questions that considers both image content and image metadata.
We evaluate our approach using standard evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr (a runnable BLEU example follows this entry).
arXiv Detail & Related papers (2020-11-17T19:12:23Z)
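BLEU, METEOR, ROUGE, and CIDEr all score a generated sentence against one or more references. As one concrete example, sentence-level BLEU can be computed with NLTK; the candidate and reference strings below are invented.

```python
# Sentence-level BLEU for a generated question against references, using
# NLTK. METEOR, ROUGE, and CIDEr follow the same compare-to-references
# pattern with different matching rules and weighting.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "what breed is the dog on the sofa".split(),
    "what kind of dog is sitting on the couch".split(),
]
candidate = "what breed of dog is on the couch".split()

smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU = {score:.3f}")
```

Smoothing matters here because short questions often have no matching 4-grams, which would otherwise drive the geometric mean of n-gram precisions to zero.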