Related papers: BERT-VQA: Visual Question Answering on Plots

URL: http://arxiv.org/abs/2508.13184v1
Date: Thu, 14 Aug 2025 00:55:18 GMT
Title: BERT-VQA: Visual Question Answering on Plots
Authors: Tai Vu, Robert Yang
Abstract summary: We develop BERT-VQA, a VisualBERT-based model architecture with a pretrained ResNet 101 image encoder, along with a potential addition of joint fusion. We trained and evaluated this model against a baseline that consisted of an LSTM, a CNN, and a shallow classifier. The final outcome disproved our core hypothesis that the cross-modality module in VisualBERT is essential in aligning plot components with question phrases.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual question answering has been an exciting challenge in the field of natural language understanding, as it requires deep learning models to exchange information from both vision and language domains. In this project, we aim to tackle a subtask of this problem, namely visual question answering on plots. To achieve this, we developed BERT-VQA, a VisualBERT-based model architecture with a pretrained ResNet 101 image encoder, along with a potential addition of joint fusion. We trained and evaluated this model against a baseline that consisted of an LSTM, a CNN, and a shallow classifier. The final outcome disproved our core hypothesis that the cross-modality module in VisualBERT is essential in aligning plot components with question phrases. Therefore, our work provided valuable insights into the difficulty of the plot question answering challenge as well as the appropriateness of different model architectures in solving this problem.

Related papers

Multimodal Commonsense Knowledge Distillation for Visual Question Answering [12.002744625599425]
We propose a novel graph-based multimodal commonsense knowledge distillation framework that constructs a unified graph over commonsense knowledge, visual objects and questions through a Graph Convolutional Network (GCN) following a teacher-student environment. This proposed framework is flexible with any type of teacher and student models without further fine-tuning, and has achieved competitive performances on the ScienceQA dataset.
arXiv Detail & Related papers (2024-11-05T01:37:16Z)
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA. We first augment the existing data via deliberate perturbations on either the image or question. We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
Dealing with Missing Modalities in the Visual Question Answer-Difference Prediction Task through Knowledge Distillation [75.1682163844354]
We address the issues of missing modalities that have arisen from the Visual Question Answer-Difference prediction task. We introduce a model, the "Big" Teacher, that takes the image/question/answer triplet as its input and outperforms the baseline.
arXiv Detail & Related papers (2021-04-13T06:41:11Z)
VisBERT: Hidden-State Visualizations for Transformers [66.86452388524886]
We present VisBERT, a tool for visualizing the contextual token representations within BERT for the task of (multi-hop) Question Answering. VisBERT enables users to get insights about the model's internal state and to explore its inference steps or potential shortcomings.
arXiv Detail & Related papers (2020-11-09T15:37:43Z)
Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view. It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer. We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models [61.480085460269514]
We propose a framework for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models. We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator.
arXiv Detail & Related papers (2020-09-01T23:45:42Z)
Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image. In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views. We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module. We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
arXiv Detail & Related papers (2020-08-31T23:25:01Z)
FAT ALBERT: Finding Answers in Large Texts using Semantic Similarity Attention Layer based on BERT [0.5772546394254112]
We develop a model based on BERT, a state-of-the-art transformer network. We are ranked first on the leaderboard with a test accuracy of 87.79%.
arXiv Detail & Related papers (2020-08-22T08:04:21Z)
Eliminating Catastrophic Interference with Biased Competition [0.0]
We present a model that takes advantage of the multi-task nature of complex datasets by learning to separate tasks and subtasks in an end-to-end manner by biasing competitive interactions in the network. We demonstrate that this model eliminates catastrophic interference between tasks on a newly created dataset and provides competitive results in the Visual Question Answering space.
arXiv Detail & Related papers (2020-07-03T16:15:15Z)
Robust Question Answering Through Sub-part Alignment [53.94003466761305]
We model question answering as an alignment problem. We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets.
arXiv Detail & Related papers (2020-04-30T09:10:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.