Regularizing Attention Networks for Anomaly Detection in Visual Question Answering
- URL: http://arxiv.org/abs/2009.10054v3
- Date: Mon, 11 Apr 2022 16:33:14 GMT
- Title: Regularizing Attention Networks for Anomaly Detection in Visual Question Answering
- Authors: Doyup Lee, Yeongjae Cheon, Wook-Shin Han
- Abstract summary: We evaluate the robustness of state-of-the-art VQA models to five different anomalies.
We propose an attention-based method that uses the confidence of reasoning between input images and questions.
We show that a maximum entropy regularization of attention networks can significantly improve the attention-based anomaly detection.
- Score: 10.971443035470488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For stability and reliability of real-world applications, the robustness of
DNNs in unimodal tasks has been evaluated. However, few studies consider
abnormal situations that a visual question answering (VQA) model might
encounter at test time after deployment in the real world. In this study, we
evaluate the robustness of state-of-the-art VQA models to five different
anomalies, including worst-case scenarios, the most frequent scenarios, and the
current limitation of VQA models. Unlike in unimodal tasks, the maximum
confidence of the answers in VQA models cannot detect anomalous inputs, and
post-training of the outputs, such as outlier exposure, is ineffective for
VQA models. Thus, we propose an attention-based method that uses the
confidence of reasoning between input images and questions and shows much
more promising results than previous methods from unimodal tasks. In
addition, we show that
a maximum entropy regularization of attention networks can significantly
improve the attention-based anomaly detection of the VQA models. Thanks to
their simplicity, the attention-based anomaly detection and the
regularization are model-agnostic and can be applied to the various
cross-modal attention mechanisms in state-of-the-art VQA models. The results
imply that cross-modal attention in VQA is important for improving not only
VQA accuracy, but also robustness to various anomalies.
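The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of the two ideas it names: scoring anomalies from the sharpness of a cross-modal attention map, and adding a maximum-entropy term on that attention to the training loss. The model vqa_model, the returned attention shape, and the weight lam are illustrative assumptions, not the authors' code.

import torch
import torch.nn.functional as F

def attention_entropy(attn_weights, eps=1e-12):
    # attn_weights: (batch, num_regions), each row a normalized cross-modal
    # attention distribution over image regions for one image-question pair.
    p = attn_weights.clamp_min(eps)
    return -(p * p.log()).sum(dim=-1)          # shape: (batch,)

def attention_anomaly_score(attn_weights):
    # Confidence-of-reasoning score: a sharply peaked attention map suggests
    # confident image-question reasoning, while a near-uniform map suggests
    # an anomalous pair. Negate the peak so that higher score = more anomalous.
    return -attn_weights.max(dim=-1).values

def max_entropy_regularizer(attn_weights):
    # Maximum-entropy regularization: returns the negative mean attention
    # entropy, so adding it to the task loss pushes attention toward higher
    # entropy during training.
    return -attention_entropy(attn_weights).mean()

# Hypothetical training step (vqa_model, images, questions, answers, and lam
# are assumptions for illustration):
# logits, attn = vqa_model(images, questions)   # attn: (batch, num_regions)
# loss = F.cross_entropy(logits, answers) + lam * max_entropy_regularizer(attn)

In this sketch, a near-uniform attention map (low confidence of reasoning) yields a high anomaly score, reflecting the abstract's claim that cross-modal attention, rather than the maximum answer confidence, carries the useful anomaly signal.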
Related papers
- Visual Robustness Benchmark for Visual Question Answering (VQA) [0.08246494848934446]
We propose the first large-scale benchmark comprising 213,000 augmented images.
We challenge the visual robustness of multiple VQA models and assess the strength of realistic visual corruptions.
arXiv Detail & Related papers (2024-07-03T08:35:03Z)
- Open-Vocabulary Video Anomaly Detection [57.552523669351636]
Video anomaly detection (VAD) with weak supervision has achieved remarkable performance in utilizing video-level labels to discriminate whether a video frame is normal or abnormal.
Recent studies attempt to tackle a more realistic setting, open-set VAD, which aims to detect unseen anomalies given seen anomalies and normal videos.
This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD), in which we aim to leverage pre-trained large models to detect and categorize seen and unseen anomalies.
arXiv Detail & Related papers (2023-11-13T02:54:17Z)
- Exploring Question Decomposition for Zero-Shot VQA [99.32466439254821]
We investigate a question decomposition strategy for visual question answering.
We show that naive application of model-written decompositions can hurt performance.
We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors.
arXiv Detail & Related papers (2023-10-25T23:23:57Z)
- Improving Visual Question Answering Models through Robustness Analysis and In-Context Learning with a Chain of Basic Questions [70.70725223310401]
This work proposes a new method that utilizes semantically related questions, referred to as basic questions, acting as noise to evaluate the robustness of VQA models.
The experimental results demonstrate that the proposed evaluation method effectively analyzes the robustness of VQA models.
arXiv Detail & Related papers (2023-04-06T15:32:35Z)
- Knowledge-Based Counterfactual Queries for Visual Question Answering [0.0]
We propose a systematic method for explaining the behavior and investigating the robustness of VQA models through counterfactual perturbations.
For this reason, we exploit structured knowledge bases to perform deterministic, optimal and controllable word-level replacements targeting the linguistic modality.
We then evaluate the model's response against such counterfactual inputs.
arXiv Detail & Related papers (2023-03-05T08:00:30Z)
- Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly [100.60560477391732]
We promote a problem formulation for reliable visual question answering (VQA)
We analyze both their coverage, the portion of questions answered, and risk, the error on that portion.
We find that although the best performing models achieve over 71% accuracy on the VQA v2 dataset, introducing the option to abstain limits them to answering less than 8% of the questions to achieve a low risk of error (i.e., 1%).
This motivates us to utilize a multimodal selection function to directly estimate the correctness of the predicted answers, which we show can triple the coverage from, for example, 5.0% to 16.7% at 1% risk.
arXiv Detail & Related papers (2022-04-28T16:51:27Z)
- COIN: Counterfactual Image Generation for VQA Interpretation [5.994412766684842]
We introduce an interpretability approach for VQA models by generating counterfactual images.
In addition to interpreting the results of VQA models on single images, the obtained results and the discussion provide an extensive explanation of VQA models' behaviour.
arXiv Detail & Related papers (2022-01-10T13:51:35Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- Learning from Lexical Perturbations for Consistent Visual Question Answering [78.21912474223926]
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations.
We propose a novel approach to address this issue based on modular networks, which creates two questions related by linguistic perturbations.
We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
arXiv Detail & Related papers (2020-11-26T17:38:03Z)
- Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models [39.338304913058685]
We study the trade-off between the model complexity and the performance on the Visual Question Answering task.
We focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline.
arXiv Detail & Related papers (2020-01-20T11:27:21Z)