Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering
- URL: http://arxiv.org/abs/2502.04469v2
- Date: Sun, 27 Jul 2025 07:10:01 GMT
- Title: Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering
- Authors: Imad Eddine Marouf, Enzo Tartaglione, Stephane Lathuiliere, Joost van de Weijer
- Abstract summary: Continual Learning in Visual Question Answering (VQACL) requires models to acquire new visual-linguistic skills (plasticity) while preserving previously learned knowledge (stability). Existing methods, primarily designed for unimodal settings, often fall short in addressing this dual requirement. We present QUestion-only replay with Attention Distillation (QUAD), a novel approach for VQACL that leverages only past task questions for regularization.
- Score: 17.369734751262126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continual Learning in Visual Question Answering (VQACL) requires models to acquire new visual-linguistic skills (plasticity) while preserving previously learned knowledge (stability). The inherent multimodality of VQACL exacerbates this challenge, as models must balance stability across visual and textual domains while adapting to novel objects and reasoning tasks. Existing methods, primarily designed for unimodal settings, often fall short in addressing this dual requirement. In this work, we present QUestion-only replay with Attention Distillation (QUAD), a novel approach for VQACL that leverages only past task questions for regularization. By eliminating the need to store visual data, QUAD not only reduces memory overhead, but also alleviates privacy concerns. Our method introduces a Question-only Replay mechanism that selectively reuses prior task questions to counteract overfitting to the answer space of the current task, addressing the out-of-answer-set problem. Complementing this, we propose Attention Consistency Distillation to enforce both intra-modal and inter-modal attention consistency across tasks, preserving essential visual-linguistic associations. Extensive experiments on VQAv2 and NExT-QA demonstrate that QUAD significantly outperforms state-of-the-art methods, achieving robust performance in continual VQA. Code is available at: https://github.com/IemProg/QUAD.
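Neither the abstract nor the summary gives implementation details, but the two components can be pictured with a short sketch. The snippet below is a minimal, hypothetical PyTorch-style illustration, not the authors' released code: it assumes `model` and `old_model` return answer logits together with attention maps, that replayed past-task questions are paired with current-task images, and that `lam_replay` and `lam_attn` are free loss weights.

```python
import torch
import torch.nn.functional as F

def quad_style_loss(model, old_model, images, questions, answers,
                    replay_questions, lam_replay=1.0, lam_attn=1.0):
    """Sketch of a QUAD-style objective (assumed interface, not the
    official implementation)."""
    # Plasticity: ordinary cross-entropy on the current task.
    logits, attn = model(images, questions)
    loss_task = F.cross_entropy(logits, answers)

    # Question-only replay: past-task questions are re-asked on current
    # images; the frozen old model provides soft targets, discouraging the
    # answer head from collapsing onto the current task's answer space.
    with torch.no_grad():
        old_logits, old_attn = old_model(images, replay_questions)
    new_logits, new_attn = model(images, replay_questions)
    loss_replay = F.kl_div(F.log_softmax(new_logits, dim=-1),
                           F.softmax(old_logits, dim=-1),
                           reduction="batchmean")

    # Attention-consistency distillation: keep intra- and inter-modal
    # attention maps on replayed questions close to the old model's.
    loss_attn = F.mse_loss(new_attn, old_attn)

    return loss_task + lam_replay * loss_replay + lam_attn * loss_attn
```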
Related papers
- LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering [10.060267989615813]
We introduce LeAdQA, an innovative approach that bridges these gaps through synergizing causal-aware query refinement with fine-grained visual grounding.
Experiments on NExT-QA, IntentQA, and NExT-GQA demonstrate that our method's precise visual grounding substantially enhances the understanding of video-question relationships.
arXiv Detail & Related papers (2025-07-20T01:57:00Z)
- QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View [2.3982875575861677]
We present our solutions for a spectrum of automation tasks in life-saving intervention procedures within the Trauma THOMPSON (T3) Challenge.
For action recognition and anticipation, we propose a pre-processing strategy that samples and stitches multiple inputs into a single image (a sketch of this step follows the entry).
For training, we present an action dictionary-guided design, which consistently yields the most favorable results.
arXiv Detail & Related papers (2024-07-18T06:55:26Z)
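The stitching pre-processing mentioned above is easy to picture. Below is a minimal, hypothetical NumPy sketch, assuming the sampled inputs are equally sized RGB frames arranged on a fixed grid; it illustrates the general idea, not the challenge submission's code.

```python
import numpy as np

def stitch_frames(frames, rows=2, cols=2):
    """Uniformly sample rows * cols frames and tile them into one image.

    `frames` is a list of H x W x 3 uint8 frames of equal size
    (an assumption made for this sketch).
    """
    idx = np.linspace(0, len(frames) - 1, rows * cols).astype(int)
    sampled = [frames[i] for i in idx]
    grid_rows = [np.concatenate(sampled[r * cols:(r + 1) * cols], axis=1)
                 for r in range(rows)]
    return np.concatenate(grid_rows, axis=0)

# Example: 16 dummy 224x224 frames -> one 448x448 stitched image.
video = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(16)]
stitched = stitch_frames(video)   # shape (448, 448, 3)
```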
- Continual Learning for Temporal-Sensitive Question Answering [12.76582814745124]
In real-world applications, it's crucial for models to continually acquire knowledge over time, rather than relying on a static, complete dataset.
Our paper investigates strategies that enable models to adapt to the ever-evolving information landscape.
We propose a training framework for CLTSQA that integrates temporal memory replay and temporal contrastive learning (the contrastive component is sketched after this entry).
arXiv Detail & Related papers (2024-07-17T10:47:43Z)
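The summary gives no formulation, but a temporal contrastive term is commonly implemented as an InfoNCE-style loss. The sketch below is a generic, hypothetical PyTorch version under that assumption, taking paired embeddings `anchor` and `positive` (e.g., representations of the same question grounded in the correct time period) with in-batch negatives.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Generic InfoNCE loss: each anchor should match its own positive and
    repel the other positives in the batch (used here only to illustrate
    the 'temporal contrastive learning' idea, not the paper's exact loss)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0))         # diagonal entries are positives
    return F.cross_entropy(logits, targets)

# Example with random 256-d embeddings for a batch of 8 pairs.
loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```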
- Exploring Question Decomposition for Zero-Shot VQA [99.32466439254821]
We investigate a question decomposition strategy for visual question answering.
We show that naive application of model-written decompositions can hurt performance.
We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors.
arXiv Detail & Related papers (2023-10-25T23:23:57Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question (a toy version of such perturbations is sketched after this entry).
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
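As a toy, hypothetical illustration of "perturbations on either the image or question" (the actual UNK-VQA construction is not described in this summary), one could blank a random image region or drop words from the question so that the sample becomes hard or impossible to answer:

```python
import random
import numpy as np

def perturb_image(img, frac=0.5):
    """Blank out a random square whose side is ~frac of the shorter side."""
    out = img.copy()
    h, w = out.shape[:2]
    size = int(min(h, w) * frac)
    y, x = random.randrange(h - size), random.randrange(w - size)
    out[y:y + size, x:x + size] = 0
    return out

def perturb_question(question, drop_prob=0.5):
    """Randomly drop words so the question loses key content."""
    kept = [w for w in question.split() if random.random() > drop_prob]
    return " ".join(kept) if kept else question.split()[0]

img = np.ones((224, 224, 3), dtype=np.uint8) * 255
perturbed_img = perturb_image(img)
perturbed_q = perturb_question("What color is the bus on the left?")
```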
- Prioritized Soft Q-Decomposition for Lexicographic Reinforcement Learning [1.8399318639816038]
We propose prioritized soft Q-decomposition (PSQD) for learning and adapting subtask solutions under lexicographic priorities.
PSQD offers the ability to reuse previously learned subtask solutions in a zero-shot composition, followed by an adaptation step.
We demonstrate the efficacy of our approach by presenting successful learning, reuse, and adaptation results for both low- and high-dimensional simulated robot control tasks.
arXiv Detail & Related papers (2023-10-03T18:36:21Z)
- Towards Robust Continual Learning with Bayesian Adaptive Moment Regularization [51.34904967046097]
Continual learning seeks to overcome the challenge of catastrophic forgetting, where a model forgets previously learnt information.
We introduce Bayesian Adaptive Moment Regularization (BAdam), a novel prior-based method that better constrains parameter growth, reducing catastrophic forgetting.
Results show that BAdam achieves state-of-the-art performance among prior-based methods on challenging single-headed class-incremental experiments.
arXiv Detail & Related papers (2023-09-15T17:10:51Z)
- Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering [7.640416680391081]
In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance.
We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection.
To mitigate the challenges associated with evaluating free-form open-ended VQA responses, we introduce a straightforward LLM-guided pre-processing technique.
arXiv Detail & Related papers (2023-06-16T17:47:57Z)
- Improving Visual Question Answering Models through Robustness Analysis and In-Context Learning with a Chain of Basic Questions [70.70725223310401]
This work proposes a new method that uses semantically related questions, referred to as basic questions, as noise to evaluate the robustness of VQA models.
The experimental results demonstrate that the proposed evaluation method effectively analyzes the robustness of VQA models.
arXiv Detail & Related papers (2023-04-06T15:32:35Z)
- SC-ML: Self-supervised Counterfactual Metric Learning for Debiased Visual Question Answering [10.749155815447127]
We propose a self-supervised counterfactual metric learning (SC-ML) method to better focus on question-relevant image features.
SC-ML can adaptively select the question-relevant visual features to answer the question, reducing the negative influence of question-irrelevant visual features on inferring answers.
arXiv Detail & Related papers (2023-04-04T09:05:11Z)
- Task-Adaptive Saliency Guidance for Exemplar-free Class Incremental Learning [60.501201259732625]
We introduce task-adaptive saliency for EFCIL and propose a new framework, which we call Task-Adaptive Saliency Supervision (TASS).
Our experiments demonstrate that our method can better preserve saliency maps across tasks and achieve state-of-the-art results on the CIFAR-100, Tiny-ImageNet, and ImageNet-Subset EFCIL benchmarks.
arXiv Detail & Related papers (2022-12-16T02:43:52Z)
- Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
- Continual VQA for Disaster Response Systems [0.0]
Visual Question Answering (VQA) is a multi-modal task that involves answering questions from an input image.
The main challenge is the delay caused by generating labels to assess the affected areas.
We deploy a pre-trained CLIP model, which is trained on image-text pairs.
We surpass previous state-of-the-art results on the FloodNet dataset.
arXiv Detail & Related papers (2022-09-21T12:45:51Z)
- Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task [12.74065821307626]
VQA is an ambitious task aiming to answer any image-related question.
It is hard to build such a system once and for all, since user needs are continuously evolving.
We propose a real-data-free replay-based method tailored for CL on VQA, named Scene Graph as Prompt for Replay (a toy illustration of serializing a scene graph into a replay prompt follows this entry).
arXiv Detail & Related papers (2022-08-24T12:00:02Z)
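How a scene graph could stand in for stored images is not spelled out in this summary; a purely illustrative, hypothetical way to turn a scene graph into a textual replay prompt is:

```python
def scene_graph_to_prompt(objects, relations):
    """Serialize a tiny scene graph into a text prompt that can be
    replayed instead of the original image (illustrative only)."""
    obj_part = ", ".join(objects)
    rel_part = "; ".join(f"{s} {p} {o}" for s, p, o in relations)
    return f"Scene with {obj_part}. Relations: {rel_part}."

prompt = scene_graph_to_prompt(
    objects=["man", "horse", "fence"],
    relations=[("man", "riding", "horse"), ("horse", "behind", "fence")],
)
# -> "Scene with man, horse, fence. Relations: man riding horse; horse behind fence."
```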
- Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules [85.98177341704675]
The problem of grounding VQA tasks has received increased attention in the research community recently.
We propose a visual capsule module with a query-based selection mechanism of capsule features.
We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task.
arXiv Detail & Related papers (2021-05-11T07:45:32Z)
- Regularizing Attention Networks for Anomaly Detection in Visual Question Answering [10.971443035470488]
We evaluate the robustness of state-of-the-art VQA models to five different anomalies.
We propose an attention-based method that uses the confidence of reasoning between input images and questions.
We show that a maximum-entropy regularization of attention networks can significantly improve attention-based anomaly detection (the regularizer is sketched after this entry).
arXiv Detail & Related papers (2020-09-21T17:47:49Z)
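The maximum-entropy regularizer has a standard form: penalize the negative entropy of each attention distribution so attention does not collapse onto a single region. The sketch below is a generic PyTorch version under that assumption, not the paper's exact objective.

```python
import torch

def attention_entropy_regularizer(attn, eps=1e-8):
    """Return the negative mean entropy of attention distributions.

    `attn` holds attention weights that sum to 1 over the last dimension,
    e.g. shape (batch, heads, queries, keys). Adding this term, scaled by
    a coefficient, to the training loss pushes attention toward higher
    entropy, i.e. discourages overly peaked attention maps.
    """
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)
    return -entropy.mean()

# Example usage with random attention weights.
attn = torch.softmax(torch.randn(4, 8, 10, 36), dim=-1)
penalty = attention_entropy_regularizer(attn)   # add lambda * penalty to the loss
```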
- SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by 5%, marginally improves performance on reasoning questions in VQA, and produces better attention maps (a simple consistency check is sketched below).
arXiv Detail & Related papers (2020-01-20T01:02:36Z)
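The summary does not define the consistency measure; one plausible, hypothetical reading is the fraction of correctly answered reasoning questions whose associated perception sub-questions are also answered correctly, as sketched here:

```python
def consistency_rate(results):
    """Fraction of correctly answered reasoning questions whose perception
    sub-question is also answered correctly (a hypothetical metric, used
    only to illustrate the notion of consistency).

    `results` is a list of (reasoning_correct, sub_question_correct) booleans.
    """
    sub_flags = [sub for main, sub in results if main]
    if not sub_flags:
        return 0.0
    return sum(sub_flags) / len(sub_flags)

# Example: 3 of the 4 correctly answered reasoning questions are consistent.
print(consistency_rate([(True, True), (True, False), (True, True),
                        (True, True), (False, False)]))  # 0.75
```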