CATCH: A Modular Cross-domain Adaptive Template with Hook
- URL: http://arxiv.org/abs/2510.26582v1
- Date: Thu, 30 Oct 2025 15:10:02 GMT
- Title: CATCH: A Modular Cross-domain Adaptive Template with Hook
- Authors: Xinjin Li, Yulie Lu, Jinghan Cao, Yu Ma, Zhenglin Li, Yeyang Zhou,
- Abstract summary: CATCH is a plug-and-play framework for cross-domain adaptation of Visual Question Answering (VQA) models. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules. Results show that our framework achieves consistent performance gains without retraining the backbone model.
- Score: 2.869731339311564
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier to identify the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experimental results across four domain-specific VQA benchmarks demonstrate that our framework achieves consistent performance gains without retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results highlight that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.
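The hook-based injection the abstract describes can be sketched in plain Python. This is only an illustration of the pattern, not the paper's actual API: every name here (`Tagged`, `VISUAL_ADAPTERS`, `with_hook`, the toy classifier and adapters) is hypothetical, and a real implementation would attach the adapters to a frozen vision encoder, e.g. via PyTorch forward hooks.

```python
# Sketch of CATCH-style adaptation (hypothetical names and logic):
# a domain classifier routes each input to a pair of lightweight
# adapters, spliced in around a frozen backbone via a hook wrapper,
# so the backbone itself is never retrained or modified.

class Tagged(list):
    """Toy input: feature values plus a domain tag (illustration only)."""
    def __init__(self, data, domain):
        super().__init__(data)
        self.domain = domain

def backbone_visual_features(image):
    # Stand-in for the frozen VQA backbone's vision encoder.
    return [float(x) for x in image]

def classify_domain(image):
    # Toy domain classifier: here we just read a tag carried by the input;
    # the paper uses a learned classifier over the image itself.
    return image.domain

# One lightweight adapter pair per domain (toy transformations).
VISUAL_ADAPTERS = {
    "medical": lambda feats: [f * 1.1 for f in feats],  # shift vision features
    "natural": lambda feats: feats,                     # identity: no shift
}
PROMPT_ADAPTERS = {
    "medical": lambda q: "In this radiology image, " + q,
    "natural": lambda q: q,
}

def with_hook(encoder):
    """Wrap the frozen encoder; the visual adapter is injected post-hoc."""
    def hooked(image):
        feats = encoder(image)            # backbone runs unchanged
        domain = classify_domain(image)
        adapt = VISUAL_ADAPTERS.get(domain, lambda f: f)
        return adapt(feats)
    return hooked

encoder = with_hook(backbone_visual_features)
img = Tagged([1.0, 2.0], domain="medical")
feats = encoder(img)                      # domain-adapted visual features
prompt = PROMPT_ADAPTERS["medical"]("what does the scan show?")
```

The design point this mirrors is that adding a new domain only means registering a new adapter pair in the two dictionaries; the wrapped encoder and the backbone are untouched.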
Related papers
- Multi-Sensor Matching with HyperNetworks [14.911092205861822]
We leverage hypernetworks to improve multimodal patch matching.
We introduce a lightweight descriptor-learning architecture that augments a Siamese CNN.
We also release GAP-VIR, a cross-platform (ground/aerial) VIS-IR patch dataset with 500K pairs.
arXiv Detail & Related papers (2026-01-18T09:19:33Z)
- DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image.
Vision-Language Pre-training models offer a strong open-vocabulary foundation, but struggle with fine-grained localization under weak supervision.
We propose the Dual Adaptive Refinement Transfer (DART) framework to overcome these limitations.
arXiv Detail & Related papers (2025-08-07T17:22:33Z)
- TransAdapter: Vision Transformer for Feature-Centric Unsupervised Domain Adaptation [0.3277163122167433]
Unsupervised Domain Adaptation (UDA) aims to utilize labeled data from a source domain to solve tasks in an unlabeled target domain.
Traditional CNN-based methods struggle to fully capture complex domain relationships.
We propose a novel UDA approach leveraging the Swin Transformer with three key modules.
arXiv Detail & Related papers (2024-12-05T11:11:39Z)
- APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation [33.90244697752314]
We introduce APSeg, a novel auto-prompt network for cross-domain few-shot semantic segmentation (CD-FSS).
Our model outperforms the state-of-the-art CD-FSS method by 5.24% and 3.10% in average accuracy on 1-shot and 5-shot settings, respectively.
arXiv Detail & Related papers (2024-06-12T16:20:58Z)
- CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing [66.6712018832575]
Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the model's performance on unseen domains.
We make use of large-scale VLMs like CLIP and leverage the textual feature to dynamically adjust the classifier's weights for exploring generalizable visual features.
arXiv Detail & Related papers (2024-03-21T11:58:50Z)
- Object-based (yet Class-agnostic) Video Domain Adaptation [78.34712426922519]
We present Object-based (yet Class-agnostic) Video Domain Adaptation (ODAPT)
ODAPT is a simple yet effective framework for adapting the existing action recognition systems to new domains.
Our model achieves a +6.5 increase when adapting across kitchens in Epic-Kitchens and a +3.1 increase adapting between Epic-Kitchens and the EGTEA dataset.
arXiv Detail & Related papers (2023-11-29T01:17:38Z)
- Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding [15.392243642628387]
We show that existing vision-language foundation models (VLFMs) perform poorly when applied directly to image change understanding (ICU).
ICU requires models to capture actual changes between multiple images and describe them in language.
We propose a Viewpoint Integration and Registration method to address these problems.
arXiv Detail & Related papers (2023-09-15T17:41:29Z)
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which offers minor code modifications and negligible extra computational cost.
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
- Semantic-aware Modular Capsule Routing for Visual Question Answering [55.03883681191765]
We propose a Semantic-aware modUlar caPsulE framework, termed as SUPER, to better capture the instance-specific vision-semantic characteristics.
We comparatively justify the effectiveness and generalization ability of our proposed SUPER scheme over five benchmark datasets.
arXiv Detail & Related papers (2022-07-21T10:48:37Z)
- BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning [88.82371069668147]
BatchFormerV2 is a more general batch Transformer module, which enables exploring sample relationships for dense representation learning.
BatchFormerV2 consistently improves current DETR-based detection methods by over 1.3%.
arXiv Detail & Related papers (2022-04-04T05:53:42Z)
- Domain-robust VQA with diverse datasets and methods but no target labels [34.331228652254566]
Domain adaptation for VQA differs from adaptation for object recognition due to additional complexity.
To tackle these challenges, we first quantify domain shifts between popular VQA datasets.
We also construct synthetic shifts in the image and question domains separately.
arXiv Detail & Related papers (2021-03-29T22:24:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.