Multi-Sourced Compositional Generalization in Visual Question Answering
- URL: http://arxiv.org/abs/2505.23045v1
- Date: Thu, 29 May 2025 03:41:36 GMT
- Title: Multi-Sourced Compositional Generalization in Visual Question Answering
- Authors: Chuanhao Li, Wenbo Ye, Zhen Li, Yuwei Wu, Yunde Jia
- Abstract summary: We propose a retrieval-augmented training framework to enhance the MSCG ability of visual question answering (VQA) models. We construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities.
- Score: 31.47252795543269
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compositional generalization is the ability to generalize to novel compositions of seen primitives, and has recently received much attention in vision-and-language (V&L). Due to the multi-modal nature of V&L tasks, the primitives composing a composition can come from different modalities, resulting in multi-sourced novel compositions. However, the ability to generalize over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG), remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA) and propose a retrieval-augmented training framework that enhances the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are aggregated with the original primitive to refine the model. This process helps the model learn consistent representations for the same semantic primitives across different modalities. To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities. Experimental results demonstrate the effectiveness of the proposed framework. We release GQA-MSCG at https://github.com/NeverMoreLCH/MSCG.
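The retrieve-then-aggregate step described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the cosine-similarity retrieval, the mean aggregation, the blending weight `alpha`, and all names are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def retrieve_equivalent(query: np.ndarray, memory: np.ndarray, k: int = 3) -> np.ndarray:
    """Retrieve the k memory features most similar to the query primitive.

    `memory` holds candidate features of semantically equivalent primitives
    from another modality (shape: [n, d]).
    """
    # Cosine similarity between the query and every memory entry.
    q = query / (np.linalg.norm(query) + 1e-8)
    m = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    sims = m @ q                       # shape (n,)
    top_k = np.argsort(-sims)[:k]      # indices of the k most similar entries
    return memory[top_k]

def aggregate(original: np.ndarray, retrieved: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend the original primitive feature with the mean of retrieved features.

    Pulling a primitive toward its cross-modal neighbors nudges semantically
    equivalent primitives from different modalities toward a shared representation.
    """
    return alpha * original + (1 - alpha) * retrieved.mean(axis=0)

# Toy usage: refine a 4-d "visual" primitive with retrieved "textual" candidates.
rng = np.random.default_rng(0)
visual_primitive = rng.normal(size=4)
textual_memory = rng.normal(size=(10, 4))
neighbors = retrieve_equivalent(visual_primitive, textual_memory, k=3)
refined = aggregate(visual_primitive, neighbors)
```

In the actual framework the refined feature would feed back into VQA training; here `refined` simply stands in for that updated primitive representation.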
Related papers
- A Benchmark Dataset for Graph Regression with Homogeneous and Multi-Relational Variants [3.037387520023979]
We introduce RelSC, a new graph-regression dataset built from program graphs. Each graph is labelled with the execution-time cost of the corresponding program. We evaluate a diverse set of graph neural network architectures on both variants of RelSC.
arXiv Detail & Related papers (2025-05-29T12:59:36Z)
- Consistency of Compositional Generalization across Multiple Levels [31.77432446850103]
We propose a meta-learning based framework for achieving consistent compositional generalization across multiple levels. We build a GQA-CCG dataset to quantitatively evaluate the consistency.
arXiv Detail & Related papers (2024-12-18T09:09:41Z)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval-Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- Graph-guided Cross-composition Feature Disentanglement for Compositional Zero-shot Learning [54.08741382593959]
Disentanglement of the visual features of primitives (i.e., attributes and objects) has shown exceptional results in Compositional Zero-shot Learning (CZSL). However, it is challenging to learn disentangled primitive features that are general across different compositions. We propose cross-composition feature disentanglement, which takes multiple primitive-sharing compositions as inputs.
arXiv Detail & Related papers (2024-08-19T08:23:09Z)
- A Simple Recipe for Language-guided Domain Generalized Segmentation [45.93202559299953]
Generalization to new domains not seen during training is one of the long-standing challenges in deploying neural networks in real-world applications.
We introduce a simple framework for generalizing semantic segmentation networks by employing language as the source of randomization.
Our recipe comprises three key ingredients: (i) the preservation of the intrinsic CLIP robustness through minimal fine-tuning, (ii) language-driven local style augmentation, and (iii) randomization by locally mixing the source and augmented styles during training.
arXiv Detail & Related papers (2023-11-29T18:59:59Z)
- Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models [55.5610165938949]
Fine-tuning vision-language models (VLMs) has gained increasing popularity due to its practical value.
This paper explores the collaborative potential of leveraging much weaker VLMs to enhance the generalization of a robust single model.
We introduce three customized ensemble strategies, each tailored to one specific scenario.
The proposed ensemble strategies are evaluated on zero-shot, base-to-new, and cross-dataset generalization, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2023-11-28T05:17:25Z)
- Style-Hallucinated Dual Consistency Learning: A Unified Framework for Visual Domain Generalization [113.03189252044773]
We propose a unified framework, Style-HAllucinated Dual consistEncy learning (SHADE), to handle domain shift in various visual tasks.
Our versatile SHADE can significantly enhance the generalization in various visual recognition tasks, including image classification, semantic segmentation and object detection.
arXiv Detail & Related papers (2022-12-18T11:42:51Z)
- Improving the Sample-Complexity of Deep Classification Networks with Invariant Integration [77.99182201815763]
Leveraging prior knowledge on intraclass variance due to transformations is a powerful method to improve the sample complexity of deep neural networks.
We propose a novel monomial selection algorithm based on pruning methods to allow an application to more complex problems.
We demonstrate the improved sample complexity on the Rotated-MNIST, SVHN and CIFAR-10 datasets.
arXiv Detail & Related papers (2022-02-08T16:16:11Z)
- Neural Entity Linking: A Survey of Models Based on Deep Learning [82.43751915717225]
This survey presents a comprehensive description of recent neural entity linking (EL) systems developed since 2015.
Its goal is to systematize the design features of neural entity linking systems and compare their performance to classic methods on common benchmarks.
The survey touches on applications of entity linking, focusing on the recently emerged use-case of enhancing deep pre-trained masked language models.
arXiv Detail & Related papers (2020-05-31T18:02:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.