Object Attribute Matters in Visual Question Answering
- URL: http://arxiv.org/abs/2401.09442v1
- Date: Wed, 20 Dec 2023 12:46:30 GMT
- Title: Object Attribute Matters in Visual Question Answering
- Authors: Peize Li, Qingyi Si, Peng Fu, Zheng Lin, Yan Wang
- Abstract summary: We propose a novel VQA approach from the perspective of utilizing object attributes.
The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing.
The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness.
- Score: 15.705504296316576
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual question answering is a multimodal task that requires the joint
comprehension of visual and textual information. However, integrating visual
and textual semantics solely through attention layers is insufficient to
comprehensively understand and align information from both modalities.
Intuitively, object attributes can naturally serve as a bridge to unify them,
which has been overlooked in previous research. In this paper, we propose a
novel VQA approach from the perspective of utilizing object attributes, aiming
to achieve better object-level visual-language alignment and multimodal scene
understanding. Specifically, we design an attribute fusion module and a
contrastive knowledge distillation module. The attribute fusion module
constructs a multimodal graph neural network to fuse attributes and visual
features through message passing. The enhanced object-level visual features
contribute to solving fine-grained problems such as counting questions. The better
object-level visual-language alignment aids in understanding multimodal scenes,
thereby improving the model's robustness. Furthermore, to augment scene
understanding and out-of-distribution performance, the contrastive knowledge
distillation module introduces a series of implicit knowledge sources. We
distill knowledge into attributes through contrastive loss, which further
strengthens the representation learning of attribute features and facilitates
visual-linguistic alignment. Extensive experiments on six datasets, COCO-QA,
VQAv2, VQA-CPv2, VQA-CPv1, VQAvs and TDIUC, show the superiority of the
proposed method.
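The abstract describes both modules only at a high level. As a rough illustration, below is a minimal PyTorch-style sketch of (a) fusing per-object attribute embeddings into region features via one round of message passing on an object-level graph and (b) an InfoNCE-style contrastive loss for distilling implicit knowledge into attribute features. All class names, dimensions, and the single-layer graph design are assumptions for illustration and do not reproduce the authors' implementation.

```python
# Hypothetical sketch of the two modules described in the abstract.
# Shapes, names, and the single message-passing step are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttributeFusion(nn.Module):
    """Fuse per-object attribute embeddings into visual features via one
    round of message passing on a fully connected multimodal graph."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.edge_proj = nn.Linear(2 * dim, dim)   # message from (visual, attribute) pairs
        self.update = nn.GRUCell(dim, dim)         # node update after aggregation

    def forward(self, visual: torch.Tensor, attr: torch.Tensor) -> torch.Tensor:
        # visual: (num_objects, dim), attr: (num_objects, dim)
        n = visual.size(0)
        # Build a message for every (object, attribute) pair.
        pairs = torch.cat(
            [visual.unsqueeze(1).expand(n, n, -1),
             attr.unsqueeze(0).expand(n, n, -1)], dim=-1)
        messages = torch.relu(self.edge_proj(pairs))          # (n, n, dim)
        weights = torch.softmax(messages.sum(-1), dim=-1)     # attention over attribute nodes
        aggregated = torch.einsum("ij,ijd->id", weights, messages)
        return self.update(aggregated, visual)                # attribute-enhanced visual features


def contrastive_distillation_loss(attr_feat, knowledge_feat, temperature=0.07):
    """InfoNCE-style loss that pulls each attribute feature toward its
    paired implicit-knowledge feature and pushes it away from the rest."""
    attr_feat = F.normalize(attr_feat, dim=-1)
    knowledge_feat = F.normalize(knowledge_feat, dim=-1)
    logits = attr_feat @ knowledge_feat.t() / temperature     # (batch, batch)
    targets = torch.arange(attr_feat.size(0), device=attr_feat.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    fusion = AttributeFusion(dim=512)
    visual = torch.randn(36, 512)       # e.g. 36 region features from an object detector
    attr = torch.randn(36, 512)         # attribute embeddings for the same regions
    enhanced = fusion(visual, attr)
    loss = contrastive_distillation_loss(torch.randn(8, 512), torch.randn(8, 512))
    print(enhanced.shape, loss.item())
```

In practice the attribute embeddings would come from an attribute predictor over detected regions and the knowledge features from a pretrained encoder; the random tensors above are stand-ins.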
Related papers
- PSVMA+: Exploring Multi-granularity Semantic-visual Adaption for Generalized Zero-shot Learning [116.33775552866476]
Generalized zero-shot learning (GZSL) endeavors to identify unseen classes using knowledge from the seen domain.
GZSL suffers from insufficient visual-semantic correspondences due to attribute diversity and instance diversity.
We propose a multi-granularity progressive semantic-visual adaption network, where sufficient visual elements can be gathered to remedy the inconsistency.
arXiv Detail & Related papers (2024-10-15T12:49:33Z) - KNN Transformer with Pyramid Prompts for Few-Shot Learning [52.735070934075736]
Few-Shot Learning aims to recognize new classes with limited labeled data.
Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features.
arXiv Detail & Related papers (2024-10-14T07:39:30Z) - ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a multimodal data augmentation framework that manipulates visual attributes under knowledge guidance, extracting knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z) - Instruction Tuning-free Visual Token Complement for Multimodal LLMs [51.138806401996696]
Multimodal large language models (MLLMs) have promised an elegant bridge between vision and language.
We propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features.
Our VTC integrates text-to-image generation as a guide to identify text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens.
arXiv Detail & Related papers (2024-08-09T12:13:01Z) - LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition [17.388776062997813]
We try to build discriminative global representations by fusing image data and text descriptions of the visual scene.
The motivation is twofold: (1) Current Large Vision-Language Models (LVLMs) demonstrate extraordinary emergent capability in visual instruction following, and thus provide an efficient and flexible means of generating text descriptions of images.
Although promising, leveraging LVLMs to build multi-modal VPR solutions remains challenging, particularly with respect to efficient multi-modal fusion.
arXiv Detail & Related papers (2024-07-09T10:15:31Z) - Multi-modal Attribute Prompting for Vision-Language Models [40.39559705414497]
Pre-trained Vision-Language Models (VLMs) exhibit strong generalization ability to downstream tasks but struggle in few-shot scenarios.
Existing prompting techniques primarily focus on global text and image representations, yet overlook multi-modal attribute characteristics.
We propose a Multi-modal Attribute Prompting method (MAP) by jointly exploring textual attribute prompting, visual attribute prompting, and attribute-level alignment.
arXiv Detail & Related papers (2024-03-01T01:28:10Z) - LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z) - Bear the Query in Mind: Visual Grounding with Query-conditioned
Convolution [26.523051615516742]
We propose a Query-conditioned Convolution Module (QCM) that extracts query-aware visual features by incorporating query information into the generation of convolutional kernels.
Our method achieves state-of-the-art performance on three popular visual grounding datasets.
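As a rough illustration of the query-conditioned convolution idea (query features generating dynamic convolutional kernels that filter the visual feature map), here is a minimal, hypothetical PyTorch sketch; the depthwise design, shapes, and names are assumptions, not the paper's code.

```python
# Hypothetical sketch of query-conditioned convolution: a pooled query
# embedding is projected into dynamic depthwise kernels that filter the
# visual feature map. Names and shapes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryConditionedConv(nn.Module):
    def __init__(self, channels: int = 256, kernel_size: int = 3, query_dim: int = 768):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        # Predict one depthwise kernel per channel from the pooled query feature.
        self.kernel_gen = nn.Linear(query_dim, channels * kernel_size * kernel_size)

    def forward(self, feat: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # feat: (1, channels, H, W); query: (query_dim,) pooled text embedding
        k = self.kernel_gen(query).view(self.channels, 1, self.kernel_size, self.kernel_size)
        # Depthwise convolution with the query-generated kernels.
        return F.conv2d(feat, k, padding=self.kernel_size // 2, groups=self.channels)


if __name__ == "__main__":
    qcm = QueryConditionedConv()
    visual_map = torch.randn(1, 256, 20, 20)   # backbone feature map
    query_vec = torch.randn(768)               # pooled language-model query embedding
    print(qcm(visual_map, query_vec).shape)    # -> torch.Size([1, 256, 20, 20])
```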
arXiv Detail & Related papers (2022-06-18T04:26:39Z) - Good Visual Guidance Makes A Better Extractor: Hierarchical Visual
Prefix for Multimodal Entity and Relation Extraction [88.6585431949086]
We propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction.
We regard the visual representation as a pluggable visual prefix that guides the textual representation toward error-insensitive prediction decisions.
Experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
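For the visual-prefix idea (projecting visual features into a few prefix embeddings prepended to the text sequence), a minimal, hypothetical PyTorch sketch follows; it is a simplified, single-level variant and does not reproduce HVPNeT's hierarchical design.

```python
# Hypothetical sketch of a pluggable visual prefix: region features are
# projected into a few prefix embeddings and prepended to the token
# embeddings before a text encoder. Lengths and dims are illustrative.
import torch
import torch.nn as nn


class VisualPrefixEncoder(nn.Module):
    def __init__(self, text_dim: int = 768, visual_dim: int = 2048, prefix_len: int = 4):
        super().__init__()
        self.prefix_len = prefix_len
        self.proj = nn.Linear(visual_dim, prefix_len * text_dim)
        layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_emb: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, text_dim); visual_feat: (batch, visual_dim)
        prefix = self.proj(visual_feat).view(visual_feat.size(0), self.prefix_len, -1)
        fused = torch.cat([prefix, token_emb], dim=1)     # prepend visual prefix
        return self.encoder(fused)[:, self.prefix_len:]   # keep text positions only


if __name__ == "__main__":
    model = VisualPrefixEncoder()
    out = model(torch.randn(2, 16, 768), torch.randn(2, 2048))
    print(out.shape)   # -> torch.Size([2, 16, 768])
```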
arXiv Detail & Related papers (2022-05-07T02:10:55Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.