Self-Segregating and Coordinated-Segregating Transformer for Focused
Deep Multi-Modular Network for Visual Question Answering
- URL: http://arxiv.org/abs/2006.14264v1
- Date: Thu, 25 Jun 2020 09:17:03 GMT
- Title: Self-Segregating and Coordinated-Segregating Transformer for Focused
Deep Multi-Modular Network for Visual Question Answering
- Authors: Chiranjib Sur
- Abstract summary: We define segregating strategies that prioritize content for the application in order to enhance performance.
We define two strategies: the Self-Segregating Transformer (SST) and the Coordinated-Segregating Transformer (CST).
This work can easily be used in many other applications that involve repetition and multiple frames of features.
- Score: 9.89901717499058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The attention mechanism has gained huge popularity due to its
effectiveness in achieving high accuracy in different domains. But attention is
opportunistic and is not justified by the content or its usability:
Transformer-like structures create all/any possible attentions. We define
segregating strategies that prioritize content for the application in order to
enhance performance. We define two strategies, the Self-Segregating Transformer
(SST) and the Coordinated-Segregating Transformer (CST), and use them to solve
the visual question answering task. The self-segregation strategy for attention
contributes to better understanding and filtering of the information that is
most helpful for answering the question, and creates diversity of visual
reasoning for attention. This work can easily be used in many other
applications that involve repetition and multiple frames of features, and would
reduce the commonality of the attentions to a great extent. Visual Question
Answering (VQA) requires understanding and coordination of both images and
their textual interpretations. Experiments demonstrate that segregation
strategies for cascaded multi-head transformer attention outperform many
previous works and achieve considerable improvement on the VQA-v2 benchmark.
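The abstract above describes the segregation idea only at a high level, so the following is a minimal, hypothetical sketch of what a "segregating" attention block for VQA could look like: standard multi-head cross-attention between question tokens and image regions, followed by a learned relevance gate that suppresses attended content judged unhelpful for the question. The gating network, tensor shapes, and region/token counts are illustrative assumptions, not the authors' exact SST/CST formulation.

```python
# Hypothetical sketch of a "segregating" attention block for VQA. The gating
# scheme below is an illustrative assumption, not the paper's SST/CST layer.
import torch
import torch.nn as nn


class SegregatingAttention(nn.Module):
    """Question-to-image cross-attention followed by a learned relevance gate
    that suppresses attended content judged unhelpful for the question."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, 1)   # scores usefulness of attended content

    def forward(self, img_feats, q_feats):
        # img_feats: (B, R, dim) region features; q_feats: (B, T, dim) question tokens
        attended, _ = self.attn(query=q_feats, key=img_feats, value=img_feats)
        q_summary = q_feats.mean(dim=1, keepdim=True)                  # (B, 1, dim)
        score = self.gate(torch.cat([attended,
                                     q_summary.expand_as(attended)], dim=-1))
        keep = torch.sigmoid(score)          # soft segregation mask, (B, T, 1)
        return attended * keep               # prioritized (filtered) content


# Usage with random tensors standing in for detected regions / question tokens.
block = SegregatingAttention()
out = block(torch.randn(2, 36, 512), torch.randn(2, 14, 512))   # -> (2, 14, 512)
```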
Related papers
- PartFormer: Awakening Latent Diverse Representation from Vision Transformer for Object Re-Identification [73.64560354556498]
Vision Transformer (ViT) tends to overfit to the most distinct regions of the training data, limiting its generalizability and its attention to holistic object features.
We present PartFormer, an innovative adaptation of ViT designed to overcome these limitations in object Re-ID tasks.
Our framework significantly outperforms the state of the art by 2.4% mAP on the most challenging MSMT17 dataset.
arXiv Detail & Related papers (2024-08-29T16:31:05Z)
Plug-and-Play Regulators for Image-Text Matching [76.28522712930668]
Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching.
We develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations.
Experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models.
arXiv Detail & Related papers (2023-03-23T15:42:05Z)
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering [5.547800834335381]
We investigate the efficacy of co-attention transformer layers in helping the network focus on relevant regions while answering the question.
We generate visual attention maps using the question-conditioned image attention scores in these co-attention layers.
Our work sheds light on the function and interpretation of co-attention transformer layers, highlights gaps in current networks, and can guide the development of future VQA models.
arXiv Detail & Related papers (2022-01-11T14:25:17Z)
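As a rough illustration of the probing described in the entry above, the sketch below extracts question-conditioned image attention weights from a generic co-attention layer and averages them into a per-region relevance map; the layer, feature shapes, and the 6x6 region grid are hypothetical stand-ins, not the specific VQA networks analyzed in that paper.

```python
# Hypothetical sketch: turning question-conditioned image attention scores from
# a co-attention layer into a visual attention map over image regions.
import torch
import torch.nn as nn

dim, heads = 512, 8
co_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

q_tokens = torch.randn(1, 14, dim)   # question token features (queries)
regions = torch.randn(1, 36, dim)    # region features on a 6x6 grid (keys/values)

# attn_weights: (B, num_queries, num_regions), averaged over heads by default.
_, attn_weights = co_attn(q_tokens, regions, regions, need_weights=True)

# Average over question tokens -> one relevance score per region, then reshape
# to the spatial grid so it can be visualized as a heat map over the image.
region_scores = attn_weights.mean(dim=1)        # (1, 36)
attention_map = region_scores.reshape(1, 6, 6)  # (1, 6, 6)
```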
Compositional Attention: Disentangling Search and Retrieval [66.7108739597771]
Multi-head, key-value attention is the backbone of the Transformer model and its variants.
Standard attention heads learn a rigid mapping between search and retrieval.
We propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure.
arXiv Detail & Related papers (2021-10-18T15:47:38Z)
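The entry above only names the idea of disentangling search from retrieval, so the following is a heavily simplified, hypothetical sketch: each "search" head computes its own attention pattern, every value projection ("retrieval") can be read out with every pattern, and a learned soft selection decides which retrieval each search uses. The selection scorer and all dimensions are assumptions for illustration, not the paper's exact Compositional Attention layer.

```python
# Heavily simplified, hypothetical sketch of disentangling "search" (where to
# attend) from "retrieval" (what to read out); the soft selection over
# retrievals is an illustrative assumption, not the paper's exact layer.
import torch
import torch.nn as nn


class DisentangledAttention(nn.Module):
    def __init__(self, dim: int = 256, n_search: int = 4, n_retrieval: int = 4):
        super().__init__()
        self.dq = dim // n_search
        self.q = nn.Linear(dim, n_search * self.dq)
        self.k = nn.Linear(dim, n_search * self.dq)
        self.v = nn.Linear(dim, n_retrieval * self.dq)
        self.select = nn.Linear(self.dq, 1)            # scores (search, retrieval) pairs
        self.out = nn.Linear(n_search * self.dq, dim)
        self.n_s, self.n_r = n_search, n_retrieval

    def forward(self, x):                               # x: (B, T, dim)
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.n_s, self.dq).transpose(1, 2)   # (B, S, T, dq)
        k = self.k(x).view(B, T, self.n_s, self.dq).transpose(1, 2)   # (B, S, T, dq)
        v = self.v(x).view(B, T, self.n_r, self.dq).transpose(1, 2)   # (B, R, T, dq)

        # Search: one attention pattern per search head.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dq ** 0.5, dim=-1)

        # Retrieval: read every value projection out with every search pattern.
        read = torch.einsum("bstu,brud->bsrtd", attn, v)               # (B, S, R, T, dq)

        # Soft selection: each search mixes over the R retrievals per position.
        sel = torch.softmax(self.select(read).squeeze(-1), dim=2)      # (B, S, R, T)
        mixed = (read * sel.unsqueeze(-1)).sum(dim=2)                  # (B, S, T, dq)
        return self.out(mixed.transpose(1, 2).reshape(B, T, -1))       # (B, T, dim)


y = DisentangledAttention()(torch.randn(2, 10, 256))    # -> (2, 10, 256)
```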
Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks [34.32609892928909]
We propose a novel attention mechanism which we call external attention, based on two external, small, learnable, and shared memories.
Our method provides comparable or superior performance to the self-attention mechanism and some of its variants, with much lower computational and memory costs.
arXiv Detail & Related papers (2021-05-05T22:29:52Z)
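Because the entry above states the mechanism fairly concretely (attention computed against two small, learnable, shared external memories implemented as linear layers), a minimal sketch is given below; it follows the commonly published external-attention formulation with double normalization, and the memory size and shapes are illustrative assumptions.

```python
# Minimal sketch of external attention: tokens attend over a small learnable
# key memory and read from a learnable value memory, giving linear cost in the
# number of tokens. Memory size S=64 and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class ExternalAttention(nn.Module):
    def __init__(self, dim: int = 512, mem_size: int = 64):
        super().__init__()
        self.mk = nn.Linear(dim, mem_size, bias=False)   # external key memory M_k
        self.mv = nn.Linear(mem_size, dim, bias=False)   # external value memory M_v

    def forward(self, x):                                 # x: (B, N, dim)
        attn = torch.softmax(self.mk(x), dim=1)           # normalize over the N tokens
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)   # double normalization
        return self.mv(attn)                              # (B, N, dim)


y = ExternalAttention()(torch.randn(2, 196, 512))   # e.g. 14x14 patch features
```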
Spatially Aware Multimodal Transformers for TextVQA [61.01618988620582]
We study the TextVQA task, i.e., reasoning about text in images to answer a question.
Existing approaches are limited in their use of spatial relations.
We propose a novel spatially aware self-attention layer.
arXiv Detail & Related papers (2020-07-23T17:20:55Z)
Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
arXiv Detail & Related papers (2020-06-29T20:28:52Z)
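A rough, hypothetical sketch of the shared-projection idea summarized in the entry above: all heads share a single key/query projection, and each head only learns a small mixing vector that re-weights the shared dimensions. The mixing parameterization and sizes below are illustrative assumptions rather than the paper's released layer.

```python
# Hypothetical sketch of collaborative heads: one key/query projection shared
# by all heads, with each head re-weighting its dimensions through a learned
# mixing vector instead of owning separate projection matrices.
import torch
import torch.nn as nn


class CollaborativeAttention(nn.Module):
    def __init__(self, dim: int = 512, shared_dim: int = 256, heads: int = 8):
        super().__init__()
        self.wq = nn.Linear(dim, shared_dim, bias=False)        # shared across heads
        self.wk = nn.Linear(dim, shared_dim, bias=False)        # shared across heads
        self.wv = nn.Linear(dim, dim, bias=False)
        self.mix = nn.Parameter(torch.ones(heads, shared_dim))  # per-head mixing vectors
        self.out = nn.Linear(heads * dim, dim)
        self.shared_dim = shared_dim

    def forward(self, x):                                       # x: (B, T, dim)
        B, T, _ = x.shape
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        # Per-head attention scores from the shared projections, re-weighted by `mix`.
        scores = torch.einsum("btd,hd,bud->bhtu", q, self.mix, k)
        attn = torch.softmax(scores / self.shared_dim ** 0.5, dim=-1)   # (B, H, T, T)
        ctx = torch.einsum("bhtu,bud->bhtd", attn, v)                   # (B, H, T, dim)
        return self.out(ctx.transpose(1, 2).reshape(B, T, -1))          # (B, T, dim)


y = CollaborativeAttention()(torch.randn(2, 20, 512))   # -> (2, 20, 512)
```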
SACT: Self-Aware Multi-Space Feature Composition Transformer for Multinomial Attention for Video Captioning [9.89901717499058]
As the feature length increases, it becomes increasingly important to include provisions for improved capturing of the pertinent contents.
In this work, we introduce a new concept, the Self-Aware Composition Transformer (SACT), which is capable of generating Multinomial Attention (MultAtt).
We propose the Self-Aware Composition Transformer model for dense video captioning and apply the technique to two benchmark datasets, ActivityNet and YouCookII.
arXiv Detail & Related papers (2020-06-25T09:11:49Z)