Semantic-aware Modular Capsule Routing for Visual Question Answering
- URL: http://arxiv.org/abs/2207.10404v1
- Date: Thu, 21 Jul 2022 10:48:37 GMT
- Title: Semantic-aware Modular Capsule Routing for Visual Question Answering
- Authors: Yudong Han, Jianhua Yin, Jianlong Wu, Yinwei Wei, Liqiang Nie
- Abstract summary: We propose a Semantic-aware modUlar caPsulE framework, termed SUPER, to better capture instance-specific vision-semantic characteristics.
We demonstrate the effectiveness and generalization ability of the proposed SUPER scheme on five benchmark datasets.
- Score: 55.03883681191765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) is fundamentally compositional in nature, and
many questions can be answered simply by decomposing them into modular
sub-problems. The recently proposed Neural Module Network (NMN) applies this
strategy to question answering, yet it relies heavily on an off-the-shelf layout
parser or additional expert policies for the network architecture design
instead of learning from the data. Such strategies adapt poorly to the
semantically complicated variance of the inputs, thereby hindering the
representational capacity and generalizability of the model. To tackle this
problem, we propose a Semantic-aware modUlar caPsulE Routing framework, termed
SUPER, to better capture instance-specific vision-semantic characteristics and
refine the discriminative representations for prediction. In particular, five
powerful specialized modules as well as dynamic routers are tailored in each
layer of the SUPER network, and compact routing spaces are constructed such
that a variety of customizable routes can be sufficiently exploited and the
vision-semantic representations can be explicitly calibrated. We demonstrate
the effectiveness and generalization ability of the proposed SUPER scheme on
five benchmark datasets, as well as its advantage in parameter efficiency. It
is worth emphasizing that this work does not pursue state-of-the-art results
on VQA. Instead, we expect our model to offer a novel perspective on
architecture learning and representation calibration for VQA.
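The layer design described in the abstract pairs a dynamic router with a small set of specialized modules. Below is a minimal, hypothetical PyTorch sketch of that idea; the module bodies, the soft routing scheme, and all names and dimensions are illustrative assumptions, not the paper's actual SUPER implementation.

# Hypothetical sketch of semantic-aware modular routing (not the paper's code).
import torch
import torch.nn as nn

class RoutedLayer(nn.Module):
    def __init__(self, dim: int, num_modules: int = 5):
        super().__init__()
        # Placeholder specialized modules (the paper's five modules are not
        # reproduced here; simple MLP blocks stand in for them).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
            for _ in range(num_modules)
        )
        # Dynamic router: scores each module from the fused vision-question state.
        self.router = nn.Linear(dim, num_modules)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.router(state), dim=-1)              # (B, M)
        outputs = torch.stack([m(state) for m in self.experts], dim=1)   # (B, M, D)
        # Instance-specific route: weighted combination of module outputs.
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)              # (B, D)

x = torch.randn(8, 512)  # fused vision-question features (assumed shape)
net = nn.Sequential(RoutedLayer(512), RoutedLayer(512))
print(net(x).shape)      # torch.Size([8, 512])

Stacking several such layers yields a compact routing space in which each question-image pair can follow its own soft path through the modules, which is the intuition behind the "customizable routes" the abstract refers to.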
Related papers
- Breaking Neural Network Scaling Laws with Modularity [8.482423139660153]
We show how the amount of training data required to generalize varies with the intrinsic dimensionality of a task's input.
We then develop a novel learning rule for modular networks to exploit this advantage.
arXiv Detail & Related papers (2024-09-09T16:43:09Z) - Hierarchical Invariance for Robust and Interpretable Vision Tasks at Larger Scales [54.78115855552886]
We show how to construct over-complete invariants with a Convolutional Neural Network (CNN)-like hierarchical architecture.
With the over-completeness, discriminative features w.r.t. the task can be adaptively formed in a Neural Architecture Search (NAS)-like manner.
For robust and interpretable vision tasks at larger scales, hierarchical invariant representations can be considered an effective alternative to traditional CNNs and invariants.
arXiv Detail & Related papers (2024-02-23T16:50:07Z) - SeqTR: A Simple yet Universal Network for Visual Grounding [88.03253818868204]
We propose a simple yet universal network termed SeqTR for visual grounding tasks.
We cast visual grounding as a point prediction problem conditioned on image and text inputs.
Under this paradigm, visual grounding tasks are unified in our SeqTR network without task-specific branches or heads.
arXiv Detail & Related papers (2022-03-30T12:52:46Z) - Build a Robust QA System with Transformer-based Mixture of Experts [0.29005223064604074]
We build a robust question answering system that can adapt to out-of-domain datasets.
We show that our best combination of architecture and data-augmentation techniques achieves a 53.477 F1 score in the out-of-domain evaluation.
arXiv Detail & Related papers (2022-03-20T02:38:29Z) - Neural combinatorial optimization beyond the TSP: Existing architectures
under-represent graph structure [9.673093148930876]
We analyze how and whether recent neural architectures can be applied to graph problems of practical importance.
We show that augmenting the structural representation of problems with distance information is a promising step towards the still-ambitious goal of learning multi-purpose autonomous solvers.
arXiv Detail & Related papers (2022-01-03T14:14:28Z) - Combining Discrete Choice Models and Neural Networks through Embeddings:
Formulation, Interpretability and Performance [10.57079240576682]
This study proposes a novel approach that combines theory-driven and data-driven choice models using Artificial Neural Networks (ANNs).
In particular, we use continuous vector representations, called embeddings, for encoding categorical or discrete explanatory variables.
Our models deliver state-of-the-art predictive performance, outperforming existing ANN-based models while drastically reducing the number of required network parameters.
arXiv Detail & Related papers (2021-09-24T15:55:31Z) - Learning Deep Interleaved Networks with Asymmetric Co-Attention for
Image Restoration [65.11022516031463]
We present a deep interleaved network (DIN) that learns how information at different states should be combined for high-quality (HQ) image reconstruction.
In this paper, we propose asymmetric co-attention (AsyCA) which is attached at each interleaved node to model the feature dependencies.
Our presented DIN can be trained end-to-end and applied to various image restoration tasks.
arXiv Detail & Related papers (2020-10-29T15:32:00Z) - Obtaining Faithful Interpretations from Compositional Neural Networks [72.41100663462191]
We evaluate the intermediate outputs of NMNs on NLVR2 and DROP datasets.
We find that the intermediate outputs differ from the expected output, illustrating that the network structure does not provide a faithful explanation of model behaviour.
arXiv Detail & Related papers (2020-05-02T06:50:35Z) - On Infinite-Width Hypernetworks [101.03630454105621]
We show that hypernetworks are not guaranteed to converge to a global minimum under gradient descent.
We identify the functional priors of these architectures by deriving their corresponding GP and NTK kernels.
As part of this study, we make a mathematical contribution by deriving tight bounds on high order Taylor terms of standard fully connected ReLU networks.
arXiv Detail & Related papers (2020-03-27T00:50:29Z)
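For background on the "capsule routing" terminology in the main paper's title, here is a compact sketch of the classic dynamic routing-by-agreement procedure (Sabour et al., 2017); SUPER's own routers differ and are not reproduced here.

# Classic capsule routing-by-agreement (Sabour et al., 2017), for reference only.
import torch

def routing_by_agreement(u_hat: torch.Tensor, iters: int = 3) -> torch.Tensor:
    # u_hat: (batch, in_caps, out_caps, dim) prediction vectors.
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)  # routing logits
    for _ in range(iters):
        c = torch.softmax(b, dim=2)                 # coupling over output capsules
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)    # weighted sum over input capsules
        # Squash non-linearity keeps output vector lengths in [0, 1).
        norm2 = (s ** 2).sum(dim=-1, keepdim=True)
        v = (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + 1e-9)
        # Agreement update: logits grow where predictions align with outputs.
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)
    return v                                        # (batch, out_caps, dim)

v = routing_by_agreement(torch.randn(8, 32, 10, 16))
print(v.shape)  # torch.Size([8, 10, 16])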