Transformer Module Networks for Systematic Generalization in Visual
Question Answering
- URL: http://arxiv.org/abs/2201.11316v1
- Date: Thu, 27 Jan 2022 04:22:25 GMT
- Title: Transformer Module Networks for Systematic Generalization in Visual
Question Answering
- Authors: Moyuru Yamada, Vanessa D'Amario, Kentaro Takemoto, Xavier Boix, and
Tomotake Sasaki
- Abstract summary: Transformer Module Network (TMN) dynamically composes modules into a question-specific Transformer network.
TMNs achieve state-of-the-art systematic generalization performance in three VQA datasets.
- Score: 4.169829151981242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based models achieve great performance on Visual Question
Answering (VQA). However, when we evaluate them on systematic generalization,
i.e., handling novel combinations of known concepts, their performance
degrades. Neural Module Networks (NMNs) are a promising approach for systematic
generalization that consists of composing modules, i.e., neural networks that
tackle a sub-task. Inspired by Transformers and NMNs, we propose Transformer
Module Network (TMN), a novel Transformer-based model for VQA that dynamically
composes modules into a question-specific Transformer network. TMNs achieve
state-of-the-art systematic generalization performance in three VQA datasets,
namely, CLEVR-CoGenT, CLOSURE and GQA-SGL, in some cases improving more than
30% over standard Transformers.
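The core idea of composing sub-task modules into a question-specific network can be illustrated with a minimal sketch. This is not the paper's implementation: the module names, the toy "program", and the dictionary-based state are all hypothetical stand-ins for TMN's Transformer-based modules, shown only to make the composition mechanism concrete.

```python
# Hypothetical sketch of TMN-style dynamic module composition.
# Each "module" stands in for a Transformer block specialized for a
# sub-task; the "program" stands in for the question's functional parse.
from typing import Callable, Dict, List, Optional, Tuple

def filter_color(state: dict, arg: str) -> dict:
    """Keep only the scene objects matching the given color."""
    state = dict(state)
    state["objects"] = [o for o in state["objects"] if o["color"] == arg]
    return state

def count(state: dict, arg: Optional[str]) -> dict:
    """Answer with the number of objects currently in the state."""
    state = dict(state)
    state["answer"] = len(state["objects"])
    return state

MODULES: Dict[str, Callable] = {"filter_color": filter_color, "count": count}

def compose(program: List[Tuple[str, Optional[str]]]) -> Callable:
    """Chain the named modules into one question-specific pipeline,
    analogous to TMN assembling a question-specific Transformer network."""
    def run(state: dict) -> dict:
        for name, arg in program:
            state = MODULES[name](state, arg)
        return state
    return run

scene = {"objects": [{"color": "red"}, {"color": "blue"}, {"color": "red"}]}
# Program for the question "How many red objects are there?"
network = compose([("filter_color", "red"), ("count", None)])
print(network(scene)["answer"])  # 2
```

Because the pipeline is assembled per question, novel combinations of known modules (e.g. a color filter followed by a different sub-task) require no retraining of the composition mechanism, which is the intuition behind the systematic generalization gains reported above.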
Related papers
- Unifying Dimensions: A Linear Adaptive Approach to Lightweight Image Super-Resolution [6.857919231112562]
Window-based transformers have demonstrated outstanding performance in super-resolution tasks.
However, they exhibit higher computational complexity and inference latency than convolutional neural networks.
We construct a convolution-based Transformer framework named the linear adaptive mixer network (LAMNet).
arXiv Detail & Related papers (2024-09-26T07:24:09Z) - Breaking Neural Network Scaling Laws with Modularity [8.482423139660153]
We show how the amount of training data required to generalize varies with the intrinsic dimensionality of a task's input.
We then develop a novel learning rule for modular networks to exploit this advantage.
arXiv Detail & Related papers (2024-09-09T16:43:09Z) - NAR-Former V2: Rethinking Transformer for Universal Neural Network
Representation Learning [25.197394237526865]
We propose a modified Transformer-based universal neural network representation learning model NAR-Former V2.
Specifically, we take the network as a graph and design a straightforward tokenizer to encode the network into a sequence.
We incorporate the inductive representation learning capability of GNN into Transformer, enabling Transformer to generalize better when encountering unseen architecture.
arXiv Detail & Related papers (2023-06-19T09:11:04Z) - Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which offers minor code modifications and negligible extra computational cost.
arXiv Detail & Related papers (2023-03-27T11:13:50Z) - Semantic-aware Modular Capsule Routing for Visual Question Answering [55.03883681191765]
We propose a Semantic-aware modUlar caPsulE framework, termed as SUPER, to better capture the instance-specific vision-semantic characteristics.
We validate the effectiveness and generalization ability of the proposed SUPER scheme through comparisons on five benchmark datasets.
arXiv Detail & Related papers (2022-07-21T10:48:37Z) - Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformer and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
In practice, extensive experiments show that the proposed hierarchical multimodal transformer (HMT) surpasses most traditional, RNN-based, and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z) - How Modular Should Neural Module Networks Be for Systematic
Generalization? [4.533408938245526]
NMNs aim at Visual Question Answering (VQA) via the composition of modules, each of which tackles a sub-task.
In this paper, we demonstrate that the stage and the degree at which modularity is defined have a large influence on systematic generalization.
arXiv Detail & Related papers (2021-06-15T14:13:47Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - RE-MIMO: Recurrent and Permutation Equivariant Neural MIMO Detection [85.44877328116881]
We present a novel neural network for symbol detection in wireless communication systems, motivated by several important practical considerations.
We compare its performance against existing methods and the results show the ability of our network to efficiently handle a variable number of transmitters.
arXiv Detail & Related papers (2020-06-30T22:43:01Z) - Variational Transformers for Diverse Response Generation [71.53159402053392]
Variational Transformer (VT) is a variational self-attentive feed-forward sequence model.
VT combines the parallelizability and global receptive field computation of the Transformer with the variational nature of the CVAE.
We explore two types of VT: 1) modeling the discourse-level diversity with a global latent variable; and 2) augmenting the Transformer decoder with a sequence of fine-grained latent variables.
arXiv Detail & Related papers (2020-03-28T07:48:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.