Visual Concept Reasoning Networks
- URL: http://arxiv.org/abs/2008.11783v1
- Date: Wed, 26 Aug 2020 20:02:40 GMT
- Title: Visual Concept Reasoning Networks
- Authors: Taesup Kim, Sungwoong Kim, Yoshua Bengio
- Abstract summary: A split-transform-merge strategy has been broadly used as an architectural constraint in convolutional neural networks for visual recognition tasks.
We propose to exploit this strategy and combine it with our Visual Concept Reasoning Networks (VCRNet) to enable reasoning between high-level visual concepts.
Our proposed model, VCRNet, consistently improves performance while increasing the number of parameters by less than 1%.
- Score: 93.99840807973546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A split-transform-merge strategy has been broadly used as an architectural
constraint in convolutional neural networks for visual recognition tasks. It
approximates sparsely connected networks by explicitly defining multiple
branches to simultaneously learn representations with different visual concepts
or properties. Dependencies or interactions between these representations, however, are typically defined by dense and local operations, without any adaptiveness or high-level reasoning. In this work, we propose to exploit this
strategy and combine it with our Visual Concept Reasoning Networks (VCRNet) to
enable reasoning between high-level visual concepts. We associate each branch
with a visual concept and derive a compact concept state by selecting a few
local descriptors through an attention module. These concept states are then
updated by graph-based interaction and used to adaptively modulate the local
descriptors. We describe our proposed model by
split-transform-attend-interact-modulate-merge stages, which are implemented by
opting for a highly modularized architecture. Extensive experiments on visual
recognition tasks such as image classification, semantic segmentation, object
detection, scene recognition, and action recognition show that our proposed
model, VCRNet, consistently improves performance while increasing the number
of parameters by less than 1%.
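To make the split-transform-attend-interact-modulate-merge description above concrete, here is a minimal PyTorch-style sketch of one such block. It is an illustrative assumption, not the authors' released code: the branch count, channel widths, single-query spatial attention, and mean-message concept graph are simplifications chosen for brevity.
```python
# Minimal sketch of a split-transform-attend-interact-modulate-merge block.
# Illustrative only: branch count, channel widths, and the simplified concept
# graph are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualConceptReasoningBlock(nn.Module):
    def __init__(self, channels: int, num_concepts: int = 4):
        super().__init__()
        branch_ch = channels // num_concepts
        # Split + transform: one lightweight conv branch per visual concept.
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, branch_ch, 3, padding=1) for _ in range(num_concepts)]
        )
        # Attend: score local descriptors to pool a compact concept state.
        self.attend = nn.ModuleList(
            [nn.Conv2d(branch_ch, 1, 1) for _ in range(num_concepts)]
        )
        # Interact (simplified): each concept state receives a shared mean message.
        self.interact = nn.Linear(branch_ch, branch_ch)
        # Modulate: map each updated concept state to channel-wise gates.
        self.modulate = nn.ModuleList(
            [nn.Linear(branch_ch, branch_ch) for _ in range(num_concepts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats, states, outputs = [], [], []
        for conv, att in zip(self.branches, self.attend):
            f = F.relu(conv(x))                        # transform
            w = torch.softmax(att(f).flatten(2), -1)   # attention over locations
            s = (f.flatten(2) * w).sum(-1)             # compact concept state
            feats.append(f)
            states.append(s)
        states = torch.stack(states, dim=1)            # (B, K, C')
        msg = self.interact(states.mean(dim=1, keepdim=True))
        states = states + msg                          # graph-style interaction
        for k, (f, mod) in enumerate(zip(feats, self.modulate)):
            gate = torch.sigmoid(mod(states[:, k]))[..., None, None]
            outputs.append(f * gate)                   # modulate local descriptors
        # Merge: concatenate modulated branches back to the input width.
        return torch.cat(outputs, dim=1)
```
For example, for an input of shape (2, 64, 32, 32), VisualConceptReasoningBlock(64, num_concepts=4) returns a tensor of the same shape, with each 16-channel branch gated by its graph-updated concept state.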
Related papers
- Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the most classic approaches in machine learning and data analysis.
We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data.
FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
arXiv Detail & Related papers (2024-03-26T06:04:50Z)
- Concept-Centric Transformers: Enhancing Model Interpretability through Object-Centric Concept Learning within a Shared Global Workspace [1.6574413179773757]
Concept-Centric Transformers is a simple yet effective configuration of the shared global workspace for interpretability.
We show that our model achieves better classification accuracy than all baselines across all problems.
arXiv Detail & Related papers (2023-05-25T06:37:39Z)
- Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn the discriminative part features and explore correlations with a feature transformation module.
Our proposed approach does not rely on additional part branches and reaches state-of-the-art performance on three fine-grained object recognition benchmarks.
arXiv Detail & Related papers (2022-12-28T03:45:56Z)
- ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network [8.395400675921515]
ViGAT is a pure-attention bottom-up approach to derive object and frame features.
A head network is proposed to process these features for the task of event recognition and explanation in video.
A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three large, publicly available video datasets.
arXiv Detail & Related papers (2022-07-20T14:12:05Z)
- Cross-Modal Discrete Representation Learning [73.68393416984618]
We present a self-supervised learning framework that learns a representation that captures finer levels of granularity across different modalities.
Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities.
arXiv Detail & Related papers (2021-06-10T00:23:33Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
- Contextual Encoder-Decoder Network for Visual Saliency Prediction [42.047816176307066]
We propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task.
We combine the resulting representations with global scene information for accurately predicting visual saliency.
Compared to state-of-the-art approaches, the network is based on a lightweight image classification backbone.
arXiv Detail & Related papers (2019-02-18T16:15:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.