Solving Reasoning Tasks with a Slot Transformer
- URL: http://arxiv.org/abs/2210.11394v1
- Date: Thu, 20 Oct 2022 16:40:30 GMT
- Title: Solving Reasoning Tasks with a Slot Transformer
- Authors: Ryan Faulkner, Daniel Zoran
- Abstract summary: We present the Slot Transformer, an architecture that leverages slot attention, transformers and iterative variational inference on video scene data to infer representations.
We evaluate the effectiveness of key components of the architecture, the model's representational capacity and its ability to predict from incomplete input.
- Score: 7.966351917016229
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to carve the world into useful abstractions in order to reason
about time and space is a crucial component of intelligence. In order to
successfully perceive and act effectively using senses we must parse and
compress large amounts of information for further downstream reasoning to take
place, allowing increasingly complex concepts to emerge. If there is any hope
to scale representation learning methods to work with real world scenes and
temporal dynamics then there must be a way to learn accurate, concise, and
composable abstractions across time. We present the Slot Transformer, an
architecture that leverages slot attention, transformers and iterative
variational inference on video scene data to infer such representations. We
evaluate the Slot Transformer on CLEVRER, Kinetics-600 and CATER datasets and
demonstrate that the approach allows us to develop robust modeling and
reasoning around complex behaviours as well as scores on these datasets that
compare favourably to existing baselines. Finally we evaluate the effectiveness
of key components of the architecture, the model's representational capacity
and its ability to predict from incomplete input.
Related papers
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z)
- Reasoning-Enhanced Object-Centric Learning for Videos [15.554898985821302]
We develop a Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes.
Our experiment results on various datasets show that STATM can significantly enhance object-centric learning capabilities of slot-based video models.
arXiv Detail & Related papers (2024-03-22T14:41:55Z)
- Emergence and Function of Abstract Representations in Self-Supervised Transformers [0.0]
We study the inner workings of small-scale transformers trained to reconstruct partially masked visual scenes.
We show that the network develops intermediate abstract representations, or abstractions, that encode all semantic features of the dataset.
Using precise manipulation experiments, we demonstrate that abstractions are central to the network's decision-making process.
arXiv Detail & Related papers (2023-12-08T20:47:15Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework could attain better performances than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Robust and Controllable Object-Centric Learning through Energy-based Models [95.68748828339059]
Ours is a conceptually simple and general approach to learning object-centric representations through an energy-based model.
We show that ours can be easily integrated into existing architectures and can effectively extract high-quality object-centric representations.
arXiv Detail & Related papers (2022-10-11T15:11:15Z)
- SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans) to incorporate object structure information into the transformer for enhancing discriminative representation learning.
The two proposed modules are lightweight and can be plugged into any transformer network and trained end-to-end easily.
Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z)
- Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks [88.77951448313486]
We present a new approach for model acceleration by exploiting spatial sparsity in visual data.
We propose a dynamic token sparsification framework to prune redundant tokens.
We extend our method to hierarchical models including CNNs and hierarchical vision Transformers.
arXiv Detail & Related papers (2022-07-04T17:00:51Z)
- Attention-based Adversarial Appearance Learning of Augmented Pedestrians [49.25430012369125]
We propose a method to synthesize realistic data for the pedestrian recognition task.
Our approach utilizes an attention mechanism driven by an adversarial loss to learn domain discrepancies.
Our experiments confirm that the proposed adaptation method is robust to such discrepancies and reveals both visual realism and semantic consistency.
arXiv Detail & Related papers (2021-07-06T15:27:00Z)
- Generative Adversarial Transformers [13.633811200719627]
We introduce the GANsformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling.
The network employs a bipartite structure that enables long-range interactions across the image, while maintaining computation of linear efficiency.
We show it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data-efficiency.
arXiv Detail & Related papers (2021-03-01T18:54:04Z)