Generative Adversarial Transformers
- URL: http://arxiv.org/abs/2103.01209v2
- Date: Tue, 2 Mar 2021 18:39:04 GMT
- Title: Generative Adversarial Transformers
- Authors: Drew A. Hudson and C. Lawrence Zitnick
- Abstract summary: We introduce the GANsformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling.
The network employs a bipartite structure that enables long-range interactions across the image, while maintaining linear computational efficiency.
We show it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data-efficiency.
- Score: 13.633811200719627
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the GANsformer, a novel and efficient type of transformer, and
explore it for the task of visual generative modeling. The network employs a
bipartite structure that enables long-range interactions across the image,
while maintaining linear computational efficiency, allowing it to readily scale to
high-resolution synthesis. It iteratively propagates information from a set of
latent variables to the evolving visual features and vice versa, to support the
refinement of each in light of the other and encourage the emergence of
compositional representations of objects and scenes. In contrast to the classic
transformer architecture, it utilizes multiplicative integration that allows
flexible region-based modulation, and can thus be seen as a generalization of
the successful StyleGAN network. We demonstrate the model's strength and
robustness through a careful evaluation over a range of datasets, from
simulated multi-object environments to rich real-world indoor and outdoor
scenes, showing it achieves state-of-the-art results in terms of image quality
and diversity, while enjoying fast learning and better data-efficiency. Further
qualitative and quantitative experiments offer us an insight into the model's
inner workings, revealing improved interpretability and stronger
disentanglement, and illustrating the benefits and efficacy of our approach. An
implementation of the model is available at
https://github.com/dorarad/gansformer.
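To make the bipartite mechanism concrete, below is a minimal sketch of one direction of the latent-to-feature attention with multiplicative, StyleGAN-style modulation. The shapes, layer sizes, and names are illustrative assumptions, not the reference implementation (see the linked repository for the actual model).

```python
# Minimal sketch of the GANsformer's bipartite attention idea. Hypothetical
# simplification, not the reference implementation at
# https://github.com/dorarad/gansformer.
import torch
import torch.nn as nn

class BipartiteAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_gain = nn.Linear(dim, dim)  # multiplicative modulation
        self.to_bias = nn.Linear(dim, dim)  # additive modulation
        self.scale = dim ** -0.5

    def forward(self, feats: torch.Tensor, latents: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) image features; latents: (B, K, dim) with K << N,
        # so each layer costs O(N*K) instead of the O(N^2) of self-attention.
        q = self.to_q(feats)
        k = self.to_k(latents)
        v = self.to_v(latents)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        ctx = attn @ v  # (B, N, dim): per-position mixture of latent contents
        # Multiplicative integration: region-wise gain and bias on features,
        # a generalization of StyleGAN's global style modulation.
        return feats * (1 + self.to_gain(ctx)) + self.to_bias(ctx)

layer = BipartiteAttention(dim=64)
feats = torch.randn(2, 256, 64)   # e.g. a 16x16 feature grid, flattened
latents = torch.randn(2, 16, 64)  # a small set of latent components
out = layer(feats, latents)       # (2, 256, 64)
```

The duplex variant described in the abstract also runs the reverse direction (latents attending to features), so the latents are iteratively refined by the very features they modulate.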
Related papers
- Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment [20.902935570581207]
We introduce a Multimodal Alignment and Reconstruction Network (MARNet) to enhance the model's resistance to visual noise.
MARNet includes a cross-modal diffusion reconstruction module for smoothly and stably blending information across different domains.
Experiments conducted on two benchmark datasets, Vireo-Food172 and Ingredient-101, demonstrate that MARNet effectively improves the quality of image information extracted by the model.
arXiv Detail & Related papers (2024-07-26T16:30:18Z) - Transformers and Slot Encoding for Sample Efficient Physical World Modelling [1.5498250598583487]
We propose an architecture combining Transformers for world modelling with the slot-attention paradigm, an approach for learning representations of objects appearing in a scene.
We describe the resulting neural architecture and report experimental results showing an improvement over existing solutions in sample efficiency and a reduction in performance variance across training examples.
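For context, here is a condensed sketch of the slot-attention update (after Locatello et al., 2020) that this combination builds on; the iteration count, sizes, and omitted MLP refinement step are illustrative simplifications, not the paper's exact configuration.

```python
# Condensed sketch of slot attention: a fixed set of slots competes to
# explain the scene features, yielding object-centric representations.
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, dim: int, n_iters: int = 3):
        super().__init__()
        self.n_iters = n_iters
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.scale = dim ** -0.5

    def forward(self, inputs: torch.Tensor, slots: torch.Tensor) -> torch.Tensor:
        # inputs: (B, N, dim) scene features; slots: (B, K, dim) object slots.
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        for _ in range(self.n_iters):
            q = self.to_q(self.norm_slots(slots))
            logits = q @ k.transpose(-2, -1) * self.scale  # (B, K, N)
            attn = torch.softmax(logits, dim=1)            # slots compete per input
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)
            updates = attn @ v                             # (B, K, dim)
            slots = self.gru(
                updates.reshape(-1, updates.shape[-1]),
                slots.reshape(-1, slots.shape[-1]),
            ).reshape(slots.shape)
        return slots

slots = SlotAttention(dim=64)(torch.randn(2, 400, 64), torch.randn(2, 7, 64))
```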
arXiv Detail & Related papers (2024-05-30T15:48:04Z) - Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization [62.157627519792946]
We introduce a novel framework called bridged transfer, which initially employs synthetic images for fine-tuning a pre-trained model to improve its transferability.
We propose a dataset style inversion strategy to improve the stylistic alignment between synthetic and real images.
Our proposed methods are evaluated across 10 different datasets and 5 distinct models, demonstrating consistent improvements.
arXiv Detail & Related papers (2024-03-28T22:25:05Z) - Flow Factorized Representation Learning [109.51947536586677]
We introduce a generative model which specifies a distinct set of latent probability paths that define different input transformations.
We show that our model achieves higher likelihoods on standard representation learning benchmarks while simultaneously being closer to approximately equivariant models.
arXiv Detail & Related papers (2023-09-22T20:15:37Z) - Multi-Context Dual Hyper-Prior Neural Image Compression [10.349258638494137]
We propose a Transformer-based nonlinear transform to efficiently capture both local and global information from the input image.
We also introduce a novel entropy model that incorporates two different hyperpriors to model cross-channel and spatial dependencies of the latent representation.
Our experiments show that the proposed framework outperforms state-of-the-art methods in rate-distortion performance.
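As background on how a hyperprior branch operates: side information z is derived from the latent y, quantized, and decoded into the mean and scale of a conditional Gaussian entropy model. The sketch below pairs two such branches, approximating "cross-channel" versus "spatial" with 1x1 versus 5x5 kernels; this split, and the omission of downsampling and rate estimation for z, are our assumptions rather than the paper's architecture.

```python
# Rough sketch of a dual-hyperprior entropy model; illustrative only.
import torch
import torch.nn as nn

class HyperPrior(nn.Module):
    """One hyperprior branch: side info z models the structure of latent y."""
    def __init__(self, c: int, k: int):
        super().__init__()
        self.h_a = nn.Sequential(  # hyper-encoder y -> z
            nn.Conv2d(c, c, k, padding=k // 2), nn.ReLU(),
            nn.Conv2d(c, c, k, padding=k // 2))
        self.h_s = nn.Sequential(  # hyper-decoder z_hat -> (mean, scale)
            nn.Conv2d(c, c, k, padding=k // 2), nn.ReLU(),
            nn.Conv2d(c, 2 * c, k, padding=k // 2))

    def forward(self, y: torch.Tensor):
        z = self.h_a(y)
        z_hat = z + torch.empty_like(z).uniform_(-0.5, 0.5)  # quantization proxy
        mean, log_scale = self.h_s(z_hat).chunk(2, dim=1)
        return mean, log_scale.exp()  # Gaussian parameters for coding y

# 1x1 kernels bias one branch toward cross-channel structure, 5x5 kernels
# bias the other toward spatial neighborhoods.
channel_prior = HyperPrior(c=192, k=1)
spatial_prior = HyperPrior(c=192, k=5)
y = torch.randn(1, 192, 16, 16)
(m1, s1), (m2, s2) = channel_prior(y), spatial_prior(y)
```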
arXiv Detail & Related papers (2023-09-19T17:44:44Z) - UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Improving Sample Efficiency of Value Based Models Using Attention and Vision Transformers [52.30336730712544]
We introduce a deep reinforcement learning architecture whose purpose is to increase sample efficiency without sacrificing performance.
We propose a visually attentive model that uses transformers to learn a self-attention mechanism on the feature maps of the state representation.
We demonstrate empirically that this architecture improves sample complexity for several Atari environments, while also achieving better performance in some of the games.
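As a hypothetical illustration of the general recipe rather than the paper's exact architecture, one can flatten a CNN state encoding into tokens, run self-attention over the spatial positions, and regress Q-values from the pooled result:

```python
# Illustrative sketch: self-attention over the spatial positions of a CNN
# state encoding, feeding a value-based (Q-learning) head. Sizes are
# placeholders, not the paper's configuration.
import torch
import torch.nn as nn

class AttentiveQNet(nn.Module):
    def __init__(self, n_actions: int, dim: int = 64):
        super().__init__()
        self.encoder = nn.Conv2d(4, dim, kernel_size=8, stride=4)  # frame stack
        self.attn = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                               batch_first=True)
        self.head = nn.Linear(dim, n_actions)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (B, 4, 84, 84) Atari-style stacked grayscale frames.
        f = self.encoder(obs)                  # (B, dim, H, W)
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W, dim) spatial tokens
        tokens = self.attn(tokens)             # attend across positions
        return self.head(tokens.mean(dim=1))   # (B, n_actions) Q-values

q_net = AttentiveQNet(n_actions=6)
print(q_net(torch.randn(1, 4, 84, 84)).shape)  # torch.Size([1, 6])
```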
arXiv Detail & Related papers (2022-02-01T19:03:03Z) - Multiscale Generative Models: Improving Performance of a Generative Model Using Feedback from Other Dependent Generative Models [10.053377705165786]
We take a first step towards building interacting generative models (GANs) that reflect the interactions of the real world.
We build and analyze a hierarchical set-up where a higher-level GAN is conditioned on the output of multiple lower-level GANs.
We present a technique for using feedback from the higher-level GAN to improve the performance of the lower-level GANs.
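Schematically, the set-up amounts to a higher-level generator conditioned on lower-level samples, with the higher level's training signal backpropagating into the lower generators as feedback. Module sizes and the stand-in loss below are illustrative assumptions:

```python
# Illustrative skeleton of hierarchical, interacting generators.
import torch
import torch.nn as nn

class LowGen(nn.Module):
    """Lower-level generator: noise -> sample."""
    def __init__(self, z_dim: int = 16, out_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(),
                                 nn.Linear(64, out_dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

class HighGen(nn.Module):
    """Higher-level generator conditioned on the lower-level outputs."""
    def __init__(self, n_low: int = 2, low_dim: int = 32, out_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_low * low_dim, 64), nn.ReLU(),
                                 nn.Linear(64, out_dim))

    def forward(self, low_outputs: list) -> torch.Tensor:
        return self.net(torch.cat(low_outputs, dim=-1))

lows = [LowGen(), LowGen()]
high = HighGen(n_low=2)
samples = [g(torch.randn(8, 16)) for g in lows]
x_high = high(samples)
# Feedback: a real set-up would use the higher-level discriminator's loss;
# here a stand-in loss shows gradients flowing back into both LowGens.
x_high.pow(2).mean().backward()
```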
arXiv Detail & Related papers (2022-01-24T13:05:56Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model [58.17021225930069]
We explain the rationale of the Vision Transformer by analogy with the proven and practical Evolutionary Algorithm (EA).
We propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly.
Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works.
arXiv Detail & Related papers (2021-05-31T16:20:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.