DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation
- URL: http://arxiv.org/abs/2207.06124v3
- Date: Mon, 27 Mar 2023 07:55:32 GMT
- Title: DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation
- Authors: Songhua Liu, Jingwen Ye, Sucheng Ren, Xinchao Wang
- Abstract summary: We propose a dynamic sparse attention-based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation in the optimal number of tokens one position should focus on.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
- Score: 56.514462874501675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One key challenge of exemplar-guided image generation lies in establishing
fine-grained correspondences between input and guided images. Prior approaches,
despite the promising results, have relied on either estimating dense attention
to compute per-point matching, which is limited to only coarse scales due to
the quadratic memory cost, or fixing the number of correspondences to achieve
linear complexity, which lacks flexibility. In this paper, we propose a dynamic
sparse attention-based Transformer model, termed Dynamic Sparse Transformer
(DynaST), to achieve fine-level matching with favorable efficiency. The heart
of our approach is a novel dynamic-attention unit, dedicated to covering the
variation in the optimal number of tokens one position should focus on.
Specifically, DynaST leverages the multi-layer nature of Transformer structure,
and performs the dynamic attention scheme in a cascaded manner to refine
matching results and synthesize visually-pleasing outputs. In addition, we
introduce a unified training objective for DynaST, making it a versatile
reference-based image translation framework for both supervised and
unsupervised scenarios. Extensive experiments on three applications,
pose-guided person image generation, edge-based face synthesis, and undistorted
image style transfer, demonstrate that DynaST achieves superior performance in
local details, outperforming the state of the art while reducing the
computational cost significantly. Our code is available at
https://github.com/Huage001/DynaST
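As a rough, illustrative sketch only (not the authors' actual formulation; DynaST learns a dynamic-attention unit and cascades it across Transformer layers to refine matches), the PyTorch snippet below shows one way a per-query dynamic token budget could be realized: every query keeps only the keys whose attention weight exceeds a fraction of that query's strongest match, so the number of correspondences varies by position. The function name and the `keep_ratio` parameter are hypothetical.

```python
import torch
import torch.nn.functional as F

def dynamic_sparse_attention(query, key, value, keep_ratio=0.5):
    """Toy dynamic sparse attention (illustrative; not DynaST's exact scheme).

    Each query attends only to keys whose softmax weight is at least
    `keep_ratio` times that query's strongest weight, so the number of
    attended tokens varies per position instead of being fixed.
    Shapes: query, key, value are (batch, num_tokens, dim).
    """
    dim = query.size(-1)
    # Dense scores are used here only for clarity; a practical implementation
    # would avoid materializing the full matrix at fine scales.
    scores = query @ key.transpose(-2, -1) / dim ** 0.5
    weights = F.softmax(scores, dim=-1)
    # Per-query dynamic cutoff; the strongest key is always retained.
    keep = weights >= weights.amax(dim=-1, keepdim=True) * keep_ratio
    weights = weights * keep
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights @ value, keep.sum(dim=-1)  # output and tokens kept per query

# Usage: the kept-token count differs across positions.
q, k, v = (torch.randn(1, 8, 32) for _ in range(3))
out, kept = dynamic_sparse_attention(q, k, v)
print(out.shape, kept)
```

In the paper's actual model, this per-position selection is not made once but is refined in a cascaded manner across Transformer layers, which is what enables fine-scale matching without the quadratic cost of dense attention.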
Related papers
- Scalable Visual State Space Model with Fractal Scanning [16.077348474371547]
State Space Models (SSMs) have emerged as efficient alternatives to Transformer models.
We propose using fractal scanning curves for patch serialization.
We validate our method in image classification, detection, and segmentation tasks.
arXiv Detail & Related papers (2024-05-23T12:12:11Z) - DynaSeg: A Deep Dynamic Fusion Method for Unsupervised Image Segmentation Incorporating Feature Similarity and Spatial Continuity [0.5755004576310334]
We introduce DynaSeg, an innovative unsupervised image segmentation approach.
Unlike traditional methods, DynaSeg employs a dynamic weighting scheme.
It adapts flexibly to image characteristics, and facilitates easy integration with other segmentation networks.
arXiv Detail & Related papers (2024-05-09T00:30:45Z) - Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks [53.67497327319569]
We introduce a novel neural rendering technique to solve image-to-3D from a single view.
Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks.
Our experiments show the advantages of our proposed approach with consistent results and rapid generation.
arXiv Detail & Related papers (2023-12-24T08:42:37Z) - Corner-to-Center Long-range Context Model for Efficient Learned Image Compression [70.0411436929495]
In the framework of learned image compression, the context model plays a pivotal role in capturing the dependencies among latent representations.
We propose the Corner-to-Center transformer-based Context Model (C$^3$M) designed to enhance context and latent predictions.
In addition, to enlarge the receptive field in the analysis and synthesis transformation, we use the Long-range Crossing Attention Module (LCAM) in the encoder/decoder.
arXiv Detail & Related papers (2023-11-29T21:40:28Z) - Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low-dynamic range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose a novel alignment-free network with a Semantics Consistent Transformer (SCTNet) with both spatial and channel attention modules.
arXiv Detail & Related papers (2023-05-29T15:03:23Z) - Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-Depthformer framework, which predicts scene depth and 3D motion field jointly.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers in order to obtain enhanced depth feature representation.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic prior.
arXiv Detail & Related papers (2023-01-14T09:43:23Z) - XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images.
arXiv Detail & Related papers (2021-06-17T17:33:35Z) - Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model [58.17021225930069]
We explain the rationale of the Vision Transformer by analogy with the proven, practical Evolutionary Algorithm (EA).
We propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly.
Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works.
arXiv Detail & Related papers (2021-05-31T16:20:03Z) - High-Resolution Complex Scene Synthesis with Transformers [6.445605125467574]
Coarse-grained synthesis of complex scene images via deep generative models has recently gained popularity.
We present an approach to this task, where the generative model is based on pure likelihood training without additional objectives.
We show that the resulting system is able to synthesize high-quality images consistent with the given layouts.
arXiv Detail & Related papers (2021-05-13T17:56:07Z)
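To make the cross-covariance attention (XCA) item above concrete, here is a minimal sketch of channel-wise ("transposed") attention: the softmax is computed over a d x d channel-to-channel matrix instead of the usual N x N token-to-token matrix, so the cost grows linearly with the number of tokens N. This is a simplified illustration; the real XCiT layer also uses block-diagonal heads and a learnable temperature, both omitted here.

```python
import torch
import torch.nn.functional as F

def cross_covariance_attention(q, k, v, temperature=1.0):
    """Sketch of channel-wise ("transposed") attention in the spirit of XCA.

    The attention map is d x d (channels x channels) rather than N x N
    (tokens x tokens), so the cost is linear in the number of tokens N.
    q, k, v: (batch, N, d).
    """
    # L2-normalize along the token dimension so channel similarities are bounded.
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    # (batch, d, d) channel-to-channel attention map.
    attn = F.softmax(k.transpose(-2, -1) @ q / temperature, dim=-1)
    # Recombine the channels of V with the channel attention: (batch, N, d).
    return v @ attn

# Usage: output keeps the token layout of the input.
tokens = torch.randn(2, 196, 64)
print(cross_covariance_attention(tokens, tokens, tokens).shape)  # (2, 196, 64)
```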
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.