ASSET: Autoregressive Semantic Scene Editing with Transformers at High
Resolutions
- URL: http://arxiv.org/abs/2205.12231v1
- Date: Tue, 24 May 2022 17:39:53 GMT
- Title: ASSET: Autoregressive Semantic Scene Editing with Transformers at High
Resolutions
- Authors: Difan Liu, Sandesh Shetty, Tobias Hinz, Matthew Fisher, Richard Zhang,
Taesung Park, Evangelos Kalogerakis
- Abstract summary: Our architecture is based on a transformer with a novel attention mechanism.
Our key idea is to sparsify the transformer's attention matrix at high resolutions, guided by dense attention extracted at lower image resolutions.
We present qualitative and quantitative results, along with user studies, demonstrating the effectiveness of our method.
- Score: 28.956280590967808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present ASSET, a neural architecture for automatically modifying an input
high-resolution image according to a user's edits on its semantic segmentation
map. Our architecture is based on a transformer with a novel attention
mechanism. Our key idea is to sparsify the transformer's attention matrix at
high resolutions, guided by dense attention extracted at lower image
resolutions. While previous attention mechanisms are computationally too
expensive for handling high-resolution images or are overly constrained within
specific image regions hampering long-range interactions, our novel attention
mechanism is both computationally efficient and effective. Our sparsified
attention mechanism is able to capture long-range interactions and context,
leading to synthesizing interesting phenomena in scenes, such as reflections of
landscapes onto water or flora consistent with the rest of the landscape, that
could not be generated reliably by previous convnet- and transformer-based
approaches. We present qualitative and quantitative results, along with user
studies, demonstrating the effectiveness of our method.
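To make the key idea concrete, here is a minimal, heavily simplified sketch (an illustration under assumed shapes and interfaces, not the authors' released code): dense attention computed on a low-resolution token grid selects, for each query, a few promising low-resolution cells, and the high-resolution pass attends only to the tokens inside those cells.

```python
import torch

def guided_sparse_attention(q, k, v, low_attn, hw, s, topk=4):
    # q, k, v: (B, N, C) high-res tokens on an (h*s, w*s) grid, N = h*s*w*s.
    # low_attn: (B, n, n) dense attention from the low-res pass, n = h*w.
    # Each high-res query attends only to tokens in the `topk` low-res
    # cells that its own cell attended to most strongly.
    B, N, C = q.shape
    h, w = hw
    n = h * w
    H, W = h * s, w * s
    # Low-res cell index of every high-res token (row-major layout).
    ys = torch.arange(H).repeat_interleave(W)
    xs = torch.arange(W).repeat(H)
    cell = (ys // s) * w + (xs // s)                       # (N,)
    # The topk most-attended low-res cells for each low-res cell.
    top_cells = low_attn.topk(topk, dim=-1).indices        # (B, n, topk)
    # High-res token members of each low-res cell.
    members = torch.stack([(cell == c).nonzero(as_tuple=True)[0]
                           for c in range(n)])             # (n, s*s)
    key_idx = members[top_cells].flatten(2)[:, cell]       # (B, N, topk*s*s)
    def select(t):  # gather the kept keys/values for every query
        return torch.gather(
            t.unsqueeze(1).expand(-1, N, -1, -1), 2,
            key_idx.unsqueeze(-1).expand(-1, -1, -1, C))
    k_sel, v_sel = select(k), select(v)                    # (B, N, K, C)
    scores = (q.unsqueeze(2) * k_sel).sum(-1) / C ** 0.5   # (B, N, K)
    return (scores.softmax(-1).unsqueeze(-1) * v_sel).sum(2)  # (B, N, C)
```

With K = topk*s*s kept keys per query, the high-resolution pass costs O(N*K) rather than the O(N^2) of dense attention, which is what makes high resolutions tractable.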
Related papers
- Empowering Image Recovery: A Multi-Attention Approach [96.25892659985342]
Diverse Restormer (DART) is an image restoration method that integrates information from various sources to address restoration challenges.
DART employs customized attention mechanisms to enhance overall performance.
Evaluation across five restoration tasks consistently positions DART at the forefront.
arXiv Detail & Related papers (2024-04-06T12:50:08Z)
- IPT-V2: Efficient Image Processing Transformer using Hierarchical Attentions [26.09373405194564]
We present an efficient image processing transformer architecture with hierarchical attentions, called IPT-V2.
We adopt a focal context self-attention (FCSA) and a global grid self-attention (GGSA) to obtain adequate token interactions in local and global receptive fields.
Our proposed IPT-V2 achieves state-of-the-art results on various image processing tasks, covering denoising, deblurring, and deraining, and obtains a much better trade-off between performance and computational complexity than previous methods (see the sketch after this entry).
arXiv Detail & Related papers (2024-03-31T10:01:20Z)
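The local/global split that FCSA and GGSA describe is commonly realized by two complementary token partitions. The sketch below is a generic illustration of that pattern, not the paper's implementation: `window_partition` groups neighboring tokens (local receptive field), while `grid_partition` groups tokens strided across the whole image (global receptive field); standard self-attention is then applied within each group.

```python
import torch

def window_partition(x, ws):
    # (B, H, W, C) -> (B * H/ws * W/ws, ws*ws, C): contiguous local windows,
    # so attention inside each group has a local receptive field.
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def grid_partition(x, gs):
    # (B, H, W, C) -> (B * H/gs * W/gs, gs*gs, C): each group takes one token
    # from every grid cell, so attention inside it spans the whole image.
    B, H, W, C = x.shape
    x = x.view(B, gs, H // gs, gs, W // gs, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, gs * gs, C)

x = torch.randn(2, 16, 16, 64)
local_groups = window_partition(x, ws=4)    # (32, 16, 64)
global_groups = grid_partition(x, gs=4)     # (32, 16, 64)
```

Applying attention within each group keeps the cost linear in the number of groups while still mixing information both locally and globally.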
- High-resolution power equipment recognition based on improved self-attention [11.24310344443672]
This paper introduces a novel improvement on deep self-attention networks tailored to high-resolution power equipment recognition.
The proposed model comprises four key components: a foundational network, a region proposal network, a module for extracting and segmenting target areas, and a final prediction network.
The deep self-attention network's prediction mechanism uniquely incorporates the semantic context of images, resulting in substantially improved recognition performance (a pipeline skeleton is sketched after this entry).
arXiv Detail & Related papers (2023-11-06T20:51:37Z)
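The four named components form a two-stage, detection-style pipeline. The skeleton below is a hypothetical sketch of how those components compose; all class and argument names are assumptions, since the summary does not specify the interfaces.

```python
import torch.nn as nn

class PowerEquipmentRecognizer(nn.Module):
    # Hypothetical composition of the four components named in the summary.
    def __init__(self, backbone, rpn, roi_module, predictor):
        super().__init__()
        self.backbone = backbone        # foundational feature network
        self.rpn = rpn                  # region proposal network
        self.roi_module = roi_module    # extracts and segments target areas
        self.predictor = predictor      # final prediction network

    def forward(self, images):
        features = self.backbone(images)              # dense image features
        proposals = self.rpn(features)                # candidate regions
        regions = self.roi_module(features, proposals)
        return self.predictor(regions)                # recognition outputs
```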
- Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers [41.78970081787674]
We propose a more efficient two-stage framework for high-resolution image generation.
We employ a local attention-based quantization model instead of the global attention mechanism used in previous methods.
This approach results in faster generation speed, higher generation fidelity, and improved resolution (a minimal quantization sketch follows this entry).
arXiv Detail & Related papers (2023-10-09T04:38:52Z)
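For context, the first stage of a VQGAN-style framework quantizes encoder features against a learned codebook. Below is a minimal, assumed sketch of that quantization step (not Efficient-VQGAN's local-attention variant); the straight-through trick is standard in this family of models.

```python
import torch

def vector_quantize(z, codebook):
    # z: (B, N, C) encoder features; codebook: (K, C) learned entries.
    # Snap each feature to its nearest codebook entry.
    d = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))  # (B, N, K)
    idx = d.argmin(dim=-1)                  # (B, N) discrete token ids
    z_q = codebook[idx]                     # (B, N, C) quantized features
    # Straight-through estimator: copy gradients from z_q back to z.
    return z + (z_q - z).detach(), idx
```

The discrete ids `idx` are what the second-stage transformer then models autoregressively.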
- Image Deblurring by Exploring In-depth Properties of Transformer [86.7039249037193]
We leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing the performance measured by quantitative metrics.
By comparing transformer features of the recovered image with those of the target, the pretrained transformer provides high-resolution, blur-sensitive semantic information.
One variant regards the features as vectors and computes the discrepancy between the representations of the recovered and target images in Euclidean space (sketched after this entry).
arXiv Detail & Related papers (2023-03-24T14:14:25Z)
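A minimal sketch of that Euclidean-space variant, assuming `vit` is any frozen, pretrained ViT feature extractor returning token features (the paper's exact feature choice and weighting are not specified here):

```python
import torch

def vit_feature_loss(vit, restored, target):
    # Compare deep ViT features of the recovered and target images in
    # Euclidean space; gradients flow only through the restored image.
    with torch.no_grad():
        feat_target = vit(target)            # (B, N, C) token features
    feat_restored = vit(restored)
    return torch.mean((feat_restored - feat_target) ** 2)
```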
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit dedicated to handling the variation in the optimal number of tokens each position should attend to (a simplified sketch follows this entry).
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
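A simplified sketch of per-query sparse attention. DynaST learns how many tokens each position should keep; here a fixed quantile threshold stands in for that learned, dynamic budget, so this is an assumption-laden approximation rather than the paper's unit:

```python
import torch

def sparse_attention(q, k, v, keep=0.1):
    # q, k, v: (B, N, C). Each query attends only to the keys whose scores
    # fall in its top `keep` fraction; the rest are masked out.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5    # (B, N, N)
    thresh = torch.quantile(scores, 1.0 - keep, dim=-1, keepdim=True)
    scores = scores.masked_fill(scores < thresh, float('-inf'))
    return scores.softmax(dim=-1) @ v                       # (B, N, C)
```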
- MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large-hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z)
- Blending Anti-Aliasing into Vision Transformer [57.88274087198552]
The discontinuous patch-wise tokenization process implicitly introduces jagged artifacts into attention maps.
Aliasing occurs when discrete patterns are used to represent high-frequency or continuous information, resulting in indistinguishable distortions.
We propose a plug-and-play Aliasing-Reduction Module (ARM) to alleviate the aforementioned issue (a low-pass filtering sketch follows this entry).
arXiv Detail & Related papers (2021-10-28T14:30:02Z)
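The classic remedy for aliasing is to low-pass filter before sampling. The sketch below applies a depthwise binomial blur ahead of patch tokenization; it is a generic illustration of the principle, assumed for clarity, not the paper's ARM design:

```python
import torch
import torch.nn.functional as F

def antialiased_tokenize(x, patch=16):
    # x: (B, C, H, W). Low-pass filter (depthwise 3x3 binomial blur) before
    # the discontinuous patch-wise tokenization to suppress aliasing.
    C = x.size(1)
    blur = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
    kernel = blur.view(1, 1, 3, 3).repeat(C, 1, 1, 1)       # (C, 1, 3, 3)
    x = F.conv2d(x, kernel, padding=1, groups=C)            # depthwise blur
    # Flatten non-overlapping patches into tokens: (B, num_patches, C*patch*patch).
    return F.unfold(x, kernel_size=patch, stride=patch).transpose(1, 2)
```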
- Grid Partitioned Attention: Efficient Transformer Approximation with Inductive Bias for High Resolution Detail Generation [3.4373727078460665]
We present Grid Partitioned Attention (GPA), a new approximate attention algorithm.
Our paper introduces the new attention layer, analyzes its complexity, and shows how the trade-off between memory usage and model power can be tuned.
Our contributions are (i) the algorithm and code of the novel GPA layer, (ii) a novel deep attention-copying architecture, and (iii) new state-of-the-art experimental results on human pose morphing generation benchmarks.
arXiv Detail & Related papers (2021-07-08T10:37:23Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences (one instance of this equivalence is demonstrated after this entry).
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
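As a concrete instance of that equivalence (an illustration, not the paper's derivation): a fully-connected layer applied independently to each patch token computes exactly the same function as a 1x1 convolution over the patch grid.

```python
import torch
import torch.nn as nn

B, C, H, W = 2, 64, 14, 14
x = torch.randn(B, C, H, W)

fc = nn.Linear(C, C, bias=False)
conv = nn.Conv2d(C, C, kernel_size=1, bias=False)
conv.weight.data.copy_(fc.weight.data.view(C, C, 1, 1))   # share weights

tokens = x.flatten(2).transpose(1, 2)                     # (B, H*W, C)
out_fc = fc(tokens).transpose(1, 2).view(B, C, H, W)
assert torch.allclose(out_fc, conv(x), atol=1e-5)         # identical outputs
```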
- Attention-based Image Upsampling [14.676228848773157]
We show how attention mechanisms can be used to replace another canonical operation: strided transposed convolution.
We show that attention-based upsampling consistently outperforms traditional upsampling methods (a minimal sketch follows this entry).
arXiv Detail & Related papers (2020-12-17T19:58:10Z)
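A minimal sketch of the general idea, assumed for illustration (the paper's exact query construction may differ): queries at the target resolution attend over the low-resolution feature map, producing an upsampled output in place of a strided transposed convolution.

```python
import torch
import torch.nn.functional as F

def attention_upsample(x, scale=2):
    # x: (B, C, H, W) -> (B, C, scale*H, scale*W) via attention instead of
    # a strided transposed convolution.
    B, C, H, W = x.shape
    q = F.interpolate(x, scale_factor=scale, mode='nearest')  # coarse queries
    q = q.flatten(2).transpose(1, 2)                  # (B, (sH)(sW), C)
    kv = x.flatten(2).transpose(1, 2)                 # (B, HW, C)
    attn = (q @ kv.transpose(1, 2) / C ** 0.5).softmax(dim=-1)
    out = attn @ kv                                   # (B, (sH)(sW), C)
    return out.transpose(1, 2).view(B, C, scale * H, scale * W)
```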