WITT: A Wireless Image Transmission Transformer for Semantic
Communications
- URL: http://arxiv.org/abs/2211.00937v1
- Date: Wed, 2 Nov 2022 07:50:27 GMT
- Title: WITT: A Wireless Image Transmission Transformer for Semantic
Communications
- Authors: Ke Yang, Sixian Wang, Jincheng Dai, Kailin Tan, Kai Niu, Ping Zhang
- Abstract summary: We redesign the vision Transformer (ViT) as a new backbone to realize the wireless image transmission transformer (WITT).
WITT is highly optimized for image transmission while accounting for the effect of the wireless channel.
Our experiments verify that WITT attains better performance across different image resolutions, distortion metrics, and channel conditions.
- Score: 11.480385893433802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we aim to redesign the vision Transformer (ViT) as a new
backbone to realize semantic image transmission, termed wireless image
transmission transformer (WITT). Previous works build upon convolutional neural
networks (CNNs), which are inefficient in capturing global dependencies,
resulting in degraded end-to-end transmission performance especially for
high-resolution images. To tackle this, the proposed WITT employs Swin
Transformers as a more capable backbone to extract long-range information.
Different from ViTs in image classification tasks, WITT is highly optimized for
image transmission while considering the effect of the wireless channel.
Specifically, we propose a spatial modulation module to scale the latent
representations according to channel state information, which enhances the
ability of a single model to deal with various channel conditions. As a result,
extensive experiments verify that our WITT attains better performance for
different image resolutions, distortion metrics, and channel conditions. The
code is available at https://github.com/KeYang8/WITT.
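As a rough illustration of the channel-aware design described in the abstract, the sketch below shows a hypothetical spatial modulation layer that rescales latent features by factors predicted from the channel SNR, followed by a simple AWGN channel. Module names, shapes, and the exact conditioning scheme are assumptions for illustration only, not the authors' implementation; see the linked repository for the actual code.

```python
# Minimal sketch (assumption, not the authors' code): a channel-state-aware
# "spatial modulation" layer that rescales latent features using the SNR,
# followed by a toy AWGN channel. Names and shapes are illustrative only.
import torch
import torch.nn as nn


class SpatialModulation(nn.Module):
    """Scales each latent channel by a factor predicted from the channel SNR."""

    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, channels), nn.Sigmoid(),  # per-channel scaling factors
        )

    def forward(self, latent: torch.Tensor, snr_db: torch.Tensor) -> torch.Tensor:
        # latent: (B, C, H, W); snr_db: (B, 1)
        scale = self.mlp(snr_db).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return latent * scale


def awgn_channel(x: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Adds white Gaussian noise to the transmitted signal at the given SNR."""
    signal_power = x.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    return x + torch.randn_like(x) * noise_power.sqrt()


if __name__ == "__main__":
    latent = torch.randn(2, 32, 16, 16)      # toy latent from some encoder backbone
    snr_db = torch.tensor([[5.0], [15.0]])   # per-sample channel SNR in dB
    mod = SpatialModulation(channels=32)
    tx = mod(latent, snr_db)                 # SNR-conditioned latent representation
    rx = awgn_channel(tx, snr_db=10.0)       # pass through a noisy channel
    print(rx.shape)
```

In this sketch the SNR conditioning lets one set of weights adapt its latent representation to different channel conditions, which is the stated purpose of the spatial modulation module; in the paper the encoder and decoder would be Swin Transformer blocks rather than the generic backbone assumed here.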
Related papers
- Vision Transformer Based Semantic Communications for Next Generation Wireless Networks [3.8095664680229935]
This paper presents a Vision Transformer (ViT)-based semantic communication framework.
By employing ViT as the encoder-decoder framework, the proposed architecture can proficiently encode images into high-level semantic content.
The architecture based on the proposed ViT network achieves a Peak Signal-to-Noise Ratio (PSNR) of 38 dB.
arXiv Detail & Related papers (2025-03-21T16:23:02Z) - Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets [11.95214938154427]
Vision Transformer (ViT) captures global information by dividing images into patches.
ViT lacks inductive bias when trained on image or video datasets.
We present a lightweight Depth-Wise Convolution module as a shortcut in ViT models.
arXiv Detail & Related papers (2024-07-28T04:23:40Z) - Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words [7.210982964205077]
Vision Transformer (ViT) has emerged as a powerful architecture in modern computer vision.
However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges.
We propose a modification to the ViT architecture that enhances reasoning across the input channels.
arXiv Detail & Related papers (2023-09-28T02:20:59Z) - Patch Is Not All You Need [57.290256181083016]
We propose a novel Pattern Transformer to adaptively convert images to pattern sequences for Transformer input.
We employ the Convolutional Neural Network to extract various patterns from the input image.
We have accomplished state-of-the-art performance on CIFAR-10 and CIFAR-100, and have achieved competitive results on ImageNet.
arXiv Detail & Related papers (2023-08-21T13:54:00Z) - Explicitly Increasing Input Information Density for Vision Transformers
on Small Datasets [26.257612622358614]
Vision Transformers have attracted much attention since the successful application of the Vision Transformer (ViT) to vision tasks.
This paper proposes to explicitly increase the input information density in the frequency domain.
Experiments demonstrate the effectiveness of the proposed approach on five small-scale datasets.
arXiv Detail & Related papers (2022-10-25T20:24:53Z) - Cross-receptive Focused Inference Network for Lightweight Image
Super-Resolution [64.25751738088015]
Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks.
However, the need to incorporate contextual information to extract features dynamically is neglected by existing Transformer-based methods.
We propose a lightweight Cross-receptive Focused Inference Network (CFIN) that consists of a cascade of CT Blocks mixed with CNN and Transformer.
arXiv Detail & Related papers (2022-07-06T16:32:29Z) - Wireless Deep Video Semantic Transmission [14.071114007641313]
We propose a new class of high-efficiency deep joint source-channel coding methods to achieve end-to-end video transmission over wireless channels.
Our framework is collected under the name deep video semantic transmission (DVST).
arXiv Detail & Related papers (2022-05-26T03:26:43Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - Transformer-Based Deep Image Matching for Generalizable Person
Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z) - Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is an Unet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z) - Multiscale Vision Transformers [79.76412415996892]
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models.
We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks.
arXiv Detail & Related papers (2021-04-22T17:59:45Z) - CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)