Language-Oriented Semantic Latent Representation for Image Transmission
- URL: http://arxiv.org/abs/2405.09976v1
- Date: Thu, 16 May 2024 10:41:31 GMT
- Title: Language-Oriented Semantic Latent Representation for Image Transmission
- Authors: Giordano Cicchetti, Eleonora Grassucci, Jihong Park, Jinho Choi, Sergio Barbarossa, Danilo Comminiello,
- Abstract summary: New paradigm of semantic communication (SC) focuses on delivering meanings behind bits.
Recent advances in data-to-text models facilitate language-oriented SC.
We propose a novel SC framework that communicates both text and a compressed image embedding.
- Score: 38.62941652189033
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the new paradigm of semantic communication (SC), the focus is on delivering meanings behind bits by extracting semantic information from raw data. Recent advances in data-to-text models facilitate language-oriented SC, particularly for text-transformed image communication via image-to-text (I2T) encoding and text-to-image (T2I) decoding. However, although semantically aligned, the text is too coarse to precisely capture sophisticated visual features such as spatial locations, color, and texture, incurring a significant perceptual difference between intended and reconstructed images. To address this limitation, in this paper, we propose a novel language-oriented SC framework that communicates both text and a compressed image embedding and combines them using a latent diffusion model to reconstruct the intended image. Experimental results validate the potential of our approach, which transmits only 2.09\% of the original image size while achieving higher perceptual similarities in noisy communication channels compared to a baseline SC method that communicates only through text.The code is available at https://github.com/ispamm/Img2Img-SC/ .
Related papers
- Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation [27.95875467352853]
We propose a new referring remote sensing image segmentation method, FIANet, that fully exploits the visual and linguistic representations.
The proposed fine-grained image-text alignment module (FIAM) would simultaneously leverage the features of the input image and the corresponding texts.
We evaluate the effectiveness of the proposed methods on two public referring remote sensing datasets including RefSegRS and RRSIS-D.
arXiv Detail & Related papers (2024-09-20T16:45:32Z) - Perceptual Image Compression with Cooperative Cross-Modal Side
Information [53.356714177243745]
We propose a novel deep image compression method with text-guided side information to achieve a better rate-perception-distortion tradeoff.
Specifically, we employ the CLIP text encoder and an effective Semantic-Spatial Aware block to fuse the text and image features.
arXiv Detail & Related papers (2023-11-23T08:31:11Z) - Towards Better Text-Image Consistency in Text-to-Image Generation [15.735515302139335]
We develop a novel CLIP-based metric termed as Semantic Similarity Distance (SSD)
We further design the Parallel Deep Fusion Generative Adversarial Networks (PDF-GAN), which can fuse semantic information at different granularities.
Our PDF-GAN can lead to significantly better text-image consistency while maintaining decent image quality on the CUB and COCO datasets.
arXiv Detail & Related papers (2022-10-27T07:47:47Z) - Towards Semantic Communications: Deep Learning-Based Image Semantic
Coding [42.453963827153856]
We conceive the semantic communications for image data that is much more richer in semantics and bandwidth sensitive.
We propose an reinforcement learning based adaptive semantic coding (RL-ASC) approach that encodes images beyond pixel level.
Experimental results demonstrate that the proposed RL-ASC is noise robust and could reconstruct visually pleasant and semantic consistent image.
arXiv Detail & Related papers (2022-08-08T12:29:55Z) - Photorealistic Text-to-Image Diffusion Models with Deep Language
Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image framework (CRIS)
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions.
We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z) - DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler but more efficient to synthesize realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.