Bridging the Gap between Synthetic and Authentic Images for Multimodal Machine Translation
- URL: http://arxiv.org/abs/2310.13361v1
- Date: Fri, 20 Oct 2023 09:06:30 GMT
- Title: Bridging the Gap between Synthetic and Authentic Images for Multimodal Machine Translation
- Authors: Wenyu Guo, Qingkai Fang, Dong Yu, Yang Feng
- Abstract summary: Multimodal machine translation (MMT) simultaneously takes the source sentence and a relevant image as input for translation.
Recent studies suggest utilizing powerful text-to-image generation models to provide image inputs.
However, synthetic images generated by these models often follow distributions that differ from those of authentic images.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal machine translation (MMT) simultaneously takes the source sentence
and a relevant image as input for translation. Since there is no paired image
available for the input sentence in most cases, recent studies suggest
utilizing powerful text-to-image generation models to provide image inputs.
Nevertheless, synthetic images generated by these models often follow
distributions that differ from those of authentic images. Consequently, using
authentic images for training and synthetic images for inference can introduce
a distribution shift, resulting in degraded performance at inference time. To
tackle this challenge, we feed both synthetic and authentic images to the MMT
model. We then minimize the gap between the two by drawing together the input
image representations at the Transformer Encoder and the output distributions
at the Transformer Decoder. In this way, we mitigate the distribution disparity
introduced by the synthetic images and free the inference process from any
dependence on authentic images. Experimental results show that our approach
achieves state-of-the-art performance on the Multi30K En-De and En-Fr datasets
while remaining independent of authentic images during inference.
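The abstract describes two alignment objectives between the synthetic and authentic branches: one drawing together the encoder-side image representations, the other the decoder-side output distributions. Below is a minimal PyTorch sketch of such a pair of consistency losses. The choice of MSE for representations, KL divergence for distributions, the stop-gradient on the authentic branch, and all names are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def consistency_losses(enc_img_syn, enc_img_auth, dec_logits_syn, dec_logits_auth):
    """Draw the synthetic-image branch toward the authentic-image branch.

    enc_img_*:    (batch, patches, dim) image representations fed to the encoder
    dec_logits_*: (batch, tgt_len, vocab) decoder output logits
    Names, shapes, and loss choices are hypothetical.
    """
    # Encoder side: pull synthetic image representations toward authentic ones;
    # the authentic branch is detached so it acts as a fixed target.
    repr_loss = F.mse_loss(enc_img_syn, enc_img_auth.detach())

    # Decoder side: match the synthetic branch's per-token output distribution
    # to the authentic branch's distribution.
    kl_loss = F.kl_div(
        F.log_softmax(dec_logits_syn, dim=-1),
        F.softmax(dec_logits_auth.detach(), dim=-1),
        reduction="batchmean",
    )
    return repr_loss, kl_loss

# Toy shapes: in training, these terms would be added (weighted) to the usual
# translation cross-entropy computed on the synthetic branch.
r, k = consistency_losses(
    torch.randn(2, 16, 512), torch.randn(2, 16, 512),
    torch.randn(2, 7, 1000), torch.randn(2, 7, 1000),
)
```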
Related papers
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model [101.65105730838346]
We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data.
We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data.
Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens.
arXiv Detail & Related papers (2024-08-20T17:48:20Z)
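As a rough picture of how one model can be trained over discrete and continuous data, the sketch below combines next-token cross-entropy on text tokens with a diffusion-style noise-regression loss on continuous image latents through a shared backbone. The architecture, the plain MSE noise objective, and all names are hypothetical simplifications, not the Transfusion recipe itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMixedModel(nn.Module):
    """Hypothetical shared backbone with a text head and a noise-prediction head."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.img_in = nn.Linear(16, dim)          # continuous image-latent patches
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2
        )
        self.lm_head = nn.Linear(dim, vocab)      # discrete: next-token prediction
        self.eps_head = nn.Linear(dim, 16)        # continuous: predict added noise

    def forward(self, tokens, noisy_latents):
        h = self.backbone(torch.cat(
            [self.embed(tokens), self.img_in(noisy_latents)], dim=1))
        n_txt = tokens.size(1)
        return self.lm_head(h[:, :n_txt]), self.eps_head(h[:, n_txt:])

model = ToyMixedModel()
tokens = torch.randint(0, 1000, (2, 8))
latents = torch.randn(2, 4, 16)
noise = torch.randn_like(latents)
logits, eps_pred = model(tokens, latents + noise)     # "noised" latents
# Language-modeling loss on shifted tokens plus diffusion-style noise regression.
lm_loss = F.cross_entropy(logits[:, :-1].reshape(-1, 1000), tokens[:, 1:].reshape(-1))
loss = lm_loss + F.mse_loss(eps_pred, noise)
```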
- Image Inpainting via Tractable Steering of Diffusion Models [54.13818673257381]
This paper proposes to exploit the ability of Tractable Probabilistic Models (TPMs) to exactly and efficiently compute the constrained posterior.
Specifically, this paper adopts a class of expressive TPMs termed Probabilistic Circuits (PCs).
We show that our approach can consistently improve the overall quality and semantic coherence of inpainted images with only 10% additional computational overhead.
arXiv Detail & Related papers (2023-11-28T21:14:02Z)
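The tractability claim can be illustrated with the simplest probabilistic circuit, a mixture of independent Bernoullis, for which the posterior over masked pixels given observed ones is exact and cheap to compute. The toy model below is a heavily simplified stand-in for the expressive PCs the paper actually uses; all numbers are made up for illustration.

```python
import numpy as np

# Toy tractable model: mixture of independent Bernoullis over 6 binary pixels.
rng = np.random.default_rng(0)
K, D = 3, 6
weights = np.array([0.5, 0.3, 0.2])
theta = rng.uniform(0.1, 0.9, size=(K, D))     # per-component pixel probabilities

x_obs = np.array([1, 0, 1])                    # observed pixels (first three)
obs_idx, mask_idx = np.arange(3), np.arange(3, 6)

# Exact component posterior p(k | x_obs) via Bayes' rule.
lik = np.prod(theta[:, obs_idx] ** x_obs * (1 - theta[:, obs_idx]) ** (1 - x_obs),
              axis=1)
post_k = weights * lik
post_k /= post_k.sum()

# Exact marginal p(x_m = 1 | x_obs) for each masked pixel: a mixture average.
p_masked = post_k @ theta[:, mask_idx]
print("posterior probability each masked pixel is 1:", p_masked)
```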
- Real-World Image Variation by Aligning Diffusion Inversion Chain [53.772004619296794]
A domain gap exists between generated images and real-world images, which poses a challenge in generating high-quality variations of real-world images.
We propose a novel inference pipeline called Real-world Image Variation by ALignment (RIVAL).
Our pipeline enhances the generation quality of image variations by aligning the image generation process to the source image's inversion chain.
arXiv Detail & Related papers (2023-05-30T04:09:47Z)
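A crude sketch of the idea: record the source image's latents along a deterministic inversion chain, then pull each denoising step of a fresh sample toward the corresponding recorded latent. The stand-in noise predictor and the plain latent interpolation below are illustrative assumptions, not the paper's actual alignment mechanism.

```python
import torch

def ddim_invert(x, steps, eps_model):
    """Record the source image's inversion chain: one latent per noise level."""
    chain = [x]
    for t in range(steps):
        x = x + eps_model(x, t)                       # toy deterministic noising
        chain.append(x)
    return chain

def generate_aligned(chain, eps_model, strength=0.5):
    """Denoise a fresh sample while pulling each step toward the source chain."""
    x = torch.randn_like(chain[-1])
    for t in reversed(range(len(chain) - 1)):
        x = x - eps_model(x, t)                       # toy denoising step
        x = (1 - strength) * x + strength * chain[t]  # alignment to the chain
    return x

eps_model = lambda x, t: 0.1 * torch.tanh(x)          # stand-in noise predictor
src = torch.randn(1, 3, 8, 8)                         # toy "source image"
variation = generate_aligned(ddim_invert(src, 10, eps_model), eps_model)
```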
- Lossy Image Compression with Conditional Diffusion Models [25.158390422252097]
This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models.
In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model.
Our approach yields stronger reported FID scores than the GAN-based model, while remaining competitive with VAE-based models on several distortion metrics.
arXiv Detail & Related papers (2022-09-14T21:53:27Z)
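The core idea, a decoder that is a conditional diffusion model rather than a deterministic network, can be sketched as follows: an encoder produces a quantized latent, and a denoiser learns to predict noise conditioned on that latent. The toy transforms, the straight-through rounding quantizer, and the simple noising schedule below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(64, 8)                          # toy analysis transform -> latent
dec = nn.Sequential(nn.Linear(64 + 8 + 1, 128), nn.ReLU(), nn.Linear(128, 64))

x = torch.randn(32, 64)                         # flattened toy "images"
z = enc(x)
z_hat = z + (torch.round(z) - z).detach()       # straight-through rounding quantizer

t = torch.rand(32, 1)                           # diffusion time in [0, 1]
noise = torch.randn_like(x)
x_noisy = torch.sqrt(1 - t) * x + torch.sqrt(t) * noise   # simple noising schedule

# Conditional denoiser: predicts the noise given (noisy image, latent, time).
eps_pred = dec(torch.cat([x_noisy, z_hat, t], dim=-1))
loss = F.mse_loss(eps_pred, noise)              # a real codec would add a rate
                                                # term for transmitting z_hat
```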
- Improved Masked Image Generation with Token-Critic [16.749458173904934]
We introduce Token-Critic, an auxiliary model to guide the sampling of a non-autoregressive generative transformer.
With Token-Critic guidance, a state-of-the-art generative transformer improves significantly, outperforming recent diffusion models and GANs on the trade-off between generated image quality and diversity.
arXiv Detail & Related papers (2022-09-09T17:57:21Z)
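The guidance loop can be sketched as iterative re-masking: the generator fills in all masked tokens, the critic scores each token's plausibility, and the least plausible tokens are re-masked for the next round. The stand-in generator and critic and the linear re-masking schedule below are assumptions, not the paper's exact sampler.

```python
import torch

VOCAB, MASK = 1000, 0

def generator_fill(tokens):
    """Stand-in for the generative transformer: sample every masked position."""
    sampled = torch.randint(1, VOCAB, tokens.shape)
    return torch.where(tokens == MASK, sampled, tokens)

def critic_score(tokens):
    """Stand-in for Token-Critic: per-token 'realness' score in [0, 1]."""
    return torch.rand(tokens.shape)

def sample(seq_len=16, steps=4):
    tokens = torch.full((1, seq_len), MASK)
    for step in range(steps, 0, -1):
        tokens = generator_fill(tokens)
        if step == 1:
            break                               # final round: keep everything
        scores = critic_score(tokens)
        n_remask = seq_len * (step - 1) // steps
        # Re-mask the n_remask least plausible tokens for another attempt.
        worst = scores.topk(n_remask, largest=False).indices
        tokens[0, worst[0]] = MASK
    return tokens

print(sample())
```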
- Unsupervised Medical Image Translation with Adversarial Diffusion Models [0.2770822269241974]
Imputation of missing images via source-to-target modality translation can improve diversity in medical imaging protocols.
Here, we propose a novel method based on adversarial diffusion modeling, SynDiff, for improved performance in medical image translation.
arXiv Detail & Related papers (2022-07-17T15:53:24Z)
- Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto standard of Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
- DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder [73.1010640692609]
We propose a VQ-VAE architecture model with a diffusion decoder (DiVAE) to work as the reconstructing component in image synthesis.
Our model achieves state-of-the-art results and, in particular, generates more photorealistic images.
arXiv Detail & Related papers (2022-06-01T10:39:12Z)
- Paired Image-to-Image Translation Quality Assessment Using Multi-Method Fusion [0.0]
This paper proposes a novel approach that combines signals of image quality between paired source and transformation to predict the latter's similarity with a hypothetical ground truth.
We trained a Multi-Method Fusion (MMF) model via an ensemble of gradient-boosted regressors to predict Deep Image Structure and Texture Similarity (DISTS).
Analysis revealed the task to be feature-constrained, introducing a trade-off at inference between metric time and prediction accuracy.
arXiv Detail & Related papers (2022-05-09T11:05:15Z)
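As a rough illustration of multi-method fusion, the sketch below fits a gradient-boosted regressor on a feature vector of per-pair quality signals to predict a target similarity score. The random features standing in for real metrics (e.g., PSNR- or SSIM-like signals), the synthetic target, and the single-regressor setup are assumptions; the paper trains an ensemble of such regressors to predict DISTS.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins: each row holds several quality signals computed between a source
# image and its translation; the target is a DISTS-like similarity score.
X = rng.normal(size=(500, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=500)   # synthetic proxy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
mmf = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
mmf.fit(X_tr, y_tr)
print("R^2 on held-out pairs:", mmf.score(X_te, y_te))
```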