Gumbel-Attention for Multi-modal Machine Translation
- URL: http://arxiv.org/abs/2103.08862v1
- Date: Tue, 16 Mar 2021 05:44:01 GMT
- Title: Gumbel-Attention for Multi-modal Machine Translation
- Authors: Pengbo Liu, Hailong Cao, Tiejun Zhao
- Abstract summary: Multi-modal machine translation (MMT) improves translation quality by introducing visual information.
However, existing MMT models ignore the problem that images may bring information irrelevant to the text, introducing considerable noise into the model and degrading translation quality.
We propose a novel Gumbel-Attention for multi-modal machine translation, which selects the text-related parts of the image features.
- Score: 18.4381138617661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal machine translation (MMT) improves translation quality by
introducing visual information. However, existing MMT models ignore the problem
that an image may bring information irrelevant to the text, introducing
considerable noise into the model and degrading translation quality. In this
paper, we propose a novel Gumbel-Attention for multi-modal machine translation,
which selects the text-related parts of the image features. Specifically,
unlike previous attention-based methods, we first use a differentiable method
to select the image information and automatically remove the useless parts of
the image features. The image-aware text representation is generated from the
Gumbel-Attention score matrix and the image features. We then independently
encode the text representation and the image-aware text representation with the
multi-modal encoder. Finally, the final output of the encoder is obtained
through multi-modal gated fusion. Experiments and case analysis prove that our
method retains the image features related to the text, and that the retained
parts help the MMT model generate better translations.
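Below is a minimal sketch of how the mechanism described in the abstract could look in PyTorch. It assumes a straight-through Gumbel-Softmax over text-to-image attention scores for the differentiable selection and a sigmoid gate for the multi-modal fusion; module names, dimensions, and the exact selection granularity are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelAttention(nn.Module):
    """Select text-relevant image regions with a differentiable hard choice (sketch)."""

    def __init__(self, d_model: int, tau: float = 1.0):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # text tokens  -> queries
        self.k_proj = nn.Linear(d_model, d_model)  # image regions -> keys
        self.v_proj = nn.Linear(d_model, d_model)  # image regions -> values
        self.tau = tau  # Gumbel-Softmax temperature

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text:  (batch, text_len, d_model); image: (batch, num_regions, d_model)
        q, k, v = self.q_proj(text), self.k_proj(image), self.v_proj(image)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        # Straight-through Gumbel-Softmax: a hard one-hot choice of region per
        # text token in the forward pass, soft gradients in the backward pass,
        # so irrelevant regions are discarded rather than merely down-weighted.
        weights = F.gumbel_softmax(scores, tau=self.tau, hard=True, dim=-1)
        return weights @ v  # image-aware text representation


class GatedFusion(nn.Module):
    """Fuse the text encoding with the image-aware text encoding (sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, text_repr: torch.Tensor, image_aware: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([text_repr, image_aware], dim=-1)))
        return (1 - g) * text_repr + g * image_aware


# Toy usage: 2 sentences of 10 tokens, 49 image regions (e.g. a 7x7 grid).
attn = GumbelAttention(d_model=512)
fusion = GatedFusion(d_model=512)
text_enc = torch.randn(2, 10, 512)
img_feats = torch.randn(2, 49, 512)
fused = fusion(text_enc, attn(text_enc, img_feats))  # (2, 10, 512)
```

The hard=True selection is what distinguishes this from standard soft attention: regions judged irrelevant receive exactly zero weight in the forward pass, which matches the paper's goal of removing, rather than attenuating, text-unrelated image information.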
Related papers
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing the translation in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of the parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data.
It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context.
Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by a significant BLEU margin on the task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Distill the Image to Nowhere: Inversion Knowledge Distillation for Multimodal Machine Translation [6.845232643246564]
We introduce IKD-MMT, a novel MMT framework to support the image-free inference phase via an inversion knowledge distillation scheme.
A multimodal feature generator, paired with a knowledge distillation module, directly generates the multimodal features from the source text alone.
In experiments, we identify our method as the first image-free approach to comprehensively rival or even surpass (almost) all image-must frameworks.
arXiv Detail & Related papers (2022-10-10T07:36:59Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
- MCMI: Multi-Cycle Image Translation with Mutual Information Constraints [40.556049046897115]
We present a mutual information-based framework for unsupervised image-to-image translation.
Our MCMI approach treats single-cycle image translation models as modules that can be used recurrently in a multi-cycle translation setting.
We show that models trained with MCMI produce higher quality images and learn more semantically-relevant mappings.
arXiv Detail & Related papers (2020-07-06T17:50:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.