Distill the Image to Nowhere: Inversion Knowledge Distillation for
Multimodal Machine Translation
- URL: http://arxiv.org/abs/2210.04468v2
- Date: Fri, 21 Apr 2023 09:40:09 GMT
- Title: Distill the Image to Nowhere: Inversion Knowledge Distillation for
Multimodal Machine Translation
- Authors: Ru Peng, Yawen Zeng, Junbo Zhao
- Abstract summary: We introduce IKD-MMT, a novel MMT framework to support the image-free inference phase via an inversion knowledge distillation scheme.
A multimodal feature generator is executed with a knowledge distillation module, which directly generates the multimodal feature from only source texts.
In experiments, we identify our method as the first image-free approach to comprehensively rival or even surpass (almost) all image-must frameworks.
- Score: 6.845232643246564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Past works on multimodal machine translation (MMT) elevate the
bilingual setup by incorporating additional aligned vision information.
However, the image-must requirement of the multimodal dataset largely hinders
MMT's development -- namely, it demands an aligned form of [image, source
text, target text]. This limitation is especially troublesome during the
inference phase, when no aligned image is provided, as in the normal NMT
setup. Thus, in this work, we introduce IKD-MMT, a novel MMT framework that
supports an image-free inference phase via an inversion knowledge distillation
scheme. In particular, a multimodal feature generator is executed with a
knowledge distillation module, which directly generates the multimodal feature
from (only) the source text as input. While a few prior works have entertained
the possibility of supporting image-free inference for machine translation,
their performance has yet to rival image-must translation. In our experiments,
we identify our method as the first image-free approach to comprehensively
rival or even surpass (almost) all image-must frameworks, and we achieve
state-of-the-art results on the widely used Multi30k benchmark. Our code and
data are available at: https://github.com/pengr/IKD-mmt/tree/master.
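To make the inversion knowledge distillation idea concrete, the sketch below shows one minimal way such a scheme could look in PyTorch: a text-only feature generator is trained to imitate the multimodal feature that a frozen vision teacher extracts from the aligned image, so that at inference time the generator alone stands in for the image. The `TextFeatureGenerator` module, the mean pooling, the MSE distillation loss, and all dimensions are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TextFeatureGenerator(nn.Module):
    """Hypothetical student: hallucinates a 'visual' feature from text states."""
    def __init__(self, d_text=512, d_vision=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_text, d_vision), nn.ReLU(), nn.Linear(d_vision, d_vision)
        )

    def forward(self, text_states):                  # (batch, src_len, d_text)
        pooled = text_states.mean(dim=1)             # simple mean pooling (assumption)
        return self.proj(pooled)                     # (batch, d_vision)

def inversion_kd_loss(generated_feat, teacher_image_feat):
    """Distillation term: push the text-derived feature toward the feature a
    frozen vision teacher extracts from the aligned image (training only)."""
    return nn.functional.mse_loss(generated_feat, teacher_image_feat)

if __name__ == "__main__":
    gen = TextFeatureGenerator()
    text_states = torch.randn(4, 20, 512)            # source-text encoder states (stand-in)
    teacher_feat = torch.randn(4, 2048)              # frozen image-teacher feature (stand-in)
    loss = inversion_kd_loss(gen(text_states), teacher_feat)
    loss.backward()
    print(loss.item())
```

In such a setup the distillation term would be added to the usual translation loss during training; at inference the image and the vision teacher are simply dropped and the generator supplies the multimodal feature from the source text alone, matching the image-free setting the abstract describes.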
Related papers
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data.
It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context.
Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z)
- Kosmos-G: Generating Images in Context with Multimodal Large Language Models [117.0259361818715]
Current subject-driven image generation methods require test-time tuning and cannot accept interleaved multi-image and text input.
This paper presents Kosmos-G, a model that leverages the advanced multimodal perception capabilities of Multimodal Large Language Models.
Kosmos-G demonstrates an impressive capability of zero-shot subject-driven generation with interleaved multi-image and text input.
arXiv Detail & Related papers (2023-10-04T17:28:44Z)
- CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation [31.911593690549633]
Multimodal machine translation (MMT) systems enhance neural machine translation (NMT) with visual knowledge.
Previous works face a challenge in training powerful MMT models from scratch due to the scarcity of annotated multilingual vision-language data.
We propose CLIPTrans, which simply adapts the independently pre-trained multimodal M-CLIP and the multilingual mBART.
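CLIPTrans couples two frozen pre-trained models, and one common way to realize such a coupling is a small trainable mapping network that projects the multimodal encoder's embedding into prefix embeddings for the multilingual model. The sketch below shows that general pattern only; the prefix length, dimensions, and MLP form are assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Small trainable bridge from a frozen multimodal encoder (e.g. M-CLIP)
    into the embedding space of a frozen multilingual model (e.g. mBART).
    Purely illustrative: prefix length and dimensions are assumptions."""
    def __init__(self, d_clip=512, d_mbart=1024, prefix_len=10):
        super().__init__()
        self.d_mbart, self.prefix_len = d_mbart, prefix_len
        self.mlp = nn.Sequential(nn.Linear(d_clip, d_mbart * prefix_len), nn.Tanh())

    def forward(self, clip_embedding):                # (batch, d_clip)
        prefix = self.mlp(clip_embedding)             # (batch, prefix_len * d_mbart)
        return prefix.view(-1, self.prefix_len, self.d_mbart)

# Only the mapping network is trained; both pre-trained models stay frozen,
# avoiding the need for a large annotated multilingual vision-language corpus.
mapper = MappingNetwork()
clip_emb = torch.randn(2, 512)                        # stand-in for a frozen M-CLIP output
prefix_tokens = mapper(clip_emb)                      # would be prepended to mBART inputs
print(prefix_tokens.shape)                            # torch.Size([2, 10, 1024])
```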
arXiv Detail & Related papers (2023-08-29T11:29:43Z)
- Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by significant BLEU margins on this task and setup.
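For readers unfamiliar with scene graphs, the toy structure below shows one plausible way to represent SG nodes (concepts with attributes) and relation edges in code; the fields and the example sentence are purely illustrative and not the paper's SG format.

```python
from dataclasses import dataclass, field

@dataclass
class SGNode:
    label: str                                   # object / word concept, e.g. "dog"
    attributes: list = field(default_factory=list)

@dataclass
class SceneGraph:
    nodes: list                                  # list of SGNode
    edges: list                                  # (subject_idx, relation, object_idx) triples

# Toy language-side scene graph for "a brown dog chases a ball":
lang_sg = SceneGraph(
    nodes=[SGNode("dog", ["brown"]), SGNode("ball")],
    edges=[(0, "chases", 1)],
)
print(lang_sg.edges)
```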
arXiv Detail & Related papers (2023-05-20T18:17:20Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation [72.6667341525552]
We present a new MMT approach based on a strong text-only MT model, which uses neural adapters and a novel guided self-attention mechanism.
We also introduce CoMMuTE, a Contrastive Multimodal Translation Evaluation set of ambiguous sentences and their possible translations.
Our approach obtains competitive results compared to strong text-only models on standard English-to-French, English-to-German and English-to-Czech benchmarks.
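As a concrete reference point for grafting small trainable modules onto a strong frozen text-only MT model, the sketch below shows a standard residual bottleneck adapter. It does not reproduce the paper's guided self-attention mechanism, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic residual bottleneck adapter: a small trainable module of the
    kind that can be inserted into a frozen text-only MT model. A standard
    adapter sketch, not the paper's exact design."""
    def __init__(self, d_model=512, d_bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, hidden):                        # (batch, seq, d_model)
        return hidden + self.up(self.act(self.down(hidden)))

adapter = BottleneckAdapter()
h = torch.randn(2, 16, 512)                           # frozen MT layer output (stand-in)
print(adapter(h).shape)                               # torch.Size([2, 16, 512])
```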
arXiv Detail & Related papers (2022-12-20T10:18:18Z)
- Gumbel-Attention for Multi-modal Machine Translation [18.4381138617661]
Multi-modal machine translation (MMT) improves translation quality by introducing visual information.
Existing MMT models overlook the problem that images can introduce information irrelevant to the text, adding noise and degrading translation quality.
We propose a novel Gumbel-Attention mechanism for multi-modal machine translation, which selects the text-related parts of the image features.
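The selection idea can be sketched with PyTorch's built-in Gumbel-Softmax: for each source token a (nearly) hard choice over image region features is drawn, so text-irrelevant regions are dropped rather than softly averaged in. The single-head form, the straight-through `hard=True` setting, and the dimensions below are simplifying assumptions, not the paper's exact formulation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelImageSelection(nn.Module):
    """Rough sketch: per source token, a discretized Gumbel-Softmax choice
    over image region features keeps only text-relevant regions."""
    def __init__(self, d_text=512, d_image=2048, d_attn=512):
        super().__init__()
        self.q = nn.Linear(d_text, d_attn)
        self.k = nn.Linear(d_image, d_attn)
        self.v = nn.Linear(d_image, d_attn)

    def forward(self, text_states, image_regions, tau=1.0):
        # text_states: (batch, src_len, d_text); image_regions: (batch, n_regions, d_image)
        scores = self.q(text_states) @ self.k(image_regions).transpose(1, 2)
        scores = scores / math.sqrt(self.q.out_features)
        # Hard but differentiable (straight-through) selection of one region per token.
        weights = F.gumbel_softmax(scores, tau=tau, hard=True, dim=-1)
        return weights @ self.v(image_regions)        # (batch, src_len, d_attn)

sel = GumbelImageSelection()
out = sel(torch.randn(2, 10, 512), torch.randn(2, 49, 2048))
print(out.shape)                                      # torch.Size([2, 10, 512])
```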
arXiv Detail & Related papers (2021-03-16T05:44:01Z)
- A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
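A minimal picture of such a fusion layer is sketched below: text-token and image-region nodes exchange messages along an adjacency mask and are updated GRU-style, and repeating the step mimics stacking layers. The paper's graph construction and gating details are not reproduced; everything here is an illustrative simplification.

```python
import torch
import torch.nn as nn

class GraphFusionLayer(nn.Module):
    """Minimal message-passing layer over a unified graph whose nodes are both
    text tokens and image regions; `adj` masks which nodes may interact."""
    def __init__(self, d_model=512):
        super().__init__()
        self.msg = nn.Linear(d_model, d_model)
        self.update = nn.GRUCell(d_model, d_model)

    def forward(self, nodes, adj):
        # nodes: (n_nodes, d_model); adj: (n_nodes, n_nodes) 0/1 adjacency matrix
        norm = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1)
        messages = norm @ self.msg(nodes)             # aggregate neighbour messages
        return self.update(messages, nodes)           # GRU-style node update

# Toy graph: 3 text nodes + 2 visual nodes, fully connected across modalities.
nodes = torch.randn(5, 512)
adj = torch.ones(5, 5)
layer = GraphFusionLayer()
for _ in range(3):                                    # iterative fusion steps (weights shared for brevity)
    nodes = layer(nodes, adj)
print(nodes.shape)                                    # torch.Size([5, 512])
```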
arXiv Detail & Related papers (2020-07-17T04:06:09Z)