Tackling Ambiguity with Images: Improved Multimodal Machine Translation
and Contrastive Evaluation
- URL: http://arxiv.org/abs/2212.10140v2
- Date: Fri, 26 May 2023 10:52:39 GMT
- Title: Tackling Ambiguity with Images: Improved Multimodal Machine Translation
and Contrastive Evaluation
- Authors: Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, Rachel Bawden
- Abstract summary: We present a new MMT approach based on a strong text-only MT model, which uses neural adapters and a novel guided self-attention mechanism.
We also introduce CoMMuTE, a Contrastive Multilingual Multimodal Translation Evaluation set of ambiguous sentences and their possible translations.
Our approach obtains competitive results compared to strong text-only models on standard English-to-French, English-to-German and English-to-Czech benchmarks.
- Score: 72.6667341525552
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: One of the major challenges of machine translation (MT) is ambiguity, which
can in some cases be resolved by accompanying context such as images. However,
recent work in multimodal MT (MMT) has shown that obtaining improvements from
images is challenging, limited not only by the difficulty of building effective
cross-modal representations, but also by the lack of specific evaluation and
training data. We present a new MMT approach based on a strong text-only MT
model, which uses neural adapters and a novel guided self-attention mechanism,
and which is jointly trained on both visually-conditioned masking and MMT. We also
introduce CoMMuTE, a Contrastive Multilingual Multimodal Translation Evaluation
set of ambiguous sentences and their possible translations, accompanied by
disambiguating images corresponding to each translation. Our approach obtains
competitive results compared to strong text-only models on standard
English-to-French, English-to-German and English-to-Czech benchmarks and
outperforms baselines and state-of-the-art MMT systems by a large margin on our
contrastive test set. Our code and CoMMuTE are freely available.
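
The contrastive protocol behind CoMMuTE is simple to state: each example pairs an ambiguous source sentence with a disambiguating image and two candidate translations, and the model is credited when it assigns the better score to the translation matching the image. Below is a minimal sketch of that evaluation loop in Python; the CommuteExample fields and the score function are illustrative stand-ins, not the released code.

from dataclasses import dataclass

@dataclass
class CommuteExample:
    source: str       # ambiguous English sentence
    image: str        # path to the disambiguating image
    correct: str      # translation matching the image
    contrastive: str  # translation matching the other sense

def commute_accuracy(examples, score):
    """score(source, image, translation) -> higher is better,
    e.g. a length-normalized log-probability under the MMT model."""
    hits = sum(
        score(ex.source, ex.image, ex.correct)
        > score(ex.source, ex.image, ex.contrastive)
        for ex in examples
    )
    return hits / len(examples)

Since each ambiguous sentence appears with both images, a model that ignores the image prefers the same translation in both cases and scores exactly 50%; anything above that reflects genuine use of visual context.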
Related papers
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z) - TMT: Tri-Modal Translation between Speech, Image, and Text by Processing
Different Modalities as Different Languages [96.8603701943286]
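
ZeroMMT's "mixture of two objectives" pairs visually conditioned masked language modelling with a Kullback-Leibler term that keeps the adapted model's translations close to those of the frozen text-only model, so no fully supervised (source, image, target) triplets are needed. A sketch of such a combined loss, where every method name on model and frozen_mt is hypothetical:

import torch
import torch.nn.functional as F

def zerommt_step(model, frozen_mt, batch, alpha=0.5):
    # Visually conditioned masked LM: predict masked source tokens
    # from the image (illustrative method name).
    vmlm = model.masked_lm_loss(batch["image"], batch["masked_src"])
    # KL term: keep translation behaviour close to the frozen
    # text-only MT model.
    with torch.no_grad():
        ref = frozen_mt.log_probs(batch["src"], batch["tgt"]).exp()
    kl = F.kl_div(
        model.log_probs_with_image(batch["src"], batch["image"], batch["tgt"]),
        ref,
        reduction="batchmean",
    )
    return alpha * vmlm + (1 - alpha) * kl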
- TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages [96.8603701943286]
The Tri-Modal Translation (TMT) model translates between arbitrary modalities spanning speech, image, and text.
We tokenize speech and image data into discrete tokens, which provide a unified interface across modalities.
TMT consistently outperforms single-model counterparts.
arXiv Detail & Related papers (2024-02-25T07:46:57Z) - Revisiting Machine Translation for Cross-lingual Classification [91.43729067874503]
- Revisiting Machine Translation for Cross-lingual Classification [91.43729067874503]
Most research in this area focuses on multilingual models rather than on the machine translation component.
We show that, by using a stronger MT system and mitigating the mismatch between training on original text and running inference on machine translated text, translate-test can do substantially better than previously assumed.
arXiv Detail & Related papers (2023-05-23T16:56:10Z) - Beyond Triplet: Leveraging the Most Data for Multimodal Machine
Translation [53.342921374639346]
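
The translate-test recipe itself fits in a few lines; the paper's finding is that its quality hinges on the strength of the MT system and on reducing the train/inference mismatch, for instance by fine-tuning the classifier on text that has itself passed through the MT system. A schematic sketch with placeholder functions:

def translate_test(mt_translate, en_classifier, texts):
    # Translate each non-English input into English, then classify with a
    # single English-only model. Both callables are stand-ins for any MT
    # system and any English classifier.
    return [en_classifier(mt_translate(text)) for text in texts]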
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access to and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z) - STEMM: Self-learning with Speech-text Manifold Mixup for Speech
Translation [37.51435498386953]
We propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate the representation discrepancy between the speech and text modalities.
Experiments on the MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy.
arXiv Detail & Related papers (2022-03-20T01:49:53Z) - MCMI: Multi-Cycle Image Translation with Mutual Information Constraints [40.556049046897115]
- MCMI: Multi-Cycle Image Translation with Mutual Information Constraints [40.556049046897115]
We present a mutual information-based framework for unsupervised image-to-image translation.
Our MCMI approach treats single-cycle image translation models as modules that can be used recurrently in a multi-cycle translation setting.
We show that models trained with MCMI produce higher-quality images and learn more semantically relevant mappings.
arXiv Detail & Related papers (2020-07-06T17:50:43Z) - Unsupervised Multimodal Neural Machine Translation with Pseudo Visual
Pivoting [105.5303416210736]
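
Treating single-cycle translation models as reusable modules means a multi-cycle run is just repeated composition; the mutual-information constraints are applied across these cycles during training. A bare-bones sketch of the inference-time cycling (the MI terms are omitted, and the function names are placeholders for trained image-to-image models):

def multi_cycle(translate_ab, translate_ba, image_a, n_cycles=2):
    # Reuse trained single-cycle translators recurrently: each cycle
    # maps domain A -> domain B -> back to domain A.
    x = image_a
    for _ in range(n_cycles):
        y = translate_ab(x)
        x = translate_ba(y)
    return x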
- Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting [105.5303416210736]
Unsupervised machine translation (MT) has recently achieved impressive results with monolingual corpora only.
It is still challenging to associate source-target sentences in the latent space.
As people speaking different languages biologically share similar visual systems, achieving better alignment through visual content is a promising direction.
arXiv Detail & Related papers (2020-05-06T20:11:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.