Vision Matters When It Should: Sanity Checking Multimodal Machine
Translation Models
- URL: http://arxiv.org/abs/2109.03415v1
- Date: Wed, 8 Sep 2021 03:32:48 GMT
- Title: Vision Matters When It Should: Sanity Checking Multimodal Machine
Translation Models
- Authors: Jiaoda Li, Duygu Ataman, Rico Sennrich
- Abstract summary: Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available.
Recent studies have also shown that the performance of MMT models is only marginally impacted when the associated image is replaced with an unrelated image or noise.
- Score: 25.920891392933058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal machine translation (MMT) systems have been shown to outperform
their text-only neural machine translation (NMT) counterparts when visual
context is available. However, recent studies have also shown that the
performance of MMT models is only marginally impacted when the associated image
is replaced with an unrelated image or noise, which suggests that the visual
context might not be exploited by the model at all. We hypothesize that this
might be caused by the nature of the commonly used evaluation benchmark, also
known as Multi30K, where the translations of image captions were prepared
without actually showing the images to human translators. In this paper, we
present a qualitative study that examines the role of datasets in stimulating the
use of the visual modality, and we propose methods that highlight the importance of
visual signals in the datasets and demonstrably improve the models' reliance on the
source images. Our findings suggest that research on effective MMT architectures is
currently impaired by the lack of suitable datasets, and that careful consideration
must be given to the creation of future MMT datasets, for which we also provide
useful insights.
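The abstract does not spell out the proposed methods here, but the general spirit of such sanity checks is to make the source text insufficient on its own, so that a model can only produce the correct translation by consulting the image. The snippet below is a minimal, hypothetical sketch of that kind of source degradation; the word list, mask token, and masking probability are illustrative assumptions, not the authors' actual procedure.

```python
import random

# Hypothetical list of visually recoverable words; a real setup would use a
# POS tagger or a curated lexicon rather than this toy set.
COLOR_WORDS = {"red", "blue", "green", "black", "white", "yellow"}

def mask_visual_words(source_tokens, mask_token="[MASK]", p=1.0):
    """Replace visually grounded tokens (here: colors) in the source sentence.

    After masking, a text-only model can no longer recover the color, so a
    correct translation of it has to come from the image.
    """
    masked = []
    for tok in source_tokens:
        if tok.lower() in COLOR_WORDS and random.random() < p:
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked

# "a woman in a red dress" -> "a woman in a [MASK] dress"
print(" ".join(mask_visual_words("a woman in a red dress".split())))
```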
Related papers
- 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset [90.95948101052073]
We introduce 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese.
Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets.
Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets.
arXiv Detail & Related papers (2024-04-29T04:01:30Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
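The summary lists three objectives trained jointly (ITC, IS, RSC). A common way to wire such multi-objective training is a weighted sum of per-objective losses, with the contrastive ITC term taking the standard symmetric InfoNCE form. The PyTorch sketch below illustrates that generic pattern only; the loss weights and the IS/RSC terms are placeholders, not UniDiff's actual implementation.

```python
import torch
import torch.nn.functional as F

def itc_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))         # matching pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def unified_loss(img_emb, txt_emb, loss_is, loss_rsc, w_is=1.0, w_rsc=1.0):
    # The synthesis (IS) and consistency (RSC) terms are treated as black boxes here.
    return itc_loss(img_emb, txt_emb) + w_is * loss_is + w_rsc * loss_rsc
```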
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by significant BLEU score margins on this task and setup.
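For context on what a scene graph carries here: it is a set of entity nodes with attribute labels, connected by relation edges, extracted from the image or parsed from the text. The tiny data structure below is only an illustration of that representation; the field names are assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SGNode:
    label: str                                       # e.g. "dog", "frisbee"
    attributes: list = field(default_factory=list)   # e.g. ["brown"]

@dataclass
class SGEdge:
    subj: int   # index of the subject node
    rel: str    # e.g. "catching"
    obj: int    # index of the object node

@dataclass
class SceneGraph:
    nodes: list
    edges: list

# "a brown dog catching a frisbee" as a minimal language scene graph
sg = SceneGraph(
    nodes=[SGNode("dog", ["brown"]), SGNode("frisbee")],
    edges=[SGEdge(subj=0, rel="catching", obj=1)],
)
```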
arXiv Detail & Related papers (2023-05-20T18:17:20Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
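The pattern described, per-image CNN features followed by a Transformer that mixes information across the current and prior images, can be sketched as below. The layer sizes, pooling, and backbone are stand-ins chosen for brevity and are not the BioViL-T architecture.

```python
import torch
import torch.nn as nn

class HybridMultiImageEncoder(nn.Module):
    """Toy CNN-Transformer hybrid: a CNN encodes each image into patch tokens,
    then a Transformer attends across tokens from all available images."""
    def __init__(self, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        self.cnn = nn.Sequential(                     # stand-in for a ResNet backbone
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, images):                        # images: (batch, n_images, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.cnn(images.view(b * n, c, h, w))        # (b*n, dim, 4, 4)
        tokens = feats.flatten(2).transpose(1, 2)             # (b*n, 16, dim)
        tokens = tokens.reshape(b, n * tokens.size(1), -1)    # concatenate patch tokens
        return self.transformer(tokens)                       # spatial + temporal mixing

enc = HybridMultiImageEncoder()
out = enc(torch.randn(2, 2, 3, 64, 64))               # current image + prior image
```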
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Increasing Visual Awareness in Multimodal Neural Machine Translation from an Information Theoretic Perspective [14.100033405711685]
Multimodal machine translation (MMT) aims to improve translation quality by equipping the source sentence with its corresponding image.
In this paper, we endeavor to improve MMT performance by increasing visual awareness from an information theoretic perspective.
arXiv Detail & Related papers (2022-10-16T08:11:44Z)
- Neural Machine Translation with Phrase-Level Universal Visual Representations [11.13240570688547]
We propose a phrase-level retrieval-based method for MMT to get visual information for the source input from existing sentence-image data sets.
Our method performs retrieval at the phrase level and hence learns visual information from pairs of source phrase and grounded region.
Experiments show that the proposed method significantly outperforms strong baselines on multiple MMT datasets.
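The mechanism described is a lookup: source n-grams are matched against phrases that were grounded to image regions in an existing sentence-image corpus, and the retrieved region features are attached to the input. The sketch below is a simplified exact-match version of that idea; a real system would match learned phrase embeddings against detector-derived region features, and the index contents here are made up for illustration.

```python
from collections import defaultdict

# Toy index: phrase -> list of visual feature vectors from grounded regions.
# In practice these would come from an object detector run over an existing
# image-caption corpus, keyed by the grounded phrase.
phrase_index = defaultdict(list)
phrase_index["red dress"].append([0.1, 0.9, 0.3])
phrase_index["small dog"].append([0.7, 0.2, 0.5])

def retrieve_visual_features(source_tokens, max_phrase_len=3):
    """Return region features for every source n-gram found in the index."""
    hits = []
    for n in range(max_phrase_len, 0, -1):            # prefer longer phrases
        for i in range(len(source_tokens) - n + 1):
            phrase = " ".join(source_tokens[i:i + n])
            if phrase in phrase_index:
                hits.append((phrase, phrase_index[phrase]))
    return hits

print(retrieve_visual_features("a woman in a red dress".split()))
# [('red dress', [[0.1, 0.9, 0.3]])]
```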
arXiv Detail & Related papers (2022-03-19T11:21:13Z)
- On Vision Features in Multimodal Machine Translation [34.41229863267296]
We develop a selective attention model to study the patch-level contribution of an image in multimodal machine translation.
Our results suggest the need of carefully examining MMT models, especially when current benchmarks are small-scale and biased.
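A selective-attention layer of the kind the summary points at can be read as: text states query the image patches, and a learned gate decides how much of the attended visual signal to mix back in. The module below is a generic sketch of that pattern; the gating form and dimensions are assumptions rather than the paper's exact model.

```python
import torch
import torch.nn as nn

class SelectiveVisualAttention(nn.Module):
    """Text tokens attend over image patch features; a sigmoid gate controls
    how much of the attended visual signal is added to the text states."""
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_states, patch_feats):
        vis, _ = self.attn(query=text_states, key=patch_feats, value=patch_feats)
        g = torch.sigmoid(self.gate(torch.cat([text_states, vis], dim=-1)))
        return text_states + g * vis                  # gated fusion

layer = SelectiveVisualAttention()
out = layer(torch.randn(2, 10, 256), torch.randn(2, 49, 256))  # (batch, len, dim)
```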
arXiv Detail & Related papers (2022-03-17T08:51:09Z)
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
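As a rough illustration of what a discourse-phenomenon tagger does, the toy function below flags anaphoric pronouns whose correct translation (e.g. gender or number agreement) typically depends on preceding sentences. It is purely hypothetical and far cruder than the benchmark's taggers, which rely on proper linguistic tooling.

```python
import re

# Toy tagger: flag anaphoric pronouns whose translation usually needs context.
ANAPHORIC = re.compile(r"\b(it|they|them|this|that|one)\b", re.IGNORECASE)

def tag_pronominal_anaphora(sentence):
    return [m.group(0) for m in ANAPHORIC.finditer(sentence)]

print(tag_pronominal_anaphora("I bought it yesterday."))   # ['it']
print(tag_pronominal_anaphora("The printer is broken."))   # []
```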
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
- Efficient Object-Level Visual Context Modeling for Multimodal Machine Translation: Masking Irrelevant Objects Helps Grounding [25.590409802797538]
We propose an object-level visual context modeling framework (OVC) to efficiently capture and explore visual information for multimodal machine translation.
OVC encourages MMT to ground translation on desirable visual objects by masking irrelevant objects in the visual modality.
Experiments on MMT datasets demonstrate that the proposed OVC model outperforms state-of-the-art MMT models.
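The key operation described is masking visual objects that are irrelevant to the source sentence so that grounding concentrates on useful regions. Below is a generic sketch of that idea using cosine similarity between detected-object features and a sentence embedding; the threshold and the similarity measure are assumptions, not OVC's actual objective.

```python
import torch
import torch.nn.functional as F

def mask_irrelevant_objects(object_feats, sent_emb, threshold=0.2):
    """Zero out detected-object features that are dissimilar to the source sentence.

    object_feats: (n_objects, dim) region features from an object detector
    sent_emb:     (dim,) embedding of the source sentence
    """
    sims = F.cosine_similarity(object_feats, sent_emb.unsqueeze(0), dim=-1)  # (n_objects,)
    keep = (sims >= threshold).float().unsqueeze(-1)                         # (n_objects, 1)
    return object_feats * keep, keep.squeeze(-1)

feats, kept = mask_irrelevant_objects(torch.randn(5, 256), torch.randn(256))
```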
arXiv Detail & Related papers (2020-12-18T11:10:00Z)
- Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting [105.5303416210736]
Unsupervised machine translation (MT) has recently achieved impressive results with monolingual corpora only.
It is still challenging to associate source-target sentences in the latent space.
As people who speak different languages biologically share similar visual systems, the potential of achieving better alignment through visual content is promising.
arXiv Detail & Related papers (2020-05-06T20:11:46Z)