Exploring the Necessity of Visual Modality in Multimodal Machine Translation using Authentic Datasets
- URL: http://arxiv.org/abs/2404.06107v1
- Date: Tue, 9 Apr 2024 08:19:10 GMT
- Title: Exploring the Necessity of Visual Modality in Multimodal Machine Translation using Authentic Datasets
- Authors: Zi Long, Zhenhao Tang, Xianghua Fu, Jian Chen, Shilong Hou, Jinze Lyu,
- Abstract summary: We study the impact of the visual modality on translation efficacy by leveraging real-world translation datasets.
We find that the visual modality proves advantageous for the majority of authentic translation datasets.
Our results suggest that visual information serves a supplementary role in multimodal translation and can be substituted.
- Score: 3.54128607634285
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recent research in the field of multimodal machine translation (MMT) has indicated that the visual modality is either dispensable or offers only marginal advantages. However, most of these conclusions are drawn from the analysis of experimental results based on a limited set of bilingual sentence-image pairs, such as Multi30k. In these kinds of datasets, the content of one bilingual parallel sentence pair must be well represented by a manually annotated image, which is different from the real-world translation scenario. In this work, we adhere to the universal multimodal machine translation framework proposed by Tang et al. (2022). This approach allows us to delve into the impact of the visual modality on translation efficacy by leveraging real-world translation datasets. Through a comprehensive exploration via probing tasks, we find that the visual modality proves advantageous for the majority of authentic translation datasets. Notably, the translation performance primarily hinges on the alignment and coherence between textual and visual contents. Furthermore, our results suggest that visual information serves a supplementary role in multimodal translation and can be substituted.
Related papers
- Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal
Retrieval [57.98555925471121]
Cross-lingual cross-modal retrieval has attracted increasing attention.
Most CCR methods construct pseudo-parallel vision-language corpora via Machine Translation.
We propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR.
arXiv Detail & Related papers (2023-09-11T13:44:46Z) - Scene Graph as Pivoting: Inference-time Image-free Unsupervised
Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by significant BLEU scores on the task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z) - Beyond Triplet: Leveraging the Most Data for Multimodal Machine
Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z) - Multimodal Neural Machine Translation with Search Engine Based Image
Retrieval [4.662583832063716]
We propose an open-vocabulary image retrieval method to collect descriptive images for bilingual parallel corpus.
Our proposed method achieves significant improvements over strong baselines.
arXiv Detail & Related papers (2022-07-26T08:42:06Z) - VALHALLA: Visual Hallucination for Machine Translation [64.86515924691899]
We introduce a visual hallucination framework, called VALHALLA.
It requires only source sentences at inference time and instead uses hallucinated visual representations for multimodal machine translation.
In particular, given a source sentence an autoregressive hallucination transformer is used to predict a discrete visual representation from the input text.
arXiv Detail & Related papers (2022-05-31T20:25:15Z) - ViTA: Visual-Linguistic Translation by Aligning Object Tags [7.817598216459955]
Multimodal Machine Translation (MMT) enriches the source text with visual information for translation.
We propose our system for the Multimodal Translation Task of WAT 2021 from English to Hindi.
arXiv Detail & Related papers (2021-06-01T06:19:29Z) - Simultaneous Machine Translation with Visual Context [42.88121241096681]
Simultaneous machine translation (SiMT) aims to translate a continuous input text stream into another language with the lowest latency and highest quality possible.
We analyse the impact of different multimodal approaches and visual features on state-of-the-art SiMT frameworks.
arXiv Detail & Related papers (2020-09-15T18:19:11Z) - Unsupervised Multimodal Neural Machine Translation with Pseudo Visual
Pivoting [105.5303416210736]
Unsupervised machine translation (MT) has recently achieved impressive results with monolingual corpora only.
It is still challenging to associate source-target sentences in the latent space.
As people speak different languages biologically share similar visual systems, the potential of achieving better alignment through visual content is promising.
arXiv Detail & Related papers (2020-05-06T20:11:46Z) - Translation Artifacts in Cross-lingual Transfer Learning [51.66536640084888]
We show that machine translation can introduce subtle artifacts that have a notable impact in existing cross-lingual models.
In natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them.
We also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
arXiv Detail & Related papers (2020-04-09T17:54:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.