LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine
Translation
- URL: http://arxiv.org/abs/2210.15461v1
- Date: Wed, 19 Oct 2022 12:21:39 GMT
- Title: LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine
Translation
- Authors: Hongcheng Guo, Jiaheng Liu, Haoyang Huang, Jian Yang, Zhoujun Li,
Dongdong Zhang, Furu Wei
- Abstract summary: Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features.
Recent approaches still require training a separate model for each language pair, which is costly and unaffordable as the number of languages increases.
We propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages.
- Score: 94.33019040320507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Machine Translation (MMT) focuses on enhancing text-only
translation with visual features, which has attracted considerable attention
from both natural language processing and computer vision communities. Recent
approaches still require training a separate model for each language pair, which
is costly and unaffordable as the number of languages increases in the real
world. In other words, the multilingual multimodal machine translation
(Multilingual MMT) task, which aims to handle these issues by providing a shared
semantic space for multiple languages, has not yet been investigated. Moreover,
the image modality has no language boundaries, which makes it well suited to
bridging the semantic gap between languages. To this end, we first
propose the Multilingual MMT task by establishing two new Multilingual MMT
benchmark datasets covering seven languages. Then, an effective baseline LVP-M3
using visual prompts is proposed to support translations between different
languages, which includes three stages (token encoding, language-aware visual
prompt generation, and language translation). Extensive experimental results on
our constructed benchmark datasets demonstrate the effectiveness of the LVP-M3
method for Multilingual MMT.
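As a rough illustration of the three-stage pipeline named in the abstract (token encoding, language-aware visual prompt generation, and language translation), the following is a minimal PyTorch sketch based only on that description; the module choices, dimensions, and the use of a target-language embedding to condition cross-attention over visual features are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the three-stage pipeline described in the abstract:
# (1) token encoding, (2) language-aware visual prompt generation,
# (3) language translation. Module names and shapes are assumptions.
import torch
import torch.nn as nn


class LVPM3Sketch(nn.Module):
    def __init__(self, vocab_size=32000, num_langs=7, d_model=512):
        super().__init__()
        # Stage 1: token encoding, shared across all languages
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        # Stage 2: language-aware visual prompt generation
        # (visual features are conditioned on the target-language embedding)
        self.lang_emb = nn.Embedding(num_langs, d_model)
        self.visual_proj = nn.Linear(2048, d_model)  # e.g. pre-extracted CNN region features
        self.prompt_gen = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Stage 3: language translation (decoder attends to text features + visual prompt)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, visual_feats, tgt_lang_id, tgt_tokens):
        # Stage 1: encode source tokens
        text_h = self.text_encoder(self.token_emb(src_tokens))        # (B, S, d)
        # Stage 2: generate a language-aware visual prompt
        lang_q = self.lang_emb(tgt_lang_id).unsqueeze(1)              # (B, 1, d)
        vis = self.visual_proj(visual_feats)                          # (B, R, d)
        prompt, _ = self.prompt_gen(lang_q, vis, vis)                 # (B, 1, d)
        # Stage 3: decode the target language from text features + visual prompt
        # (causal masking omitted for brevity)
        memory = torch.cat([text_h, prompt], dim=1)
        dec_h = self.decoder(self.token_emb(tgt_tokens), memory)
        return self.out(dec_h)                                        # (B, T, vocab)


if __name__ == "__main__":
    model = LVPM3Sketch()
    logits = model(
        torch.randint(0, 32000, (2, 10)),   # source token ids
        torch.randn(2, 36, 2048),           # 36 image region features per example
        torch.tensor([0, 3]),               # target-language ids
        torch.randint(0, 32000, (2, 8)),    # target token ids (teacher forcing)
    )
    print(logits.shape)                     # torch.Size([2, 8, 32000])
```

In this sketch a single shared encoder/decoder serves all language pairs, and only the language-conditioned visual prompt varies per target language, which reflects the shared-semantic-space idea the abstract attributes to Multilingual MMT.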
Related papers
- Parrot: Multilingual Visual Instruction Tuning [66.65963606552839]
Existing methods mainly focus on aligning vision encoders with Multimodal Large Language Models (MLLMs).
We introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level.
Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks.
arXiv Detail & Related papers (2024-06-04T17:56:28Z)
- m3P: Towards Multimodal Multilingual Translation with Multimodal Prompt [39.2728779674405]
We propose m3P, a framework that leverages multimodal prompts to guide Multimodal Multilingual neural Machine Translation.
Our method aims to minimize the representation distance between different languages by regarding the image as a central language.
Experimental results show that m3P outperforms previous text-only baselines and multilingual multimodal methods by a large margin.
arXiv Detail & Related papers (2024-03-26T10:04:24Z)
- MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue [115.32009638844059]
We extend the English-only NLU++ dataset to include manual translations into a range of high-, medium-, and low-resource languages.
Because of its multi-intent property, MULTI3NLU++ represents complex and natural user goals.
We use MULTI3NLU++ to benchmark state-of-the-art multilingual models for the Natural Language Understanding tasks of intent detection and slot labelling.
arXiv Detail & Related papers (2022-12-20T17:34:25Z)
- Multilingual Multimodal Learning with Machine Translated Text [27.7207234512674]
We investigate whether machine translating English multimodal data can be an effective proxy for the lack of readily available multilingual data.
We propose two metrics for automatically removing such translations from the resulting datasets.
In experiments on five tasks across 20 languages in the IGLUE benchmark, we show that translated data can provide a useful signal for multilingual multimodal learning.
arXiv Detail & Related papers (2022-10-24T11:41:20Z)
- xGQA: Cross-Lingual Visual Question Answering [100.35229218735938]
xGQA is a new multilingual evaluation benchmark for the visual question answering task.
We extend the established English GQA dataset to 7 typologically diverse languages.
We propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual.
arXiv Detail & Related papers (2021-09-13T15:58:21Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training [119.16007395162431]
M3P is a Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training.
We show that M3P can achieve comparable results for English and new state-of-the-art results for non-English languages.
arXiv Detail & Related papers (2020-06-04T03:54:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.