Product-oriented Machine Translation with Cross-modal Cross-lingual
Pre-training
- URL: http://arxiv.org/abs/2108.11119v1
- Date: Wed, 25 Aug 2021 08:36:01 GMT
- Title: Product-oriented Machine Translation with Cross-modal Cross-lingual
Pre-training
- Authors: Yuqing Song, Shizhe Chen, Qin Jin, Wei Luo, Jun Xie, Fei Huang
- Abstract summary: Product-oriented machine translation (PMT) is essential to serve e-shoppers all over the world.
Due to the domain specialty, the PMT task is more challenging than traditional machine translation problems.
In this paper, we first construct a large-scale bilingual product description dataset called Fashion-MMT.
We design a unified product-oriented cross-modal cross-lingual model (UPOC) for pre-training and fine-tuning.
- Score: 47.18792577471746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Translating e-commercial product descriptions, a.k.a product-oriented machine
translation (PMT), is essential to serve e-shoppers all over the world.
However, due to the domain specialty, the PMT task is more challenging than
traditional machine translation problems. Firstly, there are many specialized
jargons in the product description, which are ambiguous to translate without
the product image. Secondly, product descriptions are related to the image in
more complicated ways than standard image descriptions, involving various
visual aspects such as objects, shapes, colors or even subjective styles.
Moreover, existing PMT datasets are small in scale to support the research. In
this paper, we first construct a large-scale bilingual product description
dataset called Fashion-MMT, which contains over 114k noisy and 40k manually
cleaned description translations with multiple product images. To effectively
learn semantic alignments among product images and bilingual texts in
translation, we design a unified product-oriented cross-modal cross-lingual
model (UPOC) for pre-training and fine-tuning. Experiments on the Fashion-MMT
and Multi30k datasets show that our model significantly outperforms the
state-of-the-art models, even when they are pre-trained on the same dataset. It
also benefits more from large-scale noisy data, further improving translation quality.
We will release the dataset and codes at
https://github.com/syuqings/Fashion-MMT.
Related papers
- ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT [1.5546909030871632]
This research explores how adding contextual information to models can improve translations of e-commerce data. We create ConECT, a new Czech-to-Polish e-commerce product translation dataset. We test a vision-language model (VLM), finding that visual context aids translation quality.
arXiv Detail & Related papers (2025-06-05T12:02:01Z)
- Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models [43.16111789538798]
We build parallel multilingual prompts aimed at harnessing the multilingual capabilities of large multimodal models (LMMs).
Experiments on two LMMs across 3 benchmarks show that our method, PMT2I, achieves superior performance in general, compositional, and fine-grained assessments.
arXiv Detail & Related papers (2025-01-13T06:41:23Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset [90.95948101052073]
We introduce 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese.
Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets.
Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets.
arXiv Detail & Related papers (2024-04-29T04:01:30Z)
- Exploring the Necessity of Visual Modality in Multimodal Machine Translation using Authentic Datasets [3.54128607634285]
We study the impact of the visual modality on translation efficacy by leveraging real-world translation datasets.
We find that the visual modality proves advantageous for the majority of authentic translation datasets.
Our results suggest that visual information serves a supplementary role in multimodal translation and can in many cases be substituted.
arXiv Detail & Related papers (2024-04-09T08:19:10Z)
- A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation [47.70824723223262]
We propose a new setting for generating product descriptions from images, augmented by marketing keywords.
We present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces a similar product sample as the reference.
Experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods.
arXiv Detail & Related papers (2024-02-21T07:38:29Z)
- A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation [13.426403221815063]
This paper surveys the landscape of language-and-vision pre-training from the lens of multimodal machine translation.
We summarize the common architectures, pre-training objectives, and datasets from literature and conjecture what further is needed to make progress on multimodal machine translation.
arXiv Detail & Related papers (2023-06-12T15:56:10Z)
- Exploring Better Text Image Translation with Multimodal Codebook [39.12169843196739]
Text image translation (TIT) aims to translate the source texts embedded in the image to target translations.
In this work, we first annotate a Chinese-English TIT dataset named OCRMT30K, providing convenience for subsequent studies.
Then, we propose a TIT model with a multimodal codebook, which is able to associate the image with relevant texts.
We present a multi-stage training framework involving text machine translation, image-text alignment, and TIT tasks, which fully exploits additional bilingual texts.
arXiv Detail & Related papers (2023-05-27T08:41:18Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.