Product-oriented Machine Translation with Cross-modal Cross-lingual
Pre-training
- URL: http://arxiv.org/abs/2108.11119v1
- Date: Wed, 25 Aug 2021 08:36:01 GMT
- Title: Product-oriented Machine Translation with Cross-modal Cross-lingual
Pre-training
- Authors: Yuqing Song, Shizhe Chen, Qin Jin, Wei Luo, Jun Xie, Fei Huang
- Abstract summary: Product-oriented machine translation (PMT) is essential to serve e-shoppers all over the world.
Due to the domain specialty, the PMT task is more challenging than traditional machine translation problems.
In this paper, we first construct a large-scale bilingual product description dataset called Fashion-MMT.
We design a unified product-oriented cross-modal cross-lingual model (UPOC) for pre-training and fine-tuning.
- Score: 47.18792577471746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Translating e-commercial product descriptions, a.k.a product-oriented machine
translation (PMT), is essential to serve e-shoppers all over the world.
However, due to the domain specialty, the PMT task is more challenging than
traditional machine translation problems. Firstly, there are many specialized
jargons in the product description, which are ambiguous to translate without
the product image. Secondly, product descriptions are related to the image in
more complicated ways than standard image descriptions, involving various
visual aspects such as objects, shapes, colors or even subjective styles.
Moreover, existing PMT datasets are small in scale to support the research. In
this paper, we first construct a large-scale bilingual product description
dataset called Fashion-MMT, which contains over 114k noisy and 40k manually
cleaned description translations with multiple product images. To effectively
learn semantic alignments among product images and bilingual texts in
translation, we design a unified product-oriented cross-modal cross-lingual
model (UPOC) for pre-training and fine-tuning. Experiments on the Fashion-MMT
and Multi30k datasets show that our model significantly outperforms the
state-of-the-art models, even when they are pre-trained on the same dataset. It
also benefits more from large-scale noisy data, further improving translation quality.
We will release the dataset and codes at
https://github.com/syuqings/Fashion-MMT.
Related papers
- ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT [1.5546909030871632]
This research explores how adding contextual information to models can improve translations of e-commerce data. We create ConECT, a new Czech-to-Polish e-commerce product translation dataset. We test a vision-language model (VLM), finding that visual context aids translation quality.
arXiv Detail & Related papers (2025-06-05T12:02:01Z)
- Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models [43.16111789538798]
We build parallel multilingual prompts aimed at harnessing the multilingual capabilities of large multimodal models (LMMs).
Experiments on two LMMs across 3 benchmarks show that our method, PMT2I, achieves superior performance in general, compositional, and fine-grained assessments.
arXiv Detail & Related papers (2025-01-13T06:41:23Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset [90.95948101052073]
We introduce 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese.
Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets.
Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets.
arXiv Detail & Related papers (2024-04-29T04:01:30Z)
- Exploring the Necessity of Visual Modality in Multimodal Machine Translation using Authentic Datasets [3.54128607634285]
We study the impact of the visual modality on translation efficacy by leveraging real-world translation datasets.
We find that the visual modality proves advantageous for the majority of authentic translation datasets.
Our results suggest that visual information serves a supplementary role in multimodal translation and can in many cases be substituted.
arXiv Detail & Related papers (2024-04-09T08:19:10Z)
- A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation [47.70824723223262]
We propose a new setting for generating product descriptions from images, augmented by marketing keywords.
We present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces a similar product sample as the reference.
Experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods.
arXiv Detail & Related papers (2024-02-21T07:38:29Z)
- A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation [13.426403221815063]
This paper surveys the landscape of language-and-vision pre-training from the lens of multimodal machine translation.
We summarize the common architectures, pre-training objectives, and datasets from literature and conjecture what further is needed to make progress on multimodal machine translation.
arXiv Detail & Related papers (2023-06-12T15:56:10Z)
- Exploring Better Text Image Translation with Multimodal Codebook [39.12169843196739]
Text image translation (TIT) aims to translate the source texts embedded in the image to target translations.
In this work, we first annotate a Chinese-English TIT dataset named OCRMT30K, providing convenience for subsequent studies.
Then, we propose a TIT model with a multimodal codebook, which is able to associate the image with relevant texts.
We present a multi-stage training framework involving text machine translation, image-text alignment, and TIT tasks, which fully exploits additional bilingual texts.
arXiv Detail & Related papers (2023-05-27T08:41:18Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.