MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
- URL: http://arxiv.org/abs/2412.07147v2
- Date: Mon, 16 Dec 2024 09:28:53 GMT
- Title: MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
- Authors: Bo Li, Shaolin Zhu, Lijie Wen
- Abstract summary: We introduce MIT-10M, a large-scale parallel corpus of multilingual image translation with over 10M image-text pairs derived from real-world data.
It contains 840K images in three sizes, 28 categories, tasks at three difficulty levels, and image-text pairs in 14 languages, a considerable improvement over existing datasets.
- Score: 19.068161657967007
- Abstract: Image Translation (IT) holds immense potential across diverse domains, enabling the translation of textual content within images into various languages. However, existing datasets often suffer from limitations in scale, diversity, and quality, hindering the development and evaluation of IT models. To address this issue, we introduce MIT-10M, a large-scale parallel corpus of multilingual image translation with over 10M image-text pairs derived from real-world data, which has undergone extensive data cleaning and multilingual translation validation. It contains 840K images in three sizes, 28 categories, tasks at three difficulty levels, and image-text pairs in 14 languages, a considerable improvement over existing datasets. We conduct extensive experiments to evaluate and train models on MIT-10M. The experimental results clearly indicate that our dataset is better suited to evaluating how models handle the challenging and complex image translation tasks that arise in the real world. Moreover, the model fine-tuned on MIT-10M triples the performance of the baseline model, further confirming its superiority.
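As a rough illustration of what one MIT-10M record might look like, here is a minimal Python sketch; the field names are assumptions based on the abstract (image size, category, difficulty, per-language translations), not the released schema:

```python
from dataclasses import dataclass, field

# Hypothetical layout of one MIT-10M record; the field names are guesses
# based on the abstract, not the released schema.
@dataclass
class MIT10MExample:
    image_path: str                  # one of the three released image sizes
    category: str                    # one of 28 categories
    difficulty: str                  # one of three levels, e.g. "easy" | "medium" | "hard"
    source_text: str                 # text appearing in the image
    translations: dict[str, str] = field(default_factory=dict)  # lang code -> text

example = MIT10MExample(
    image_path="images/large/000001.jpg",
    category="signage",
    difficulty="medium",
    source_text="Ausgang",
    translations={"en": "Exit", "zh": "出口"},
)
print(example.translations["en"])  # -> "Exit"
```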
Related papers
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset is 15 times larger in scale while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
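"Interleaved" in the OmniCorpus entry above means each document preserves the original ordering of text spans and images, rather than storing isolated caption pairs. A minimal sketch of such a record, with illustrative names that are not OmniCorpus's actual format:

```python
from dataclasses import dataclass
from typing import Union

# Illustrative record for an interleaved document: an ordered mix of text
# spans and image references, not OmniCorpus's actual on-disk format.
@dataclass
class TextSegment:
    text: str

@dataclass
class ImageSegment:
    image_url: str

Document = list[Union[TextSegment, ImageSegment]]

doc: Document = [
    TextSegment("The Eiffel Tower at night:"),
    ImageSegment("https://example.com/eiffel.jpg"),
    TextSegment("It was completed in 1889."),
]
print(len(doc), "segments, order preserved")
```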
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset [90.95948101052073]
We introduce 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese.
Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets.
Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets.
arXiv Detail & Related papers (2024-04-29T04:01:30Z)
- Exploring Better Text Image Translation with Multimodal Codebook [39.12169843196739]
Text image translation (TIT) aims to translate source texts embedded in images into a target language.
In this work, we first annotate a Chinese-English TIT dataset named OCRMT30K, facilitating subsequent studies.
Then, we propose a TIT model with a multimodal codebook, which is able to associate the image with relevant texts.
We present a multi-stage training framework involving text machine translation, image-text alignment, and TIT tasks, which fully exploits additional bilingual texts.
arXiv Detail & Related papers (2023-05-27T08:41:18Z)
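A minimal sketch of how a codebook can associate image features with a shared discrete space, in the spirit of the multimodal codebook in the entry above; this assumes standard nearest-neighbor vector quantization, not necessarily the paper's exact architecture:

```python
import torch

# Minimal sketch of a codebook lookup: each continuous feature is snapped
# to its nearest entry in a fixed table of discrete codes. This is generic
# vector quantization, not necessarily the paper's exact mechanism.
codebook = torch.randn(512, 256)   # 512 discrete codes, 256-dim each

def quantize(features: torch.Tensor) -> torch.Tensor:
    """Snap each feature vector to its nearest codebook entry."""
    dists = torch.cdist(features, codebook)  # (num_features, num_codes)
    codes = dists.argmin(dim=-1)             # index of the nearest code
    return codebook[codes]

image_feats = torch.randn(10, 256)   # e.g. 10 visual patch features
quantized = quantize(image_feats)    # same shape, values drawn from the codebook
print(quantized.shape)               # torch.Size([10, 256])
```

Because text features can be quantized against the same table, image and text meet in a shared discrete space, which is roughly the property that lets a codebook associate an image with relevant texts.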
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), which works with data from different tasks.
UMLNMT achieves substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
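One common way to make a single model work with data from different tasks, though not necessarily UMLNMT's exact mechanism, is to prepend a task tag to each source sentence so the shared model can condition on it. A minimal sketch with an illustrative tag vocabulary:

```python
# Hypothetical task-tag conditioning for one shared NMT model; the tag
# vocabulary and function below are illustrative, not taken from the paper.
TASK_TAGS = {"news": "<news>", "chat": "<chat>", "subtitle": "<subtitle>"}

def tag_source(task: str, source: str) -> str:
    """Prefix the source sentence with its task tag before tokenization."""
    return f"{TASK_TAGS[task]} {source}"

print(tag_source("chat", "Wie geht es dir?"))  # -> "<chat> Wie geht es dir?"
```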
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning [19.203716881791312]
We introduce the Wikipedia-based Image Text (WIT) dataset.
WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages.
WIT is the largest multimodal dataset by number of image-text examples, exceeding the next largest by 3x.
arXiv Detail & Related papers (2021-03-02T18:13:54Z)
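WIT is distributed as tab-separated shards, so a language-specific slice can be pulled with ordinary tabular tools. A minimal sketch; the shard file name and the `language` / `image_url` column names are assumptions to verify against the actual release:

```python
import pandas as pd

# Minimal sketch of slicing one WIT shard to a single language. The shard
# file name and the "language" / "image_url" column names are assumptions
# to check against the actual release.
shard = pd.read_csv(
    "wit_v1.train.all-00000-of-00010.tsv.gz",
    sep="\t",
    compression="gzip",
)
german = shard[shard["language"] == "de"]
print(len(german), "German examples")
print(german["image_url"].head())
```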
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively with the best single systems at WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
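The Many-to-Many model and training data described above were open-sourced as M2M-100. Assuming the Hugging Face port of the released checkpoint (`facebook/m2m100_418M`), direct translation between two non-English languages looks like this minimal sketch:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Minimal sketch of many-to-many translation with the released M2M-100
# checkpoint: French goes straight to Chinese, with no English pivot.
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "fr"                      # source language code
encoded = tokenizer("La vie est comme une boîte de chocolat.", return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("zh"),  # target language token
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

The `forced_bos_token_id` argument selects the target language token, which is what lets a single checkpoint cover every one of the 100x99 translation directions.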
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.