Exploring Better Text Image Translation with Multimodal Codebook
- URL: http://arxiv.org/abs/2305.17415v2
- Date: Fri, 2 Jun 2023 12:38:37 GMT
- Title: Exploring Better Text Image Translation with Multimodal Codebook
- Authors: Zhibin Lan, Jiawei Yu, Xiang Li, Wen Zhang, Jian Luan, Bin Wang, Degen Huang, Jinsong Su
- Abstract summary: Text image translation (TIT) aims to translate the source texts embedded in the image to target translations.
In this work, we first annotate a Chinese-English TIT dataset named OCRMT30K to facilitate subsequent studies.
Then, we propose a TIT model with a multimodal codebook, which is able to associate the image with relevant texts.
We present a multi-stage training framework involving text machine translation, image-text alignment, and TIT tasks, which fully exploits additional bilingual texts.
- Score: 39.12169843196739
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text image translation (TIT) aims to translate the source texts embedded in
an image into target translations, which has a wide range of applications and
thus has important research value. However, current studies on TIT are
confronted with two main bottlenecks: 1) the task lacks a publicly available
TIT dataset, and 2) dominant models are constructed in a cascaded manner, which
tends to suffer from error propagation from optical character recognition
(OCR). In this work, we first annotate a Chinese-English TIT dataset named
OCRMT30K to facilitate subsequent studies. Then, we propose a TIT
model with a multimodal codebook, which is able to associate the image with
relevant texts, providing useful supplementary information for translation.
Moreover, we present a multi-stage training framework involving text machine
translation, image-text alignment, and TIT tasks, which fully exploits
additional bilingual texts, an OCR dataset, and our OCRMT30K dataset to train
our model. Extensive experiments and in-depth analyses strongly demonstrate the
effectiveness of our proposed model and training framework.
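The multimodal codebook is the paper's central component. The sketch below is a minimal, hypothetical illustration (PyTorch; all module names, sizes, and the VQ-style nearest-code lookup are assumptions, not the authors' released code) of how such a codebook might quantize image features into learned latent codes whose embeddings supply text-related context to the translation decoder.

```python
# Illustrative sketch of a multimodal codebook (assumed design, not the paper's code):
# image features are quantized against a learned codebook, and the retrieved code
# embeddings act as text-related context for the translation decoder.
import torch
import torch.nn as nn


class MultimodalCodebook(nn.Module):
    def __init__(self, num_codes: int = 2048, dim: int = 512):
        super().__init__()
        # One learnable latent vector per code (hypothetical size).
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, regions, dim), e.g. pooled visual features.
        flat = image_feats.reshape(-1, image_feats.size(-1))
        # Nearest-code lookup, as in standard vector quantization.
        dists = torch.cdist(flat, self.codes.weight)          # (batch*regions, num_codes)
        idx = dists.argmin(dim=-1).view(image_feats.shape[:-1])
        # Retrieved code embeddings serve as supplementary translation context.
        return self.codes(idx)                                 # (batch, regions, dim)


# Toy usage: two images, eight regions each, 512-d features.
codebook = MultimodalCodebook()
context = codebook(torch.randn(2, 8, 512))
print(context.shape)  # torch.Size([2, 8, 512])
```

Per the abstract, this component is trained within the multi-stage framework (text machine translation, image-text alignment, then TIT), which lets the model draw on bilingual texts, an OCR dataset, and OCRMT30K rather than on TIT data alone.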
Related papers
- AnyTrans: Translate AnyText in the Image with Large Scale Models [88.5887934499388]
This paper introduces AnyTrans, an all-encompassing framework for the task of Translate AnyText in the Image (TATI).
Our framework incorporates contextual cues from both textual and visual elements during translation.
We have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.
arXiv Detail & Related papers (2024-06-17T11:37:48Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Translation-Enhanced Multilingual Text-to-Image Generation [61.41730893884428]
Research on text-to-image generation (TTI) still predominantly focuses on the English language.
In this work, we thus investigate multilingual TTI (mTTI) and the current potential of neural machine translation (NMT) to bootstrap mTTI systems.
We propose Ensemble Adapter (EnsAd), a novel parameter-efficient approach that learns to weigh and consolidate the multilingual text knowledge within the mTTI framework.
arXiv Detail & Related papers (2023-05-30T17:03:52Z)
- E2TIMT: Efficient and Effective Modal Adapter for Text Image Machine Translation [40.62692548291319]
Text image machine translation (TIMT) aims to translate texts embedded in images from one source language to another target language.
Existing methods, both two-stage cascade and one-stage end-to-end architectures, suffer from different issues.
We propose an end-to-end TIMT model that makes full use of the knowledge from existing OCR and MT datasets.
arXiv Detail & Related papers (2023-05-09T04:25:52Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation (MMT) aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- Improving End-to-End Text Image Translation From the Auxiliary Text Translation Task [26.046624228278528]
We propose a novel text-translation-enhanced text image translation method, which trains the end-to-end model with text translation as an auxiliary task.
By sharing model parameters and multi-task training, the model can take full advantage of easily available large-scale parallel text corpora (see the sketch after this list).
arXiv Detail & Related papers (2022-10-08T02:35:45Z)
- WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning [19.203716881791312]
We introduce the Wikipedia-based Image Text (WIT) dataset.
WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages.
At the time of writing, WIT is the largest multimodal dataset by number of image-text examples, exceeding the next largest by 3x.
arXiv Detail & Related papers (2021-03-02T18:13:54Z)
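As a side note on the auxiliary-task idea in the end-to-end text image translation entry above, the following is a minimal, hypothetical sketch (PyTorch; all names, shapes, and the 1.0 weight are invented for illustration) of how a text-translation loss and a text-image-translation loss can share decoder parameters in one joint objective, so that large parallel text corpora also supervise the shared parameters.

```python
import torch
import torch.nn as nn

# Hypothetical shared decoder head: the TIT branch and the auxiliary text-MT
# branch feed it inputs of the same width, so one set of parameters receives
# gradients from both tasks.
dim, vocab = 32, 100
shared_decoder = nn.Linear(dim, vocab)

def task_loss(decoder_inputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    logits = shared_decoder(decoder_inputs)                    # (batch, seq, vocab)
    return nn.functional.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))

# Fake batches: image-derived features for TIT, text-derived features for the auxiliary MT task.
tit_loss = task_loss(torch.randn(2, 5, dim), torch.randint(vocab, (2, 5)))
mt_loss = task_loss(torch.randn(8, 5, dim), torch.randint(vocab, (8, 5)))

# Joint multi-task objective; 1.0 is an illustrative auxiliary-task weight.
(tit_loss + 1.0 * mt_loss).backward()
```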