E2TIMT: Efficient and Effective Modal Adapter for Text Image Machine Translation
- URL: http://arxiv.org/abs/2305.05166v2
- Date: Wed, 10 May 2023 02:37:32 GMT
- Title: E2TIMT: Efficient and Effective Modal Adapter for Text Image Machine Translation
- Authors: Cong Ma, Yaping Zhang, Mei Tu, Yang Zhao, Yu Zhou, Chengqing Zong
- Abstract summary: Text image machine translation (TIMT) aims to translate texts embedded in images from a source language to a target language.
Existing methods, both two-stage cascade and one-stage end-to-end architectures, suffer from different issues.
We propose an end-to-end TIMT model that makes full use of the knowledge in existing OCR and MT datasets.
- Score: 40.62692548291319
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text image machine translation (TIMT) aims to translate texts
embedded in images from a source language to a target language. Existing
methods, both two-stage cascade and one-stage end-to-end architectures,
suffer from different issues. Cascade models can benefit from large-scale
optical character recognition (OCR) and MT datasets, but the two-stage
architecture is redundant. End-to-end models are efficient but suffer from a
deficiency of training data. To this end, we propose an end-to-end TIMT
model that makes full use of the knowledge in existing OCR and MT datasets,
pursuing a framework that is both effective and efficient. More
specifically, we build a novel modal adapter that effectively bridges the
OCR encoder and the MT decoder. An end-to-end TIMT loss and a cross-modal
contrastive loss are used jointly to align the feature distributions of the
OCR and MT tasks. Extensive experiments show that the proposed method
outperforms existing two-stage cascade models and one-stage end-to-end
models with a lighter and faster architecture. Furthermore, ablation studies
verify the generalization of our method: the proposed modal adapter
effectively bridges various OCR and MT models.
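To make the training recipe concrete, here is a minimal PyTorch-style sketch of the objective described above, assuming an adapter module between an OCR encoder and an MT decoder and an InfoNCE-style form for the contrastive term; all module names, shapes, and hyperparameters are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (illustrative names/shapes): a modal adapter maps OCR
# encoder features into the MT decoder's input space; training combines a
# translation cross-entropy with a cross-modal contrastive term.
import torch
import torch.nn.functional as F
from torch import nn

class ModalAdapter(nn.Module):
    """Bridges OCR encoder outputs to the MT decoder's feature space."""
    def __init__(self, ocr_dim: int, mt_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(ocr_dim, mt_dim), nn.GELU(), nn.Linear(mt_dim, mt_dim)
        )

    def forward(self, ocr_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(ocr_feats)            # (batch, seq, mt_dim)

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE over pooled sentence-level features."""
    img = F.normalize(img_feats.mean(dim=1), dim=-1)   # (batch, d)
    txt = F.normalize(txt_feats.mean(dim=1), dim=-1)   # (batch, d)
    logits = img @ txt.t() / temperature               # (batch, batch)
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def joint_loss(timt_logits, target_ids, adapted_feats, mt_enc_feats, alpha=1.0):
    # timt_logits: (batch, tgt_len, vocab) from the MT decoder fed with
    # adapted OCR features; mt_enc_feats: MT encoder output on source text.
    ce = F.cross_entropy(timt_logits.transpose(1, 2), target_ids)
    return ce + alpha * contrastive_loss(adapted_feats, mt_enc_feats)
```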
Related papers
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing texts in the source language into an image containing translations in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing [72.56219471145232]
We propose an ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
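As a generic illustration of hard parameter sharing (not this paper's exact setup), a single encoder-decoder can serve both the ST and MT tasks, with the pre-processing stage assumed to map speech into a text-like feature space; all names below are hypothetical.

```python
# Generic hard parameter sharing: one shared encoder-decoder receives
# gradients from both the ST and MT tasks; all names are illustrative.
import torch
from torch import nn

class SharedSTMTModel(nn.Module):
    def __init__(self, d_model: int = 512, vocab_size: int = 32000):
        super().__init__()
        self.backbone = nn.Transformer(d_model=d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_feats, tgt_embeds):
        # src_feats: pre-processed speech features (ST) or text embeddings
        # (MT), both already projected to d_model to shrink the modality gap.
        return self.out(self.backbone(src_feats, tgt_embeds))

def multitask_loss(model, st_batch, mt_batch, loss_fn):
    # Hard sharing: the same parameters are updated by both task losses.
    st_src, st_tgt, st_labels = st_batch
    mt_src, mt_tgt, mt_labels = mt_batch
    return (loss_fn(model(st_src, st_tgt), st_labels) +
            loss_fn(model(mt_src, mt_tgt), mt_labels))
```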
arXiv Detail & Related papers (2023-09-27T17:48:14Z)
- Exploring Better Text Image Translation with Multimodal Codebook [39.12169843196739]
Text image translation (TIT) aims to translate the source texts embedded in an image into the target language.
In this work, we first annotate a Chinese-English TIT dataset named OCRMT30K to facilitate subsequent studies.
Then, we propose a TIT model with a multimodal codebook, which is able to associate the image with relevant texts.
We present a multi-stage training framework involving text machine translation, image-text alignment, and TIT tasks, which fully exploits additional bilingual texts.
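A codebook of this kind is often realized with vector quantization: continuous image features are snapped to their nearest discrete code entries, which can then be associated with text. The sketch below shows that generic mechanism; the shapes and the straight-through trick are assumptions, not necessarily this paper's design.

```python
# Generic vector-quantization-style codebook lookup: each image feature is
# mapped to its nearest learned code, giving discrete latents that can be
# associated with text. Shapes and names are illustrative assumptions.
import torch
from torch import nn

class Codebook(nn.Module):
    def __init__(self, num_codes: int = 1024, dim: int = 512):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, feats):                             # (batch, seq, dim)
        flat = feats.reshape(-1, feats.size(-1))          # (B*S, dim)
        dists = torch.cdist(flat, self.codes.weight)      # (B*S, num_codes)
        ids = dists.argmin(dim=-1)                        # nearest code index
        quantized = self.codes(ids).view_as(feats)
        # Straight-through estimator so gradients flow to the encoder.
        return feats + (quantized - feats).detach(), ids
```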
arXiv Detail & Related papers (2023-05-27T08:41:18Z)
- Multi-Teacher Knowledge Distillation For Text Image Machine Translation [40.62692548291319]
We propose a novel Multi-Teacher Knowledge Distillation (MTKD) method to effectively distill knowledge into the end-to-end TIMT model from the pipeline model.
Our proposed MTKD effectively improves text image translation performance and outperforms existing end-to-end and pipeline models.
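Generically, a multi-teacher distillation term averages per-teacher KL divergences between the student's and each teacher's output distributions; the temperature and uniform weighting below are illustrative assumptions, not this paper's reported settings.

```python
# Generic multi-teacher distillation term: average KL between the student's
# tempered distribution and each teacher's. Temperature/weights are assumed.
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, T: float = 2.0):
    # logits: (batch, vocab); flatten sequence dims beforehand if needed.
    log_p = F.log_softmax(student_logits / T, dim=-1)
    kd = sum(F.kl_div(log_p, F.softmax(t / T, dim=-1), reduction="batchmean")
             for t in teacher_logits_list) / len(teacher_logits_list)
    return kd * (T * T)   # standard temperature scaling of the KD term
```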
arXiv Detail & Related papers (2023-05-09T07:41:17Z)
- Improving Cross-modal Alignment for Text-Guided Image Inpainting [36.1319565907582]
Text-guided image inpainting (TGII) aims to restore the missing regions of a damaged image based on a given text.
We propose a novel model for TGII by improving cross-modal alignment.
Our model achieves state-of-the-art performance compared with other strong competitors.
arXiv Detail & Related papers (2023-01-26T19:18:27Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation (MMT) aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access to and use of visual information, and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require large amounts of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z)
- Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval [80.35589927511667]
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image.
We propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient retrieval model.
Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups, demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.
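The cooperative retrieve-then-rerank pattern can be sketched generically: a bi-encoder shortlists candidates from precomputed embeddings, and a heavier joint (cross-attention) scorer reorders only the shortlist. The interfaces `bi_encode_text` and `cross_score` below are hypothetical, not this paper's API.

```python
# Retrieve fast, rerank smart (generic sketch): a bi-encoder shortlists via
# precomputed, normalized image embeddings; a cross-encoder rescores only
# the top-k. `bi_encode_text` and `cross_score` are hypothetical callables.
import torch
import torch.nn.functional as F

def retrieve_then_rerank(query_text, image_embs, images,
                         bi_encode_text, cross_score, k: int = 20):
    # Stage 1 (fast): cosine similarity against all image embeddings.
    q = F.normalize(bi_encode_text(query_text), dim=-1)   # (d,)
    sims = image_embs @ q                                 # (num_images,)
    topk = sims.topk(k).indices
    # Stage 2 (smart): joint scoring of the k shortlisted candidates only.
    scores = torch.stack([cross_score(query_text, images[i]) for i in topk])
    return topk[scores.argsort(descending=True)]          # reranked indices
```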
arXiv Detail & Related papers (2021-03-22T15:08:06Z)
- Tight Integrated End-to-End Training for Cascaded Speech Translation [40.76367623739673]
A cascaded speech translation model relies on discrete and non-differentiable transcription.
Direct speech translation is an alternative method to avoid error propagation.
This work explores the feasibility of collapsing the entire cascade components into a single end-to-end trainable model.
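In the spirit of this line of work, one way to make a cascade end-to-end trainable is to replace the discrete transcript with the ASR posterior distribution and feed the MT model an expected source embedding; the sketch below assumes, for illustration, that the ASR output vocabulary matches the MT source vocabulary.

```python
# Tightly integrated cascade (generic sketch): feed MT the expectation of
# source-token embeddings under the ASR posteriors instead of a decoded,
# non-differentiable transcript. Assumes shared ASR/MT source vocabulary.
import torch

def tight_cascade_forward(asr_logits, mt_embed_weight, mt_model, tgt):
    # asr_logits: (batch, src_len, vocab); mt_embed_weight: (vocab, d_model)
    posteriors = torch.softmax(asr_logits, dim=-1)
    soft_src_embeds = posteriors @ mt_embed_weight   # expected embeddings
    return mt_model(soft_src_embeds, tgt)            # gradients flow end to end
```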
arXiv Detail & Related papers (2020-11-24T15:43:49Z)