TMT: Tri-Modal Translation between Speech, Image, and Text by Processing
Different Modalities as Different Languages
- URL: http://arxiv.org/abs/2402.16021v1
- Date: Sun, 25 Feb 2024 07:46:57 GMT
- Title: TMT: Tri-Modal Translation between Speech, Image, and Text by Processing
Different Modalities as Different Languages
- Authors: Minsu Kim, Jee-weon Jung, Hyeongseop Rha, Soumi Maiti, Siddhant Arora,
Xuankai Chang, Shinji Watanabe, Yong Man Ro
- Abstract summary: Tri-Modal Translation (TMT) model translates between arbitrary modalities spanning speech, image, and text.
We tokenize speech and image data into discrete tokens, which provide a unified interface across modalities.
TMT outperforms single model counterparts consistently.
- Score: 96.8603701943286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The capability to jointly process multi-modal information is becoming
essential. However, the limited amount of paired multi-modal data and the large
computational requirements of multi-modal learning hinder development. We propose
a novel Tri-Modal Translation (TMT) model that
translates between arbitrary modalities spanning speech, image, and text. We
introduce a novel viewpoint, where we interpret different modalities as
different languages, and treat multi-modal translation as a well-established
machine translation problem. To this end, we tokenize speech and image data
into discrete tokens, which provide a unified interface across modalities and
significantly decrease the computational cost. In the proposed TMT, a
multi-modal encoder-decoder conducts the core translation, whereas
modality-specific processing is conducted only within the tokenization and
detokenization stages. We evaluate the proposed TMT on all six modality
translation tasks. TMT outperforms single model counterparts consistently,
demonstrating that unifying tasks is beneficial not only for practicality but
also for performance.
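The abstract's core mechanism, a single encoder-decoder over discrete tokens with modality-specific tokenization and detokenization at the boundaries, can be illustrated with a short PyTorch sketch. Everything below (the stub tokenizers, vocabulary sizes, and dimensions) is an illustrative assumption, not the paper's actual components.
```python
# Minimal sketch (not the paper's implementation) of the "modalities as
# languages" idea: speech and images are first mapped to discrete tokens,
# a single shared encoder-decoder translates between token sequences, and
# modality-specific work is confined to (de)tokenization. The tokenizers
# below are random stand-ins; vocabulary sizes and dimensions are assumptions.
import torch
import torch.nn as nn

VOCAB = {"speech": 1024, "image": 1024, "text": 8000}   # assumed vocab sizes
PAD, BOS = 0, 1
D_MODEL = 256


def stub_speech_tokenizer(waveform: torch.Tensor) -> torch.Tensor:
    # Stand-in for a discrete speech tokenizer (e.g. quantised SSL features).
    n_frames = waveform.shape[-1] // 320                 # ~20 ms hops at 16 kHz
    return torch.randint(2, VOCAB["speech"], (n_frames,))


def stub_image_tokenizer(image: torch.Tensor) -> torch.Tensor:
    # Stand-in for a discrete image tokenizer (e.g. a VQ autoencoder).
    return torch.randint(2, VOCAB["image"], (16 * 16,))


class TriModalTranslator(nn.Module):
    """One encoder-decoder shared by all six translation directions."""

    def __init__(self) -> None:
        super().__init__()
        # Per-modality embedding tables, analogous to per-language vocabularies.
        self.src_emb = nn.ModuleDict(
            {m: nn.Embedding(v, D_MODEL, padding_idx=PAD) for m, v in VOCAB.items()})
        self.tgt_emb = nn.ModuleDict(
            {m: nn.Embedding(v, D_MODEL, padding_idx=PAD) for m, v in VOCAB.items()})
        self.core = nn.Transformer(d_model=D_MODEL, nhead=4, num_encoder_layers=2,
                                   num_decoder_layers=2, batch_first=True)
        self.out = nn.ModuleDict(
            {m: nn.Linear(D_MODEL, v) for m, v in VOCAB.items()})

    def forward(self, src_tokens, src_mod, tgt_tokens, tgt_mod):
        hidden = self.core(self.src_emb[src_mod](src_tokens),
                           self.tgt_emb[tgt_mod](tgt_tokens))
        return self.out[tgt_mod](hidden)          # logits over the target vocabulary


if __name__ == "__main__":
    model = TriModalTranslator()
    speech = stub_speech_tokenizer(torch.randn(1, 16000)).unsqueeze(0)   # (1, 50)
    image = stub_image_tokenizer(torch.randn(3, 224, 224)).unsqueeze(0)  # (1, 256)
    text_prefix = torch.tensor([[BOS, 5, 6, 7]])                         # (1, 4)
    print(model(speech, "speech", text_prefix, "text").shape)            # speech -> text
    print(model(image, "image", text_prefix, "text").shape)              # image -> text
```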
Related papers
- EMMeTT: Efficient Multimodal Machine Translation Training [26.295981183965566]
We propose a joint multimodal training regime for Speech-LLM that includes automatic speech translation (AST).
To handle joint multimodal training, we propose a novel training framework called EMMeTT.
The resultant Multimodal Translation Model produces strong text and speech translation results at the same time.
arXiv Detail & Related papers (2024-09-20T14:03:23Z)
- Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing [72.56219471145232]
We propose an ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
arXiv Detail & Related papers (2023-09-27T17:48:14Z)
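The hard parameter sharing described in the Cross-Modal Multi-Tasking entry above can be pictured as one encoder-decoder whose weights serve both MT (text input) and ST (speech input), with only a thin front-end differing per task. In the sketch below, the convolutional downsampler standing in for the pre-processing stage, and all sizes, are illustrative assumptions rather than that paper's configuration.
```python
# Hedged sketch of hard parameter sharing between speech translation (ST) and
# machine translation (MT): both tasks use the very same encoder-decoder
# weights, and only a small front-end differs per input type.
import torch
import torch.nn as nn

D_MODEL, TGT_VOCAB, SRC_TEXT_VOCAB = 256, 8000, 8000


class SharedSTMT(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # Task-specific front-ends: text embedding for MT, a length-reducing
        # conv stack for ST speech features (shrinks the speech-text length gap).
        self.text_emb = nn.Embedding(SRC_TEXT_VOCAB, D_MODEL)
        self.speech_frontend = nn.Sequential(
            nn.Conv1d(80, D_MODEL, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(D_MODEL, D_MODEL, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Hard-shared translation parameters used by BOTH tasks.
        self.tgt_emb = nn.Embedding(TGT_VOCAB, D_MODEL)
        self.core = nn.Transformer(d_model=D_MODEL, nhead=4,
                                   num_encoder_layers=2, num_decoder_layers=2,
                                   batch_first=True)
        self.proj = nn.Linear(D_MODEL, TGT_VOCAB)

    def forward(self, src, tgt_tokens, task: str):
        if task == "mt":                     # src: (B, S) source-text token ids
            enc_in = self.text_emb(src)
        else:                                # "st", src: (B, T, 80) filterbanks
            enc_in = self.speech_frontend(src.transpose(1, 2)).transpose(1, 2)
        dec_in = self.tgt_emb(tgt_tokens)
        return self.proj(self.core(enc_in, dec_in))


if __name__ == "__main__":
    model = SharedSTMT()
    tgt = torch.tensor([[1, 42, 43]])
    mt_logits = model(torch.randint(2, SRC_TEXT_VOCAB, (1, 10)), tgt, task="mt")
    st_logits = model(torch.randn(1, 200, 80), tgt, task="st")
    print(mt_logits.shape, st_logits.shape)   # both (1, 3, 8000)
```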
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), which works with data from different tasks.
UMLNMT results in substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation [72.6667341525552]
We present a new MMT approach based on a strong text-only MT model, which uses neural adapters and a novel guided self-attention mechanism.
We also introduce CoMMuTE, a Contrastive Multimodal Translation Evaluation set of ambiguous sentences and their possible translations.
Our approach obtains competitive results compared to strong text-only models on standard English-to-French, English-to-German and English-to-Czech benchmarks.
arXiv Detail & Related papers (2022-12-20T10:18:18Z)
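For the Tackling Ambiguity with Images entry above, the general adapter idea can be sketched as a frozen text-only encoder layer plus a small trainable bottleneck, with projected image features injected as an extra token. The paper's guided self-attention mechanism is not reproduced here; the visual-token injection and the 2048-dimensional feature size are assumptions for illustration only.
```python
# Sketch of the adapter idea: keep a strong text-only MT layer frozen and
# train only small bottleneck adapters, while projected image features are
# made available to the encoder as a prepended "visual token".
import torch
import torch.nn as nn

D_MODEL = 512


class BottleneckAdapter(nn.Module):
    """Small residual bottleneck trained while the base MT layer stays frozen."""

    def __init__(self, d_model: int = D_MODEL, bottleneck: int = 64) -> None:
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))   # residual update


class AdaptedEncoderLayer(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.base = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        for p in self.base.parameters():               # freeze the text-only layer
            p.requires_grad = False
        self.adapter = BottleneckAdapter()
        self.visual_proj = nn.Linear(2048, D_MODEL)    # pooled image features (assumed size)

    def forward(self, text_states, image_feats):
        vis = self.visual_proj(image_feats).unsqueeze(1)   # (B, 1, D)
        fused = torch.cat([vis, text_states], dim=1)       # prepend the visual token
        return self.adapter(self.base(fused))


if __name__ == "__main__":
    layer = AdaptedEncoderLayer()
    out = layer(torch.randn(2, 12, D_MODEL), torch.randn(2, 2048))
    print(out.shape)                                   # (2, 13, 512)
```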
- A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
arXiv Detail & Related papers (2020-07-17T04:06:09Z)
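For the graph-based fusion encoder above, the unified multi-modal graph can be sketched as sentence tokens and image regions forming one node set, updated by stacked fusion layers that aggregate over graph neighbours. The GCN-style update and the fully connected topology below are generic stand-ins, not that paper's exact fusion layer or edge definition.
```python
# Conceptual sketch of a unified multi-modal graph: text tokens and image
# regions become nodes of one graph, and stacked fusion layers update node
# states by aggregating over neighbours.
import torch
import torch.nn as nn

D_MODEL = 256


class GraphFusionLayer(nn.Module):
    def __init__(self, d_model: int = D_MODEL) -> None:
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Row-normalised neighbour aggregation followed by a shared projection.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        agg = (adj @ nodes) / deg
        return torch.relu(nodes + self.linear(agg))    # residual node update


if __name__ == "__main__":
    n_text, n_regions = 10, 5                          # sentence tokens + image regions
    nodes = torch.randn(1, n_text + n_regions, D_MODEL)
    # Fully connected graph over all nodes (illustrative; the paper defines
    # its own intra- and inter-modal edges).
    adj = torch.ones(1, n_text + n_regions, n_text + n_regions)
    layers = nn.ModuleList([GraphFusionLayer() for _ in range(2)])   # stacked fusion layers
    for layer in layers:
        nodes = layer(nodes, adj)
    print(nodes.shape)                                 # (1, 15, 256)
```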
- Towards Multimodal Simultaneous Neural Machine Translation [28.536262015508722]
Simultaneous translation involves translating a sentence before the speaker's utterance is completed in order to realize real-time understanding.
This task is significantly more challenging than the general full sentence translation because of the shortage of input information during decoding.
We propose multimodal simultaneous neural machine translation (MSNMT), which leverages visual information as an additional modality.
arXiv Detail & Related papers (2020-04-07T08:02:21Z)
- InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [76.32065400614162]
We propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6.
The model has a strong capability for modeling the interaction between the information flows of different modalities.
We propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT, the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.