OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion
- URL: http://arxiv.org/abs/2512.00234v1
- Date: Fri, 28 Nov 2025 22:39:12 GMT
- Title: OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion
- Authors: Sai Koneru, Matthias Huck, Jan Niehues
- Abstract summary: We propose an end-to-end approach to build an effective multimodal translation system. We introduce a novel fusion strategy that connects hidden states from multiple layers of a pretrained MMFM to a translation LLM. The resulting model, OmniFusion, can perform speech-to-text, speech-and-image-to-text, and text-and-image-to-text translation.
- Score: 14.856747950038553
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There has been significant progress in open-source text-only translation large language models (LLMs) with better language coverage and quality. However, these models can only be used in cascaded pipelines for speech translation (ST), performing automatic speech recognition first, followed by translation. This introduces additional latency, which is particularly critical in simultaneous ST (SimulST), and prevents the model from exploiting multimodal context, such as images, which can aid disambiguation. Pretrained multimodal foundation models (MMFMs) already possess strong perception and reasoning capabilities across multiple modalities, but generally lack the multilingual coverage and specialized translation performance of dedicated translation LLMs. To build an effective multimodal translation system, we propose an end-to-end approach that fuses MMFMs with translation LLMs. We introduce a novel fusion strategy that connects hidden states from multiple layers of a pretrained MMFM to a translation LLM, enabling joint end-to-end training. The resulting model, OmniFusion, built on Omni 2.5-7B as the MMFM and SeedX PPO-7B as the translation LLM, can perform speech-to-text, speech-and-image-to-text, and text-and-image-to-text translation. Experiments demonstrate that OmniFusion effectively leverages both audio and visual inputs, achieves a 1-second latency reduction in SimulST compared to cascaded pipelines, and improves overall translation quality. Code is available at https://github.com/saikoneru/OmniFusion
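As a rough illustration of the fusion strategy the abstract describes, here is a minimal PyTorch sketch of connecting hidden states from several MMFM layers to a translation LLM's embedding space. The module name, the tapped layer indices, the per-layer linear projections, the softmax-weighted mixing, and the dimensions in the usage line are all illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Project hidden states from several MMFM layers into the translation
    LLM's embedding space and mix them with learned weights (illustrative)."""

    def __init__(self, mmfm_dim: int, llm_dim: int, fusion_layers: list[int]):
        super().__init__()
        self.fusion_layers = fusion_layers
        # Assumption: one learned projection per tapped MMFM layer.
        self.projections = nn.ModuleList(
            nn.Linear(mmfm_dim, llm_dim) for _ in fusion_layers
        )
        # Assumption: scalar softmax weights decide each layer's contribution.
        self.layer_weights = nn.Parameter(torch.zeros(len(fusion_layers)))

    def forward(self, hidden_states: tuple[torch.Tensor, ...]) -> torch.Tensor:
        # hidden_states: one (batch, seq, mmfm_dim) tensor per MMFM layer,
        # e.g. from a HuggingFace model called with output_hidden_states=True.
        projected = torch.stack(
            [proj(hidden_states[i])
             for proj, i in zip(self.projections, self.fusion_layers)]
        )  # (num_tapped_layers, batch, seq, llm_dim)
        weights = torch.softmax(self.layer_weights, dim=0)
        return torch.einsum("l,lbsd->bsd", weights, projected)

# Hypothetical dimensions and layer indices. The fused states would be fed to
# the translation LLM as soft prefix embeddings (inputs_embeds) and the whole
# stack trained end-to-end on target-language cross-entropy.
fusion = MultiLayerFusion(mmfm_dim=3584, llm_dim=4096, fusion_layers=[8, 16, 24])
```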
Related papers
- Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion [42.60008616386837]
The Speech-guided Machine Translation (SMT) framework integrates speech and text as fused inputs into an MLLM to improve translation quality. Core components of this framework include a text-to-speech model, responsible for generating synthetic speech, and an MLLM capable of classifying synthetic speech samples.
arXiv Detail & Related papers (2026-02-25T07:19:34Z)
- End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs [0.3867363075280544]
Speech Translation (ST) is a machine translation task that involves converting speech signals from one language to the corresponding text in another language. This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing both Automatic Speech Recognition (ASR) and ST simultaneously.
arXiv Detail & Related papers (2025-10-11T20:10:30Z)
- LMFusion: Adapting Pretrained Language Models for Multimodal Generation [81.78257799283777]
We present LMFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities. Compared to methods that pretrain multimodal generative models from scratch, our experiments demonstrate that LMFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs.
arXiv Detail & Related papers (2024-12-19T18:56:24Z)
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.56746545958522]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities. We build a multimodal text-centric dataset for multimodal alignment pre-training. We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z)
- Speech Translation with Large Language Models: An Industrial Practice [64.5419534101104]
We introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained large language model (LLM).
By integrating the LLM with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations.
Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST.
arXiv Detail & Related papers (2023-12-21T05:32:49Z)
- OneLLM: One Framework to Align All Modalities with Language [86.8818857465443]
We present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning.
arXiv Detail & Related papers (2023-12-06T18:59:19Z)
- Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing [72.56219471145232]
We propose an ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
arXiv Detail & Related papers (2023-09-27T17:48:14Z)
- CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation [31.911593690549633]
Multimodal machine translation (MMT) systems enhance neural machine translation (NMT) with visual knowledge.
Previous works face a challenge in training powerful MMT models from scratch due to the scarcity of annotated multilingual vision-language data.
We propose CLIPTrans, which simply adapts the independently pre-trained multimodal M-CLIP and the multilingual mBART.
arXiv Detail & Related papers (2023-08-29T11:29:43Z)