GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation
- URL: http://arxiv.org/abs/2507.18562v1
- Date: Thu, 24 Jul 2025 16:36:47 GMT
- Title: GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation
- Authors: Jiafeng Xiong, Yuting Zhao
- Abstract summary: We construct novel multimodal scene graphs to preserve and integrate modality-specific information. We introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework. Results on the WMT benchmark show significant improvements over the image-free translation baselines.
- Score: 0.9208007322096533
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal Machine Translation (MMT) has demonstrated the significant help of visual information in machine translation. However, existing MMT methods face challenges in leveraging the modality gap by enforcing rigid visual-linguistic alignment whilst being confined to inference within their trained multimodal domains. In this work, we construct novel multimodal scene graphs to preserve and integrate modality-specific information and introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework that uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalize it to broader image-free translation domains. Experimental results on the Multi30K dataset of English-to-French and English-to-German tasks demonstrate that our GIIFT surpasses existing approaches and achieves the state-of-the-art, even without images during inference. Results on the WMT benchmark show significant improvements over the image-free translation baselines, demonstrating the strength of GIIFT towards inductive image-free inference.
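The abstract describes the central component (a cross-modal Graph Attention Network adapter over fused multimodal scene graphs) only at a high level. The following is a minimal, hypothetical PyTorch sketch of what such an adapter might look like; the node layout, dimensions, and single-head attention update are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a cross-modal graph-attention adapter in the spirit of GIIFT.
# Node layout, dimensions, and how the fused nodes feed the translator are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionAdapter(nn.Module):
    """Single-head graph attention over a fused multimodal scene graph.

    Nodes may come from the language scene graph, the visual scene graph, or both;
    at image-free inference time the visual nodes are simply absent and the same
    weights operate on the language-only graph.
    """

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.score = nn.Linear(2 * d_model, 1, bias=False)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (N, d) node features; adj: (N, N) 0/1 adjacency with self-loops.
        h = self.proj(nodes)
        n = h.size(0)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)], dim=-1
        )                                                     # (N, N, 2d) node pairs
        e = F.leaky_relu(self.score(pairs).squeeze(-1), 0.2)  # raw attention scores
        e = e.masked_fill(adj == 0, float("-inf"))            # attend only to graph neighbours
        alpha = F.softmax(e, dim=-1)
        return F.elu(alpha @ h)                               # updated node representations


# Toy usage: 4 language-graph nodes + 2 visual-graph nodes in one fused graph.
adapter = GraphAttentionAdapter(d_model=512)
nodes = torch.randn(6, 512)
adj = torch.eye(6)
adj[0, 4] = adj[4, 0] = 1.0   # e.g. text node "dog" <-> detected dog region
adj[1, 5] = adj[5, 1] = 1.0
fused = adapter(nodes, adj)    # (6, 512); could be pooled into a prefix for the MT model
```

In the image-free setting the abstract targets, only the language-scene-graph nodes and their adjacency would be passed through the same adapter weights.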
Related papers
- Dual-branch Prompting for Multimodal Machine Translation [9.903997553625253]
We propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model. Experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.
arXiv Detail & Related papers (2025-07-23T15:22:51Z) - Multimodal Machine Translation with Visual Scene Graph Pruning [31.85382347738067]
Multimodal machine translation (MMT) seeks to address the challenges posed by linguistic polysemy and ambiguity in translation tasks by incorporating visual information. We introduce a novel approach, multimodal machine translation with visual Scene Graph Pruning (PSG). PSG leverages language scene graph information to guide the pruning of redundant nodes in visual scene graphs, thereby reducing noise in downstream translation tasks.
arXiv Detail & Related papers (2025-05-26T04:35:03Z) - Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation [40.42326040668964]
We introduce a stable diffusion-based imagination network into a multimodal large language model (MLLM) to explicitly generate an image for each source sentence. We incorporate human feedback via reinforcement learning to ensure the consistency of the generated image with the source sentence. Experimental results show that our model significantly outperforms existing multimodal MT and text-only MT models.
arXiv Detail & Related papers (2024-12-17T07:41:23Z) - TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages [92.86083489187403]
The Tri-Modal Translation (TMT) model translates between arbitrary modalities spanning speech, image, and text. We tokenize speech and image data into discrete tokens, which provide a unified interface across modalities. TMT consistently outperforms single-model counterparts.
arXiv Detail & Related papers (2024-02-25T07:46:57Z) - Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by significant BLEU margins under this task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z) - Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z) - Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation [72.6667341525552]
We present a new MMT approach based on a strong text-only MT model, which uses neural adapters and a novel guided self-attention mechanism.
We also introduce CoMMuTE, a Contrastive Multimodal Translation Evaluation set of ambiguous sentences and their possible translations.
Our approach obtains competitive results compared to strong text-only models on standard English-to-French, English-to-German and English-to-Czech benchmarks.
arXiv Detail & Related papers (2022-12-20T10:18:18Z) - MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z) - A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
arXiv Detail & Related papers (2020-07-17T04:06:09Z)
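The last entry above outlines its encoder only in broad strokes. Below is a minimal, hypothetical sketch of a unified multi-modal graph with stacked fusion layers in that spirit; the edge rules, feature sources, and mean-aggregation update are assumptions for illustration, not that paper's method.

```python
# Hypothetical sketch of a "unified multi-modal graph": word nodes and image-region
# nodes in one graph, with intra-modal and inter-modal edges. All construction rules
# here are illustrative assumptions.
from typing import List, Tuple

import torch


def build_unified_graph(
    word_feats: torch.Tensor,
    region_feats: torch.Tensor,
    alignments: List[Tuple[int, int]],
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Return (node_feats, adjacency) for one sentence-image pair.

    word_feats:   (W, d) token embeddings
    region_feats: (R, d) visual region features (e.g. from an object detector)
    alignments:   (word_idx, region_idx) pairs linking a word to a region
    """
    nodes = torch.cat([word_feats, region_feats], dim=0)   # (W + R, d)
    w, r = word_feats.size(0), region_feats.size(0)
    adj = torch.eye(w + r)                                  # self-loops
    adj[:w, :w] = 1.0                                       # intra-modal edges among words
    adj[w:, w:] = 1.0                                       # intra-modal edges among regions
    for wi, ri in alignments:                               # inter-modal word-region edges
        adj[wi, w + ri] = adj[w + ri, wi] = 1.0
    return nodes, adj


def fusion_layer(nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    """One very simple fusion step: average each node with its graph neighbours."""
    deg = adj.sum(dim=-1, keepdim=True)                     # (N, 1) neighbour counts
    return (adj @ nodes) / deg                              # mean aggregation


# Toy usage: 5 words, 3 regions, two word-region alignments, two stacked fusion layers.
nodes, adj = build_unified_graph(torch.randn(5, 256), torch.randn(3, 256), [(0, 1), (3, 2)])
for _ in range(2):
    nodes = fusion_layer(nodes, adj)
```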
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.