Fugu-MT Paper Translation (Abstract): UniTranslator: A Unified Multi-modal Framework for End-to-end In-Image Machine Translation

Paper overview: UniTranslator: A Unified Multi-modal Framework for End-to-end In-Image Machine Translation

arxiv url: http://arxiv.org/abs/2606.24333v1
Date: Tue, 23 Jun 2026 09:11:24 GMT
Status: Translation complete
System update date: 2026-06-24 22:16:48.857332
Title: UniTranslator: A Unified Multi-modal Framework for End-to-end In-Image Machine Translation
Title (reference translation): UniTranslator: A Unified Multimodal Framework for End-to-End In-Image Machine Translation
Authors: Jiahao Lyu, Pei Fu, Zhenhang Li, Shaojie Zhang, Jiahui Yang, Yu Zhou, Can Ma, Zhenbo Luo, Jian Luan
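Abstract summary: In-Image Machine Translation (IIMT) aims to translate scene text in an image and render the translated text back into the original regions while preserving the overall visual appearance. Recent unified multimodal models provide a promising solution by combining visual-text understanding and image generation within a single framework. We present UniTranslator, a unified multimodal framework for IIMT that tightly couples translation understanding and text editing.
Reference score (independently computed attention metric): 23.787128107000374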
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: In-Image Machine Translation (IIMT) aims to translate scene text in an image and render the translated text back into the original regions while preserving the overall visual appearance. Recent unified multimodal models provide a promising solution by combining visual-text understanding and image generation within a single framework. However, directly adapting such models to IIMT remains challenging. In particular, they often suffer from understanding-generation conflicts, where the translation inferred during understanding is inconsistent with the text supervision used in generation, and spatial position misalignment, where the rendered text does not accurately match the target text regions. To address these issues, we present UniTranslator, a unified multimodal framework for IIMT that tightly couples translation understanding and text editing. Specifically, we introduce an Understand-Generation Alignment Module (UGAM) to bridge the representation gap between understanding and generation, encouraging semantic consistency between translated content prediction and text rendering. We further propose a Spatial Mask Decoder (SMD) with pixel-level supervision over text regions to improve spatial grounding, geometric alignment, and layout controllability during generation. Extensive experiments on multiple benchmarks demonstrate that UniTranslator achieves state-of-the-art performance across diverse language directions and complex real-world layouts. Moreover, our results reveal a strong mutual reinforcement effect between translation understanding and image generation, highlighting the advantage of unified translation multimodal learning. Code is available at https://github.com/SeerRay-Lab/Unitranslator.
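Abstract (reference translation): In-Image Machine Translation (IIMT) aims to translate the scene text in an image and render the translated text back into the original regions while preserving the overall visual appearance. Recent unified multimodal models offer a promising solution by combining visual-text understanding and image generation within a single framework. However, directly applying such models to IIMT remains difficult. In particular, they often suffer from understanding-generation conflicts, in which the translation inferred during understanding is inconsistent with the text supervision used during generation, and from spatial position misalignment, in which the rendered text does not accurately match the target text regions. To address these problems, we present UniTranslator, a unified multimodal framework for IIMT that tightly couples translation understanding and text editing. Specifically, we introduce the Understand-Generation Alignment Module (UGAM) to bridge the representation gap between understanding and generation and to promote semantic consistency between translated-content prediction and text rendering. We further propose the Spatial Mask Decoder (SMD), which applies pixel-level supervision over text regions to improve spatial grounding, geometric alignment, and layout controllability during generation. Extensive experiments on multiple benchmarks show that UniTranslator achieves state-of-the-art performance across diverse language directions and complex real-world layouts. Moreover, the results reveal a strong mutual reinforcement effect between translation understanding and image generation, highlighting the advantage of unified multimodal learning for translation. Code is available at https://github.com/SeerRay-Lab/Unitranslator.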

Related papers