Fugu-MT Paper Translation (Abstract): VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation

Paper overview: VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation

arxiv url: http://arxiv.org/abs/2605.24675v1
Date: Sat, 23 May 2026 17:25:45 GMT
Status: Translation complete
System update date: 2026-05-26 19:50:18.307394
Title: VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation
Title (reference translation): VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation
Authors: Bo Li, Ronghao Chen, Ningyuan Deng, Huacan Wang, Shaolin Zhu, Lijie Wen
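Abstract summary: VaaWIT is an end-to-end framework that adapts Large Language Models for multilingual Web image translation. A Dual-Stream Attention Module (DSAM) facilitates bidirectional interaction between multilingual semantic features and detailed visual representations. A Visual-Aware Adapter (VAA) dynamically injects these fused visual cues into the frozen LLM backbone.
Reference score (independently computed interest level): 18.312531006938162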
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularly within social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains challenging due to the visual representation gap: standard encoders often prioritize high-level semantics over the fine-grained visual details required for recognizing diverse character morphologies. To address this challenge, we propose VaaWIT, an end-to-end framework that adapts Large Language Models for multilingual Web image translation. The framework introduces two key technical contributions: (1) a Dual-Stream Attention Module (DSAM), which facilitates bidirectional interaction between multilingual semantic features and detailed visual representations, thereby synthesizing unified features robust to textual variations; and (2) a Visual-Aware Adapter (VAA), a parameter-efficient fine-tuning strategy that dynamically injects these fused visual cues into the frozen LLM backbone. This design enables the model to align the visual context with linguistic reasoning effectively while minimizing computational costs. Extensive experiments on eight tasks on three public benchmarks demonstrate that VaaWIT significantly outperforms state-of-the-art (SOTA) open-source baselines and achieves competitive performance against proprietary models. These results validate the efficacy of integrating fine-grained visual perception into LLMs for complex Web content analysis.
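Abstract (reference translation): Translating text embedded in Web images is essential for improving content accessibility and cross-lingual information retrieval, particularly in the social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains difficult because of the visual representation gap. To address this challenge, we propose VaaWIT, an end-to-end framework that adapts Large Language Models for multilingual Web image translation. The framework introduces two technical contributions: (1) a Dual-Stream Attention Module (DSAM), which facilitates bidirectional interaction between multilingual semantic features and detailed visual representations; and (2) a Visual-Aware Adapter (VAA), a parameter-efficient fine-tuning strategy. This design allows the model to align visual context with linguistic reasoning effectively while minimizing computational cost. Extensive experiments on eight tasks across three public benchmarks show that VaaWIT significantly outperforms state-of-the-art (SOTA) open-source baselines and achieves competitive performance against proprietary models. These results validate the efficacy of integrating fine-grained visual perception into LLMs for complex Web content analysis.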
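The abstract names two mechanisms, bidirectional interaction between text and visual features (DSAM) and injection of fused visual cues into a frozen backbone (VAA), without giving equations. The following is a minimal illustrative sketch of those two ideas in plain Python, not the authors' implementation. Everything in it is an assumption made for illustration: the function names `cross_attend`, `dual_stream_fuse`, and `inject_visual_cue`, the use of unprojected scaled dot-product attention with residual connections, mean pooling of the visual stream, and a fixed scalar `gate` standing in for whatever dynamic injection the paper actually uses.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention without learned projections (assumption).

    Each query vector attends over `keys_values`, which serve as both
    keys and values. Returns one output vector per query.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(d)]
        )
    return outputs


def dual_stream_fuse(text_feats, visual_feats):
    """Hypothetical bidirectional interaction between the two streams.

    Text attends to vision and vision attends to text; each result is
    added back to its own stream as a residual.
    """
    text_ctx = cross_attend(text_feats, visual_feats)
    vis_ctx = cross_attend(visual_feats, text_feats)
    text_out = [[t + c for t, c in zip(tv, cv)] for tv, cv in zip(text_feats, text_ctx)]
    vis_out = [[v + c for v, c in zip(vv, cv)] for vv, cv in zip(visual_feats, vis_ctx)]
    return text_out, vis_out


def inject_visual_cue(hidden, fused_visual, gate):
    """Hypothetical adapter step on top of a frozen layer's hidden states.

    Mean-pools the fused visual features and adds them, scaled by a
    scalar gate, to every hidden state. The hidden states themselves
    come from the frozen backbone and are not modified in place.
    """
    d = len(hidden[0])
    pooled = [sum(v[i] for v in fused_visual) / len(fused_visual) for i in range(d)]
    return [[h[i] + gate * pooled[i] for i in range(d)] for h in hidden]


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]                  # two toy text tokens
    visual = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]   # three toy image patches
    fused_text, fused_visual = dual_stream_fuse(text, visual)
    print(inject_visual_cue(fused_text, fused_visual, gate=0.1))
```

With `gate=0.0` the hidden states pass through unchanged, which mirrors the usual motivation for adapter-style tuning: the frozen backbone's behavior is preserved and only the small added path is trained. In a real system the gate and the attention projections would be learned parameters rather than the fixed values used here.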

Related papers