Fugu-MT Paper Translation (Abstract): Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

Paper abstract: Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2605.16409v1
Date: Wed, 13 May 2026 14:16:39 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:46.301993
Title: Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models
Authors: Qinwu Xu, Xin Liu, Yifan Jiang, Haoyu Ren
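Abstract summary: Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs). We present an OCR-aware multilingual multimodal training framework that combines large-scale synthetic OCR-to-translation data generation, OCR-aware supervised fine-tuning, and structured visual chain-of-thought prompting. Using a LLaMA-based multimodal architecture, the framework substantially improves OCR completeness, multilingual translation accuracy, and robustness under degraded visual conditions.
Reference score (independently computed attention score): 7.833222732846266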
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion, and complex typography. We present an OCR-aware multilingual multimodal training framework that combines (i) large-scale synthetic OCR-to-translation data generation, (ii) OCR-aware supervised fine-tuning (SFT) with LoRA adaptation, and (iii) structured visual chain-of-thought (CoT) prompting for reasoning under uncertain visual conditions. Using a LLaMA-based multimodal architecture, the proposed framework substantially improves OCR completeness, multilingual translation accuracy, and robustness under degraded visual conditions. Experimental results on multilingual receipts, menus, posters, signs, handwritten text, and document images demonstrate significantly improved visual-text grounding compared with the baseline model. In particular, the proposed OCR-aware post-training framework improves extraction of small, blurred, spatially scattered, and partially occluded text while reducing reliance on language priors under uncertain OCR conditions. Qualitative comparisons with frontier multimodal systems, including GPT-5-class and Gemini-family models, further suggest improved OCR grounding and reduced hallucination under noisy and visually ambiguous OCR scenarios. Overall, the results indicate that data-centric OCR-aware multimodal post-training provides an effective and scalable direction for improving multilingual OCR and OCR-based visual question answering systems.
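The abstract names LoRA adaptation as part of the supervised fine-tuning stage but gives no configuration. As a rough illustration only, the sketch below shows the standard LoRA idea: a frozen weight W receives a low-rank update scaled by alpha / r. The matrix sizes, the values in `B_trained`, and the helper names `matmul` and `lora_weight` are assumptions made for this example; none of it is taken from the paper or its code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, the effective weight after a LoRA update."""
    r = len(A)              # LoRA rank = number of rows of A
    scale = alpha / r
    delta = matmul(B, A)    # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy shapes chosen only for illustration: d_out = 2, d_in = 3, r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen pretrained weight
A = [[1.0, 2.0, 3.0]]                   # trainable down-projection, r x d_in
B_init = [[0.0], [0.0]]                 # trainable up-projection, zero at start
B_trained = [[0.5], [-0.5]]             # hypothetical values after some training

W_start = lora_weight(W, A, B_init, alpha=2.0)
W_after = lora_weight(W, A, B_trained, alpha=2.0)
```

With B initialised to zero, the adapted weight equals the pretrained weight at the start of training, so fine-tuning begins from the base model's behaviour. Only A and B are updated, which here means 5 trainable values instead of the 6 in W; the saving grows quickly at realistic layer sizes.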

Related papers