Improving MLLM's Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency
- URL: http://arxiv.org/abs/2507.08309v1
- Date: Fri, 11 Jul 2025 05:02:06 GMT
- Title: Improving MLLM's Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency
- Authors: Yupu Liang, Yaping Zhang, Zhiyang Zhang, Zhiyuan Chen, Yang Zhao, Lu Xiang, Chengqing Zong, Yu Zhou
- Abstract summary: Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks. However, they struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. We introduce a novel fine-tuning paradigm, named Synchronously Self-Reviewing (SSR) of the model's OCR proficiency, inspired by the concept of the "Bilingual Cognitive Advantage".
- Score: 31.095908827004695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks, especially Optical Character Recognition (OCR). However, they struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. Previous efforts to enhance DIMT capability through Supervised Fine-Tuning (SFT) on DIMT datasets often cause the model to forget its existing monolingual abilities, such as OCR. To address these challenges, we introduce a novel fine-tuning paradigm, named Synchronously Self-Reviewing (SSR) of the model's OCR proficiency, inspired by the concept of the "Bilingual Cognitive Advantage". Specifically, SSR prompts the model to generate OCR text before producing translation text, which allows the model to leverage its strong monolingual OCR ability while learning to translate text across languages. Comprehensive experiments demonstrate that the proposed SSR learning helps mitigate catastrophic forgetting, improving the generalization ability of MLLMs on both OCR and DIMT tasks.
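The core recipe in the abstract, generating OCR text before the translation, maps naturally onto a chat-style fine-tuning sample. Below is a minimal Python sketch of how such an SSR-style example might be assembled; the prompt wording and the OCR/Translation layout are illustrative assumptions, not the authors' exact format.

```python
# Illustrative sketch of an SSR-style fine-tuning sample: the target
# sequence asks the model to first transcribe the document image (OCR),
# then translate the transcription. Prompt wording and field names are
# assumptions, not the authors' exact format.

def build_ssr_example(image_path: str, ocr_text: str, translation: str) -> dict:
    """Pack one (image, OCR, translation) triple into a chat-style sample."""
    return {
        "image": image_path,
        "conversation": [
            {
                "role": "user",
                "content": "<image>\nRecognize the text in this document "
                           "image, then translate it into Chinese.",
            },
            {
                "role": "assistant",
                # The OCR transcription precedes the translation, so the loss
                # on source-language text keeps the model's monolingual OCR
                # ability "in review" while it learns to translate.
                "content": f"OCR: {ocr_text}\nTranslation: {translation}",
            },
        ],
    }

sample = build_ssr_example("doc_001.png", "Hello, world.", "你好，世界。")
print(sample["conversation"][1]["content"])
```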
Related papers
- Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues? [73.35232225256968]
Reasoning-OCR challenges LMMs to solve complex reasoning problems based on the cues that can be extracted from rich visual text. Our evaluation offers some insights for proprietary and open-source LMMs in different reasoning challenges.
arXiv Detail & Related papers (2025-05-19T06:45:18Z)
- VISTA-OCR: Towards generative and interactive end to end OCR models [3.7548609506798494]
VISTA-OCR is a lightweight architecture that unifies text detection and recognition within a single generative model. Built on an encoder-decoder architecture, VISTA-OCR is progressively trained, starting with the visual feature extraction phase. To enhance the model's capabilities, we built a new dataset composed of real-world examples enriched with bounding box annotations and synthetic samples.
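As a rough illustration of the unified generative design, here is a toy PyTorch encoder-decoder that emits text (and, in principle, location) tokens from a single vocabulary; all layer sizes and the patch-embedding stand-in are assumptions, not VISTA-OCR's actual architecture.

```python
# Toy sketch of a unified generative OCR model in the spirit of VISTA-OCR:
# one encoder-decoder emits text tokens and coordinate tokens from the
# same vocabulary. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class TinyGenerativeOCR(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        # Visual encoder stand-in: patchify the image into d_model features.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)  # text + location tokens

    def forward(self, images, token_ids):
        mem = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D)
        tgt = self.token_embed(token_ids)
        return self.head(self.decoder(tgt, mem))  # next-token logits

logits = TinyGenerativeOCR()(torch.randn(1, 3, 224, 224),
                             torch.randint(0, 1000, (1, 12)))
print(logits.shape)  # torch.Size([1, 12, 1000])
```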
arXiv Detail & Related papers (2025-04-04T17:39:53Z)
- Ocean-OCR: Towards General OCR Application via a Vision-Language Model [6.70908296002235]
We present Ocean-OCR, a 3B MLLM with state-of-the-art performance on various OCR scenarios and comparable understanding ability on general tasks. We demonstrate the superiority of Ocean-OCR through comprehensive experiments on open-source OCR benchmarks and across various OCR scenarios.
arXiv Detail & Related papers (2025-01-26T15:20:39Z)
- CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy [50.78228433498211]
CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and released for the first time. We evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition.
arXiv Detail & Related papers (2024-12-03T07:03:25Z)
- SMILE: Speech Meta In-Context Learning for Low-Resource Language Automatic Speech Recognition [55.2480439325792]
Speech Meta In-Context LEarning (SMILE) is an innovative framework that combines meta-learning with speech in-context learning (SICL). We show that SMILE consistently outperforms baseline methods in training-free few-shot multilingual ASR tasks.
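For intuition, a speech in-context-learning prompt of the kind SICL builds can be sketched as a few (audio, transcript) exemplars followed by the query utterance; the message layout below is a guessed simplification, not the paper's actual format.

```python
# Hedged sketch of assembling a speech in-context-learning prompt in the
# spirit of SICL/SMILE: a few (audio, transcript) exemplars precede the
# query utterance so a speech LM can transcribe it without any training.
# The layout is an assumption, not the paper's exact prompt format.

def build_sicl_prompt(exemplars: list[tuple[str, str]], query_audio: str) -> list[dict]:
    """exemplars: (audio_path, transcript) pairs in the target language."""
    parts = [{"audio": audio, "text": transcript}
             for audio, transcript in exemplars]
    parts.append({"audio": query_audio, "text": None})  # model fills this in
    return parts

prompt = build_sicl_prompt(
    [("utt1.wav", "hello there"), ("utt2.wav", "good morning")],
    "query.wav",
)
print(prompt)
```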
arXiv Detail & Related papers (2024-09-16T16:04:16Z)
- EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample-efficient to train, as the model only needs to learn characters' visual appearance, not how they are used in sequence to form language.
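The character-appearance idea can be illustrated with a toy nearest-neighbor recognizer: embed a character crop and match it against a gallery of exemplars, with no sequence model involved. The embed() function below is a hypothetical stand-in for a trained vision encoder; the real EffOCR package exposes a different API.

```python
# Conceptual sketch of the character-level idea behind EffOCR:
# recognition is nearest-neighbor matching of a character crop's visual
# embedding against a reference gallery, so no language modeling of
# token sequences is needed. embed() is a hypothetical stand-in for a
# trained vision encoder.
import numpy as np

def embed(char_crop: np.ndarray) -> np.ndarray:
    """Hypothetical encoder: flatten and L2-normalize a character crop."""
    v = char_crop.astype(np.float32).ravel()
    return v / (np.linalg.norm(v) + 1e-8)

# Reference gallery: one exemplar crop per character class.
gallery = {c: np.random.rand(32, 32) for c in "abc"}
gallery_vecs = {c: embed(img) for c, img in gallery.items()}

def recognize(crop: np.ndarray) -> str:
    """Return the gallery character whose embedding is most similar."""
    v = embed(crop)
    return max(gallery_vecs, key=lambda c: float(v @ gallery_vecs[c]))

print(recognize(gallery["b"] + 0.01 * np.random.rand(32, 32)))  # likely 'b'
```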
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
- Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the Llama2-7b-chat model on nine different languages from the MuST-C dataset.
The results show that the LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
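One simple way to drive an LLM in a simultaneous setting is to re-translate a growing source prefix under a wait-k policy, sketched below; translate_prefix() is a hypothetical stand-in for a call to a model such as Llama2-7b-chat, and this is not the paper's exact inference scheme.

```python
# Minimal wait-k-style simultaneous translation loop based on
# re-translating a growing source prefix. translate_prefix() is a
# hypothetical stand-in for an LLM call; the paper's actual prompting
# scheme differs.

def translate_prefix(source_prefix: str) -> str:
    """Hypothetical LLM call: translate a (possibly partial) source prefix."""
    return f"<translation of: {source_prefix!r}>"

def simulmt(source_tokens: list[str], k: int = 3) -> list[str]:
    """Emit an updated translation after each new source token, once k
    tokens of lookahead have arrived (the classic wait-k policy)."""
    outputs = []
    for t in range(k, len(source_tokens) + 1):
        prefix = " ".join(source_tokens[:t])
        outputs.append(translate_prefix(prefix))  # re-translate grown prefix
    return outputs

for step in simulmt("this is a streaming source sentence".split(), k=3):
    print(step)
```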
arXiv Detail & Related papers (2023-09-13T04:06:47Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT-4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- TransDocs: Optical Character Recognition with word to word translation [2.2336243882030025]
This work focuses on improving optical character recognition (OCR) with ML techniques.
It is based on the ANKI dataset for English-to-Spanish translation.
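A word-to-word pipeline of this kind reduces to OCR followed by lexicon lookup, as in the sketch below; run_ocr() and the tiny English-Spanish lexicon are placeholders, with the real lexicon presumably derived from the ANKI parallel data.

```python
# Hedged sketch of an OCR + word-to-word translation pipeline of the
# kind TransDocs describes. run_ocr() is a hypothetical stand-in for an
# OCR engine, and the lexicon here is a toy placeholder.

def run_ocr(image_path: str) -> str:
    """Hypothetical OCR call returning recognized English text."""
    return "the cat drinks water"

EN_ES = {"the": "el", "cat": "gato", "drinks": "bebe", "water": "agua"}

def translate_word_by_word(text: str, lexicon: dict[str, str]) -> str:
    # Unknown words pass through unchanged; no reordering is attempted,
    # which is the core limitation of word-to-word translation.
    return " ".join(lexicon.get(w, w) for w in text.lower().split())

print(translate_word_by_word(run_ocr("doc.png"), EN_ES))
# -> "el gato bebe agua"
```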
arXiv Detail & Related papers (2023-04-15T21:40:14Z)
- Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation [72.6667341525552]
We present a new MMT approach based on a strong text-only MT model, which uses neural adapters and a novel guided self-attention mechanism.
We also introduce CoMMuTE, a Contrastive Multimodal Translation Evaluation set of ambiguous sentences and their possible translations.
Our approach obtains competitive results compared to strong text-only models on standard English-to-French, English-to-German and English-to-Czech benchmarks.
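For a rough sense of the adapter idea, the toy module below injects a pooled image feature into a text MT model's hidden states through a gated bottleneck; the dimensions and the gated-addition fusion rule are assumptions, not the authors' guided self-attention design.

```python
# Toy bottleneck adapter that injects an image feature into a text-only
# MT model's hidden states, loosely in the spirit of adapter-based MMT.
# Dimensions and the gated-addition fusion are illustrative assumptions.
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    def __init__(self, d_text=512, d_image=768, d_bottleneck=64):
        super().__init__()
        self.proj = nn.Linear(d_image, d_text)       # map image -> text space
        self.down = nn.Linear(d_text, d_bottleneck)  # bottleneck down-project
        self.up = nn.Linear(d_bottleneck, d_text)    # and back up
        self.gate = nn.Parameter(torch.zeros(1))     # starts as a no-op

    def forward(self, text_states, image_feat):
        # Broadcast one pooled image vector over all token positions.
        fused = text_states + self.proj(image_feat).unsqueeze(1)
        return text_states + torch.tanh(self.gate) * self.up(
            torch.relu(self.down(fused)))

out = VisualAdapter()(torch.randn(2, 10, 512), torch.randn(2, 768))
print(out.shape)  # torch.Size([2, 10, 512])
```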
arXiv Detail & Related papers (2022-12-20T10:18:18Z)