MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts
- URL: http://arxiv.org/abs/2502.16781v1
- Date: Mon, 24 Feb 2025 02:16:37 GMT
- Title: MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts
- Authors: Bhawna Piryani, Jamshid Mozafari, Abdelrahman Abdallah, Antoine Doucet, Adam Jatowt
- Abstract summary: We introduce MultiOCR-QA, a multilingual QA dataset designed to analyze the effects of OCR noise on QA systems' performance. The MultiOCR-QA dataset comprises 60K question-answer pairs covering three languages: English, French, and German. Our findings show that QA systems are highly prone to OCR-induced errors and exhibit performance degradation on noisy OCR text.
- Score: 17.20084584886653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors -- imperfect extraction of the text, including character insertion, deletion, and permutation -- can significantly impact downstream tasks like question answering (QA). In this work, we introduce MultiOCR-QA, a multilingual QA dataset designed to analyze the effects of OCR noise on QA systems' performance. The MultiOCR-QA dataset comprises 60K question-answer pairs covering three languages: English, French, and German. The dataset is curated from OCR-processed historical documents, allowing for the evaluation of OCR-induced challenges in question answering. We evaluate MultiOCR-QA on various levels and types of OCR errors to assess the robustness of LLMs in handling real-world digitization errors. Our findings show that QA systems are highly prone to OCR-induced errors and exhibit performance degradation on noisy OCR text.
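As a rough, hypothetical illustration of the evaluation setup the abstract describes (not the authors' actual dataset construction), the Python sketch below injects character-level insertions, deletions, and substitutions (the substitutions standing in for the permutation errors the abstract mentions) into clean text at a configurable rate, and measures the resulting noise level as character error rate (CER) via Levenshtein distance. The function names, noise rate, and alphabet are all assumptions made for demonstration.

```python
# Illustrative sketch only (not the paper's code): function names,
# the noise rate, and the alphabet are hypothetical choices.
import random


def inject_ocr_noise(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Corrupt `text` with character-level insertions, deletions, and
    substitutions at roughly `rate` errors per character, mimicking
    the OCR error types described in the abstract."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        if rng.random() >= rate:
            out.append(ch)
            continue
        op = rng.choice(("insert", "delete", "substitute"))
        if op == "insert":          # keep the character, add a spurious one
            out.append(ch)
            out.append(rng.choice(alphabet))
        elif op == "substitute":    # replace with a random character
            out.append(rng.choice(alphabet))
        # "delete": emit nothing
    return "".join(out)


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance between the strings, normalized by the
    reference length (the usual CER definition)."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution
        prev = cur
    return prev[n] / max(m, 1)


if __name__ == "__main__":
    clean = "Optical Character Recognition digitizes historical documents."
    noisy = inject_ocr_noise(clean, rate=0.10)
    print(noisy)
    print(f"CER vs. clean text: {character_error_rate(clean, noisy):.3f}")
```

Comparing a QA model's accuracy on clean passages against passages perturbed at increasing CER levels yields a simple robustness curve of the kind the paper reports.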
Related papers
- Ocean-OCR: Towards General OCR Application via a Vision-Language Model [6.70908296002235]
We present Ocean-OCR, a 3B MLLM with state-of-the-art performance on various OCR scenarios and comparable understanding ability on general tasks. We demonstrate the superiority of Ocean-OCR through comprehensive experiments on open-source OCR benchmarks and across various OCR scenarios.
arXiv Detail & Related papers (2025-01-26T15:20:39Z)
- CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy [50.78228433498211]
CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and released for the first time. We evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition.
arXiv Detail & Related papers (2024-12-03T07:03:25Z)
- AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z)
- CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models [0.0]
This paper introduces Context Leveraging OCR Correction (CLOCR-C), which uses the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality.
The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing socio-cultural context as part of the correction process.
arXiv Detail & Related papers (2024-08-30T17:26:05Z)
- Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering [8.382903851560595]
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content.
Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems.
We propose a multimodal adversarial training architecture with spatial awareness capabilities.
arXiv Detail & Related papers (2024-03-14T11:22:06Z)
- LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods on our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR, and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- OCR Post Correction for Endangered Language Texts [113.8242302688894]
We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages.
We present a systematic analysis showing that general-purpose OCR tools are not robust in this data-scarce setting.
We develop an OCR post-correction method tailored to ease training in this data-scarce setting.
arXiv Detail & Related papers (2020-11-10T21:21:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.