User-Centric Evaluation of OCR Systems for Kwak'wala
- URL: http://arxiv.org/abs/2302.13410v1
- Date: Sun, 26 Feb 2023 21:41:15 GMT
- Title: User-Centric Evaluation of OCR Systems for Kwak'wala
- Authors: Shruti Rijhwani, Daisy Rosenblum, Michayla King, Antonios Anastasopoulos, Graham Neubig
- Abstract summary: We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
- Score: 92.73847703011353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been recent interest in improving optical character recognition
(OCR) for endangered languages, particularly because a large number of
documents and books in these languages are not in machine-readable formats. The
performance of OCR systems is typically evaluated using automatic metrics such
as character and word error rates. While error rates are useful for the
comparison of different models and systems, they do not measure whether and how
the transcriptions produced from OCR tools are useful to downstream users. In
this paper, we present a human-centric evaluation of OCR systems, focusing on
the Kwak'wala language as a case study. With a user study, we show that
utilizing OCR reduces the time spent in the manual transcription of culturally
valuable documents -- a task that is often undertaken by endangered language
community members and researchers -- by over 50%. Our results demonstrate the
potential benefits that OCR tools can have on downstream language documentation
and revitalization efforts.
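For readers unfamiliar with the automatic metrics named in the abstract, the following is a minimal sketch of character error rate (CER) and word error rate (WER) under their usual definition: edit distance between the OCR output and a reference transcription, divided by the reference length. The function names are illustrative, and this is not necessarily the implementation used in the paper. It compares Unicode code points as given, so for text with combining diacritics one would likely normalize both strings (for example to NFC) first; that step is an assumption and is left out here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Made-up strings for illustration only, not data from the paper.
    print(cer("abcd", "abxd"))
    print(wer("the cat sat", "the bat sat"))
```

Both rates can exceed 1.0 when the OCR output is much longer than the reference, which is one reason they are hard to read as a direct measure of how much correction effort a transcriber saves.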
Related papers
- CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models [0.0]
This paper introduces Context Leveraging OCR Correction (CLOCR-C).
It uses the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality.
The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing socio-cultural context as part of the correction process.
arXiv Detail & Related papers (2024-08-30T17:26:05Z)
- EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- TransDocs: Optical Character Recognition with word to word translation [2.2336243882030025]
This research work focuses on improving optical character recognition (OCR) with ML techniques.
This work is based on the ANKI dataset for English to Spanish translation.
arXiv Detail & Related papers (2023-04-15T21:40:14Z)
- OCR Improves Machine Translation for Low-Resource Languages [10.010595434359647]
We introduce and make publicly available a novel benchmark, OCR4MT, consisting of real and synthetic data, enriched with noise.
We evaluate state-of-the-art OCR systems on our benchmark and analyse the most common errors.
We then perform an ablation study to investigate how OCR errors impact Machine Translation performance.
arXiv Detail & Related papers (2022-02-27T02:36:45Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- OCR Post Correction for Endangered Language Texts [113.8242302688894]
We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages.
We present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting.
We develop an OCR post-correction method tailored to ease training in this data-scarce setting.
arXiv Detail & Related papers (2020-11-10T21:21:08Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community to consider more carefully how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)