OCR Post Correction for Endangered Language Texts
- URL: http://arxiv.org/abs/2011.05402v1
- Date: Tue, 10 Nov 2020 21:21:08 GMT
- Title: OCR Post Correction for Endangered Language Texts
- Authors: Shruti Rijhwani, Antonios Anastasopoulos, Graham Neubig
- Abstract summary: We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages.
We present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting.
We develop an OCR post-correction method tailored to ease training in this data-scarce setting.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is little to no data available to build natural language processing
models for most endangered languages. However, textual data in these languages
often exists in formats that are not machine-readable, such as paper books and
scanned images. In this work, we address the task of extracting text from these
resources. We create a benchmark dataset of transcriptions for scanned books in
three critically endangered languages and present a systematic analysis of how
general-purpose OCR tools are not robust to the data-scarce setting of
endangered languages. We develop an OCR post-correction method tailored to ease
training in this data-scarce setting, reducing the recognition error rate by
34% on average across the three languages.
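The abstract's headline number is a relative reduction in recognition error rate. As a point of reference, character error rate (CER) is conventionally computed as the edit distance between the system output and the gold transcription, normalized by the reference length; the sketch below is a minimal illustration of that metric and of a relative reduction, not the paper's implementation (the function names and the example numbers are hypothetical).

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edits per reference character."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def relative_reduction(before: float, after: float) -> float:
    """Relative error-rate reduction, e.g. 0.34 for a 34% drop."""
    return (before - after) / before
```

For example, a system that lowers CER from 0.30 to 0.198 achieves a 34% relative reduction, the figure the abstract reports as an average across the three languages.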
Related papers
- DriveThru: a Document Extraction Platform and Benchmark Datasets for Indonesian Local Language Archives [6.599829213637133]
Indonesia is one of the most linguistically diverse countries.
Despite this linguistic diversity, Indonesian languages remain underrepresented in Natural Language Processing research and technologies.
We propose an alternative method of creating datasets by digitizing documents, which have not previously been used to build digital language resources in Indonesia.
arXiv Detail & Related papers (2024-11-14T10:00:33Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Soft Prompt Decoding for Multilingual Dense Retrieval [30.766917713997355]
We show that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance.
This is due to the heterogeneous and imbalanced nature of multilingual collections.
We present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space.
arXiv Detail & Related papers (2023-05-15T21:17:17Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- Noisy Parallel Data Alignment [36.578851892373365]
We study the existing word-level alignment models under noisy settings and aim to make them more robust to noisy data.
Our noise simulation and structural biasing method, tested on multiple language pairs, reduces the alignment error rate of a state-of-the-art neural alignment model by up to 59.6%.
arXiv Detail & Related papers (2023-01-23T19:26:34Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.