DocLangID: Improving Few-Shot Training to Identify the Language of
Historical Documents
- URL: http://arxiv.org/abs/2305.02208v1
- Date: Wed, 3 May 2023 15:45:30 GMT
- Title: DocLangID: Improving Few-Shot Training to Identify the Language of
Historical Documents
- Authors: Furkan Simsek, Brian Pfitzmann, Hendrik Raetz, Jona Otholt, Haojin
Yang, Christoph Meinel
- Abstract summary: Language identification describes the task of recognizing the language of written text in documents.
We propose DocLangID, a transfer learning approach to identify the language of unlabeled historical documents.
- Score: 7.535751594024775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language identification describes the task of recognizing the language of
written text in documents. This information is crucial because it can be used
to support the analysis of a document's vocabulary and context. Supervised
learning methods in recent years have advanced the task of language
identification. However, these methods usually require large labeled datasets,
which are often unavailable for particular image domains, such as
documents or scene images. In this work, we propose DocLangID, a transfer
learning approach to identify the language of unlabeled historical documents.
We achieve this by first leveraging labeled data from a different but related
domain of historical documents. Second, we implement a distance-based
few-shot learning approach to adapt a convolutional neural network to new
languages of the unlabeled dataset. By introducing small amounts of manually
labeled examples from the set of unlabeled images, our feature extractor
develops better adaptability to new and different data distributions of
historical documents. We show that such a model can be effectively fine-tuned
for the unlabeled set of images by only reusing the same few-shot examples. We
showcase our work across 10 languages that mostly use the Latin script. Our
experiments on historical documents demonstrate that our combined approach
improves the language identification performance, achieving 74% recognition
accuracy on the four unseen languages of the unlabeled dataset.
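The distance-based few-shot step can be illustrated with a minimal, prototypical-network-style sketch (an illustrative assumption; the paper's CNN feature extractor and exact classifier are not reproduced here). A prototype per language is averaged from a few labeled support embeddings, and each query embedding is assigned to the language of its nearest prototype:

```python
import numpy as np

def prototypes(support_feats, support_labels):
    """Average the support embeddings of each class into one prototype vector."""
    classes = sorted(set(support_labels))
    labels = np.array(support_labels)
    protos = np.stack([support_feats[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(query_feats, classes, protos):
    """Assign each query embedding to the class with the nearest prototype."""
    # Pairwise Euclidean distances, shape (n_query, n_class)
    dists = np.linalg.norm(query_feats[:, None, :] - protos[None, :, :], axis=-1)
    return [classes[i] for i in dists.argmin(axis=1)]
```

In the few-shot setting, `support_feats` would hold the feature vectors of the small set of manually labeled examples, and `query_feats` the features of the remaining unlabeled document images.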
Related papers
- A Multi-Modal Multilingual Benchmark for Document Image Classification [21.7518357653137]
We introduce two newly curated multilingual datasets WIKI-DOC and MULTIEUR-DOCLEX.
We study popular visually-rich document understanding (Document AI) models in a previously untested setting: document image classification.
Experimental results show limitations of multilingual Document AI models on cross-lingual transfer across typologically distant languages.
arXiv Detail & Related papers (2023-10-25T04:35:06Z)
- Multimodal Modeling For Spoken Language Identification [57.94119986116947]
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance.
We propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification.
arXiv Detail & Related papers (2023-09-19T12:21:39Z)
- CSSL-MHTR: Continual Self-Supervised Learning for Scalable Multi-script Handwritten Text Recognition [16.987008461171065]
We explore the potential of continual self-supervised learning to alleviate the catastrophic forgetting problem in handwritten text recognition.
Our method consists of adding intermediate layers called adapters for each task, and efficiently distilling knowledge from the previous model while learning the current task.
We attain state-of-the-art performance on English, Italian and Russian scripts, whilst adding only a few parameters per task.
arXiv Detail & Related papers (2023-03-16T14:27:45Z)
- Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding [58.70423899829642]
We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding.
We show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains.
arXiv Detail & Related papers (2022-10-07T06:42:06Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and article bodies from language-aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- Handwriting Classification for the Analysis of Art-Historical Documents [6.918282834668529]
We focus on the analysis of handwriting in scanned documents from the art-historic archive of the WPI.
We propose a handwriting classification model that labels extracted text fragments based on their visual structure.
arXiv Detail & Related papers (2020-11-04T13:06:46Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
- Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance [8.395430195053061]
Document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other.
We develop an unsupervised scoring function that leverages cross-lingual sentence embeddings to compute the semantic distance between documents in different languages.
These semantic distances are then used to guide a document alignment algorithm to properly pair cross-lingual web documents across a variety of low, mid, and high-resource language pairs.
arXiv Detail & Related papers (2020-01-31T05:14:16Z)
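The alignment idea in the last entry can be sketched in simplified form. Note this is not the paper's Sentence-Mover's Distance: as a hedged stand-in, each document is collapsed to the mean of its (assumed precomputed) cross-lingual sentence embeddings, documents are compared by cosine distance, and pairs are formed greedily:

```python
import numpy as np

def doc_embedding(sentence_embeddings):
    """Collapse a document's sentence embeddings into one vector.
    Simplification: the paper uses a sentence-level mover's distance, not a mean."""
    return np.mean(sentence_embeddings, axis=0)

def semantic_distance(doc_a, doc_b):
    """Cosine distance between two document vectors."""
    sim = np.dot(doc_a, doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
    return 1.0 - sim

def align(docs_src, docs_tgt):
    """Greedily pair each source document with its nearest unused target document."""
    pairs, used = [], set()
    for i, a in enumerate(docs_src):
        d, j = min((semantic_distance(a, b), j)
                   for j, b in enumerate(docs_tgt) if j not in used)
        used.add(j)
        pairs.append((i, j))
    return pairs
```

A Hungarian-style optimal matching would replace the greedy loop in a production setting; the greedy variant keeps the sketch short.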
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.