Predicting the Ordering of Characters in Japanese Historical Documents
- URL: http://arxiv.org/abs/2106.06786v1
- Date: Sat, 12 Jun 2021 14:39:20 GMT
- Title: Predicting the Ordering of Characters in Japanese Historical Documents
- Authors: Alex Lamb, Tarin Clanuwat, Siyu Han, Mikel Bober-Irizar, Asanobu
Kitamoto
- Abstract summary: The change in the Japanese writing system in 1900 made historical documents inaccessible to the general public.
We explore a few approaches to the task of predicting the sequential ordering of the characters.
Our best-performing system achieves an accuracy of 98.65% and perfect accuracy on 49% of the books in our dataset.
- Score: 6.82324732276004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Japan is a unique country with a distinct cultural heritage, which is
reflected in billions of historical documents that have been preserved. However,
the change in the Japanese writing system in 1900 made these documents
inaccessible to the general public. A major research project has been to make
these historical documents accessible and understandable. An increasing amount
of research has focused on the character recognition task and on locating
characters in images, yet less research has focused on how to predict the
sequential ordering of the characters. This is because the reading sequence in
classical Japanese is very different from that of modern Japanese. Ordering
characters into a sequence is important for making the document text easily
readable and searchable. Additionally, it is a necessary step for any kind of
natural language processing on the data (e.g. machine translation, language
modeling, and word embeddings). We explore a few approaches to the task of
predicting the sequential ordering of the characters: one using simple
hand-crafted rules, another using hand-crafted rules with adaptive thresholds,
and another using a deep recurrent sequence model trained with teacher forcing.
We provide a quantitative and qualitative comparison of these techniques as
well as their distinct trade-offs. Our best-performing system achieves an
accuracy of 98.65% and perfect accuracy on 49% of the books in our dataset,
suggesting that the technique is able to predict the order of the characters
well enough for many tasks.
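The paper's implementations are not reproduced here, but the approaches it names follow well-known patterns. Below is a minimal sketch of the kind of hand-crafted ordering rule the abstract describes, under the assumption that each detected character is reduced to an (x, y) center point; the function name and the `column_gap` threshold are illustrative, not the authors' code. It encodes the convention that classical Japanese text is read top-to-bottom within a column and right-to-left across columns; the adaptive-threshold variant would presumably derive the gap from the document (e.g. from median character size) rather than fixing it.

```python
# Sketch of a rule-based reading-order heuristic (assumed names and
# threshold; not the paper's implementation).
from typing import List, Tuple

Box = Tuple[float, float]  # (x_center, y_center) of one detected character

def order_characters(boxes: List[Box], column_gap: float = 30.0) -> List[int]:
    """Return indices into `boxes` in predicted reading order."""
    if not boxes:
        return []
    # Scan characters right-to-left; a large horizontal jump starts a new
    # column. An adaptive variant would choose `column_gap` per document.
    idx = sorted(range(len(boxes)), key=lambda i: -boxes[i][0])
    columns, current = [], [idx[0]]
    for i in idx[1:]:
        if boxes[current[-1]][0] - boxes[i][0] > column_gap:
            columns.append(current)
            current = [i]
        else:
            current.append(i)
    columns.append(current)
    # Within each column, read top-to-bottom (ascending y).
    order: List[int] = []
    for col in columns:
        order.extend(sorted(col, key=lambda i: boxes[i][1]))
    return order
```

For the third approach, the following is a generic sketch of the teacher-forcing training pattern for a recurrent ordering model: at each step the decoder consumes the ground-truth previous character rather than its own prediction. The architecture, shapes, and pointer-style scoring are assumptions for illustration, not the paper's model.

```python
# Generic teacher-forcing sketch (assumed architecture and shapes).
import torch
import torch.nn as nn

class OrderDecoder(nn.Module):
    def __init__(self, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, prev_feats: torch.Tensor, cand_feats: torch.Tensor):
        # prev_feats: (B, T, F) features of the *ground-truth* previous
        # characters -- this substitution is what "teacher forcing" means.
        # cand_feats: (B, N, F) features of all detected characters.
        h, _ = self.rnn(prev_feats)          # (B, T, hidden)
        q = self.proj(h)                     # (B, T, F)
        # Pointer-style logits: score every candidate at every step.
        return torch.einsum("btf,bnf->btn", q, cand_feats)

model = OrderDecoder()
B, T, N, F = 2, 5, 6, 64
prev = torch.randn(B, T, F)                  # gold sequence, shifted by one
cands = torch.randn(B, N, F)
targets = torch.randint(0, N, (B, T))        # index of the true next character
logits = model(prev, cands)                  # (B, T, N)
loss = nn.functional.cross_entropy(logits.reshape(-1, N), targets.reshape(-1))
loss.backward()
```

At inference time no gold sequence exists, so the model would instead feed back its own previous prediction step by step; the gap between these two regimes is the usual trade-off of teacher forcing.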
Related papers
- Tails Tell Tales: Chapter-Wide Manga Transcriptions with Character Names [53.24414727354768]
This paper aims to generate a dialogue transcript of a complete manga chapter, entirely automatically.
It involves identifying what is being said, i.e. detecting the texts on each page and classifying them as essential vs. non-essential.
It also ensures the same characters are named consistently throughout the chapter.
arXiv Detail & Related papers (2024-08-01T05:47:04Z)
- Don't lose the message while paraphrasing: A study on content preserving style transfer [61.38460184163704]
Content preservation is critical for real-world applications of style transfer studies.
We compare various style transfer models on the example of the formality transfer domain.
We conduct a precise comparative study of several state-of-the-art techniques for style transfer.
arXiv Detail & Related papers (2023-08-17T15:41:08Z)
- Kanbun-LM: Reading and Translating Classical Chinese in Japanese Methods by Language Models [17.749113496737106]
We construct the first Classical-Chinese-to-Kanbun dataset in the world.
Character reordering and machine translation play a significant role in Kanbun comprehension.
We release our code and dataset on GitHub.
arXiv Detail & Related papers (2023-05-22T06:30:02Z)
- PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
- Restoring and Mining the Records of the Joseon Dynasty via Neural Language Modeling and Machine Translation [20.497110880878544]
We present a multi-task learning approach to restore and translate historical documents based on a self-attention mechanism.
Our approach significantly improves the accuracy of the translation task compared to baselines without multi-task learning.
arXiv Detail & Related papers (2021-04-13T06:40:25Z)
- Deep Learning for Text Style Transfer: A Survey [71.8870854396927]
Text style transfer is an important task in natural language generation, which aims to control certain attributes in the generated text.
We present a systematic survey of the research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017.
We discuss the task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies in the presence of parallel and non-parallel data.
arXiv Detail & Related papers (2020-11-01T04:04:43Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain the de facto standard for evaluating tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
- Automated Transcription for Pre-Modern Japanese Kuzushiji Documents by Random Lines Erasure and Curriculum Learning [6.700873164609009]
Most previous methods divided the recognition process into character segmentation and recognition.
In this paper, we enlarge our previous human-inspired recognition system from multiple lines to full pages of Kuzushiji documents.
To address the lack of training data, we propose a random text line erasure approach that randomly erases text lines and distorts documents.
arXiv Detail & Related papers (2020-05-06T09:17:28Z)
- KaoKore: A Pre-modern Japanese Art Facial Expression Dataset [8.987910033541239]
We propose KaoKore, a new dataset consisting of faces extracted from pre-modern Japanese artwork.
We demonstrate its value both as a dataset for image classification and as a creative and artistic dataset, which we explore using generative models.
arXiv Detail & Related papers (2020-02-20T07:22:13Z)