Applications of Machine Learning in Document Digitisation
- URL: http://arxiv.org/abs/2102.03239v1
- Date: Fri, 5 Feb 2021 15:35:28 GMT
- Title: Applications of Machine Learning in Document Digitisation
- Authors: Christian M. Dahl, Torben S. D. Johansen, Emil N. Sørensen,
Christian E. Westermann and Simon F. Wittrock
- Abstract summary: We advocate the use of modern machine learning techniques to automate the digitisation process.
We give an overview of the potential for applying machine digitisation for data collection through two illustrative applications.
The first demonstrates that unsupervised layout classification applied to raw scans of nurse journals can be used to construct a treatment indicator.
The second application uses attention-based neural networks for handwritten text recognition in order to transcribe age and birth and death dates from a large collection of Danish death certificates.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data acquisition forms the primary step in all empirical research. The
availability of data directly impacts the quality and extent of conclusions and
insights. In particular, larger and more detailed datasets provide convincing
answers even to complex research questions. The main problem is that 'large and
detailed' usually implies 'costly and difficult', especially when the data
medium is paper and books. Human operators and manual transcription have been
the traditional approach for collecting historical data. We instead advocate
the use of modern machine learning techniques to automate the digitisation
process. We give an overview of the potential for applying machine digitisation
for data collection through two illustrative applications. The first
demonstrates that unsupervised layout classification applied to raw scans of
nurse journals can be used to construct a treatment indicator. Moreover, it
allows an assessment of assignment compliance. The second application uses
attention-based neural networks for handwritten text recognition in order to
transcribe age and birth and death dates from a large collection of Danish
death certificates. We describe each step in the digitisation pipeline and
provide implementation insights.
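As a rough illustration of the first application, the sketch below clusters pages by layout without labels and turns the cluster assignment into a binary indicator. It is not the authors' pipeline: the synthetic pages, the row-wise ink-density feature, the plain k-means routine, and every function name here are assumptions made for the example. Which cluster corresponds to treatment would have to be checked by inspecting a few pages.

```python
import random

def row_profile(page):
    """Mean ink density per row: a crude, hypothetical layout descriptor."""
    return [sum(row) / len(row) for row in page]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def synthetic_page(layout, rng, height=12, width=16):
    """Stand-in for a binarised scan (1 = ink): layouts differ in band position."""
    band = range(0, 3) if layout == "A" else range(6, 9)
    return [
        [1 if (r in band or rng.random() < 0.05) else 0 for _ in range(width)]
        for r in range(height)
    ]

rng = random.Random(0)
true_layouts = ["A", "B"] * 10          # known here only because data is synthetic
pages = [synthetic_page(layout, rng) for layout in true_layouts]

labels = kmeans([row_profile(p) for p in pages], k=2)

# Assumed convention for this example: the cluster of the first page is "treated".
treated = [int(lab == labels[0]) for lab in labels]
```

On real scans the feature would likely need to be richer than a row profile, and the cluster-to-treatment mapping should be validated against a small hand-labelled sample.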
Related papers
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- A Novel Dataset for Non-Destructive Inspection of Handwritten Documents [0.0]
Forensic handwriting examination aims to examine handwritten documents in order to properly define or hypothesize the manuscript's author.
We propose a new and challenging dataset consisting of two subsets: the first consists of 21 documents either written by the classic "pen and paper" approach (and later digitized) or directly acquired on common devices such as tablets.
Preliminary results on the proposed datasets show that 90% classification accuracy can be achieved on the first subset.
arXiv Detail & Related papers (2024-01-09T09:25:58Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- Questions Are All You Need to Train a Dense Passage Retriever [123.13872383489172]
ART is a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data.
It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question.
arXiv Detail & Related papers (2022-06-21T18:16:31Z)
- Toward Educator-focused Automated Scoring Systems for Reading and Writing [0.0]
This paper addresses the challenges of data and label availability, authentic and extended writing, domain scoring, prompt and source variety, and transfer learning.
It employs techniques that preserve essay length as an important feature without increasing model training costs.
arXiv Detail & Related papers (2021-12-22T15:44:30Z)
- Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms.
Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications.
By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z)
- Small data problems in political research: a critical replication study [5.698280399449707]
We show that the small data causes the classification model to be highly sensitive to variations in the random train-test split.
We also show that the applied preprocessing causes the data to be extremely sparse.
Based on our findings, we argue that A&W's conclusions regarding the automated classification of organizational reputation tweets cannot be maintained.
arXiv Detail & Related papers (2021-09-27T09:55:58Z)
- Scaling Systematic Literature Reviews with Machine Learning Pipelines [57.82662094602138]
Systematic reviews entail the extraction of data from scientific documents.
We construct a pipeline that automates each of these aspects, and experiment with many human-time vs. system quality trade-offs.
We find that we can get surprising accuracy and generalisability of the whole pipeline system with only 2 weeks of human-expert annotation.
arXiv Detail & Related papers (2020-10-09T16:19:42Z)
- Machine Identification of High Impact Research through Text and Image Analysis [0.4737991126491218]
We present a system to automatically separate papers with a high from those with a low likelihood of gaining citations.
Our system uses both a visual classifier, useful for surmising a document's overall appearance, and a text classifier, for making content-informed decisions.
arXiv Detail & Related papers (2020-05-20T19:12:24Z)
- Laplacian Denoising Autoencoder [114.21219514831343]
We propose to learn data representations with a novel type of denoising autoencoder.
The noisy input data is generated by corrupting latent clean data in the gradient domain.
Experiments on several visual benchmarks demonstrate that better representations can be learned with the proposed approach.
arXiv Detail & Related papers (2020-03-30T16:52:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.