A Framework For Refining Text Classification and Object Recognition from Academic Articles
- URL: http://arxiv.org/abs/2305.17401v4
- Date: Wed, 3 Jul 2024 01:46:32 GMT
- Title: A Framework For Refining Text Classification and Object Recognition from Academic Articles
- Authors: Jinghong Li, Koichi Ota, Wen Gu, Shinobu Hasegawa,
- Abstract summary: Current data mining methods for academic articles employ rule-based(RB) or machine learning(ML) approaches.
We have developed a novel Text Block Refinement Framework (TBRF), a machine learning and rule-based scheme hybrid.
- Score: 2.699900017799093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the widespread use of the internet, it has become increasingly crucial to extract specific information from vast amounts of academic articles efficiently. Data mining techniques are generally employed to solve this issue. However, data mining for academic articles is challenging since it requires automatically extracting specific patterns in complex and unstructured layout documents. Current data mining methods for academic articles employ rule-based(RB) or machine learning(ML) approaches. However, using rule-based methods incurs a high coding cost for complex typesetting articles. On the other hand, simply using machine learning methods requires annotation work for complex content types within the paper, which can be costly. Furthermore, only using machine learning can lead to cases where patterns easily recognized by rule-based methods are mistakenly extracted. To overcome these issues, from the perspective of analyzing the standard layout and typesetting used in the specified publication, we emphasize implementing specific methods for specific characteristics in academic articles. We have developed a novel Text Block Refinement Framework (TBRF), a machine learning and rule-based scheme hybrid. We used the well-known ACL proceeding articles as experimental data for the validation experiment. The experiment shows that our approach achieved over 95% classification accuracy and 90% detection accuracy for tables and figures.
Related papers
- Topological Methods in Machine Learning: A Tutorial for Practitioners [4.297070083645049]
Topological Machine Learning (TML) is an emerging field that leverages techniques from algebraic topology to analyze complex data structures.
This tutorial provides a comprehensive introduction to two key TML techniques, persistent homology and the Mapper algorithm.
To enhance accessibility, we adopt a data-centric approach, enabling readers to gain hands-on experience applying these techniques to relevant tasks.
arXiv Detail & Related papers (2024-09-04T17:44:52Z) - Deep Learning-Driven Approach for Handwritten Chinese Character Classification [0.0]
Handwritten character recognition is a challenging problem for machine learning researchers.
With numerous unique character classes present, some data, such as Logographic Scripts or Sino-Korean character sequences, bring new complications to the HCR problem.
This paper proposes a highly scalable approach for detailed character image classification by introducing the model architecture, data preprocessing steps, and testing design instructions.
arXiv Detail & Related papers (2024-01-30T15:29:32Z) - Object Recognition from Scientific Document based on Compartment Refinement Framework [2.699900017799093]
It has become increasingly important to extract valuable information from vast resources efficiently.
Current data extraction methods for scientific documents typically use rule-based (RB) or machine learning (ML) approaches.
We propose a new document layout analysis framework called CTBR(Compartment & Text Blocks Refinement)
arXiv Detail & Related papers (2023-12-14T15:36:49Z) - Text Classification: A Perspective of Deep Learning Methods [0.0679877553227375]
This paper introduces deep learning-based text classification algorithms, including important steps required for text classification tasks.
At the end of the article, different deep learning text classification methods are compared and summarized.
arXiv Detail & Related papers (2023-09-24T21:49:51Z) - Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z) - Annotation Error Detection: Analyzing the Past and Present for a More
Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z) - Bias and unfairness in machine learning models: a systematic literature
review [43.55994393060723]
This study aims to examine existing knowledge on bias and unfairness in Machine Learning models.
A Systematic Literature Review found 40 eligible articles published between 2017 and 2022 in the Scopus, IEEE Xplore, Web of Science, and Google Scholar knowledge bases.
arXiv Detail & Related papers (2022-02-16T16:27:00Z) - Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or
Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms.
Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications.
By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z) - Active Learning from Crowd in Document Screening [76.9545252341746]
We focus on building a set of machine learning classifiers that evaluate documents, and then screen them efficiently.
We propose a multi-label active learning screening specific sampling technique -- objective-aware sampling.
We demonstrate that objective-aware sampling significantly outperforms the state of the art active learning sampling strategies.
arXiv Detail & Related papers (2020-11-11T16:17:28Z) - Overcoming the curse of dimensionality with Laplacian regularization in
semi-supervised learning [80.20302993614594]
We provide a statistical analysis to overcome drawbacks of Laplacian regularization.
We unveil a large body of spectral filtering methods that exhibit desirable behaviors.
We provide realistic computational guidelines in order to make our method usable with large amounts of data.
arXiv Detail & Related papers (2020-09-09T14:28:54Z) - Bayesian active learning for production, a systematic study and a
reusable library [85.32971950095742]
In this paper, we analyse the main drawbacks of current active learning techniques.
We do a systematic study on the effects of the most common issues of real-world datasets on the deep active learning process.
We derive two techniques that can speed up the active learning loop such as partial uncertainty sampling and larger query size.
arXiv Detail & Related papers (2020-06-17T14:51:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.