Natural Language Processing Models for Robust Document Categorization
- URL: http://arxiv.org/abs/2602.20336v1
- Date: Mon, 23 Feb 2026 20:33:22 GMT
- Title: Natural Language Processing Models for Robust Document Categorization
- Authors: Radoslaw Roszczyk, Pawel Tecza, Maciej Stodolski, Krzysztof Siwek,
- Abstract summary: The study focuses on balancing classification accuracy with computational efficiency, a key consideration when integrating AI into real world automation pipelines.<n>Three models of varying complexity were examined: a Naive Bayes classifier, a bidirectional LSTM network, and a fine tuned transformer based BERT model.<n>The experiments reveal substantial differences in performance. BERT achieved the highest accuracy, consistently exceeding 99%, but required significantly longer training times and greater computational resources.<n>The BiLSTM model provided a strong compromise, reaching approximately 98.56% accuracy while maintaining moderate training costs and offering robust contextual understanding.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This article presents an evaluation of several machine learning methods applied to automated text classification, alongside the design of a demonstrative system for unbalanced document categorization and distribution. The study focuses on balancing classification accuracy with computational efficiency, a key consideration when integrating AI into real world automation pipelines. Three models of varying complexity were examined: a Naive Bayes classifier, a bidirectional LSTM network, and a fine tuned transformer based BERT model. The experiments reveal substantial differences in performance. BERT achieved the highest accuracy, consistently exceeding 99\%, but required significantly longer training times and greater computational resources. The BiLSTM model provided a strong compromise, reaching approximately 98.56\% accuracy while maintaining moderate training costs and offering robust contextual understanding. Naive Bayes proved to be the fastest to train, on the order of milliseconds, yet delivered the lowest accuracy, averaging around 94.5\%. Class imbalance influenced all methods, particularly in the recognition of minority categories. A fully functional demonstrative system was implemented to validate practical applicability, enabling automated routing of technical requests with throughput unattainable through manual processing. The study concludes that BiLSTM offers the most balanced solution for the examined scenario, while also outlining opportunities for future improvements and further exploration of transformer architectures.
Related papers
- Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols [123.73663884421272]
Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation algorithms.<n>We establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets.<n>By releasing FEWTRANS, we aim to provide a rigorous "ruler" to streamline reproducible advances in few-shot transfer learning research.
arXiv Detail & Related papers (2026-02-28T05:41:57Z) - Hybrid Synthetic Data Generation with Domain Randomization Enables Zero-Shot Vision-Based Part Inspection Under Extreme Class Imbalance [3.7696918637188817]
Training robust machine learning models requires large volumes of high-quality labeled data.<n>Defective samples are intrinsically rare, leading to severe class imbalance that degrades model performance.<n>Synthetic data generation offers a promising solution by enabling the creation of large, balanced, and fully annotated datasets.
arXiv Detail & Related papers (2025-11-28T05:30:49Z) - SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [58.05959902776133]
We introduce Single-Pass.<n>with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation.<n>We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP)<n>On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $sim$16% of training samples compared to human-labeled and other synthetically trained baselines.
arXiv Detail & Related papers (2025-06-18T14:37:59Z) - From Requirements to Test Cases: An NLP-Based Approach for High-Performance ECU Test Case Automation [0.5249805590164901]
This study investigates the use of Natural Language Processing techniques to transform natural language requirements into structured test case specifications.<n>A dataset of 400 feature element documents was used to evaluate both approaches for extracting key elements such as signal names and values.<n>The Rule-Based method outperforms the NER method, achieving 95% accuracy for more straightforward requirements with single signals.
arXiv Detail & Related papers (2025-05-01T14:23:55Z) - iFuzzyTL: Interpretable Fuzzy Transfer Learning for SSVEP BCI System [24.898026682692688]
This study explores advanced classification techniques leveraging interpretable fuzzy transfer learning (iFuzzyTL)
iFuzzyTL refines input signal processing and classification in a human-interpretable format by integrating fuzzy inference systems and attention mechanisms.
The model's efficacy is demonstrated across three datasets.
arXiv Detail & Related papers (2024-10-16T06:07:23Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - Uncertainty-aware Sampling for Long-tailed Semi-supervised Learning [89.98353600316285]
We introduce uncertainty into the modeling process for pseudo-label sampling, taking into account that the model performance on the tailed classes varies over different training stages.
This approach allows the model to perceive the uncertainty of pseudo-labels at different training stages, thereby adaptively adjusting the selection thresholds for different classes.
Compared to other methods such as the baseline method FixMatch, UDTS achieves an increase in accuracy of at least approximately 5.26%, 1.75%, 9.96%, and 1.28% on the natural scene image datasets.
arXiv Detail & Related papers (2024-01-09T08:59:39Z) - Robust Learning with Progressive Data Expansion Against Spurious
Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models, that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - Efficient Learning of Accurate Surrogates for Simulations of Complex Systems [0.0]
We introduce an online learning method empowered by sampling-driven sampling.
It ensures that all turning points on the model response surface are included in the training data.
We apply our method to simulations of nuclear matter to demonstrate that highly accurate surrogates can be reliably auto-generated.
arXiv Detail & Related papers (2022-07-11T20:51:11Z) - CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep
Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z) - An Automated Knowledge Mining and Document Classification System with
Multi-model Transfer Learning [1.1852751647387592]
Service manual documents are crucial to the engineering company as they provide guidelines and knowledge to service engineers.
We propose an automated knowledge mining and document classification system with novel multi-model transfer learning approaches.
arXiv Detail & Related papers (2021-06-24T03:03:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.