Document Type Classification using File Names
- URL: http://arxiv.org/abs/2410.01166v1
- Date: Wed, 2 Oct 2024 01:42:19 GMT
- Title: Document Type Classification using File Names
- Authors: Zhijian Li, Stefan Larson, Kevin Leach
- Abstract summary: Rapid document classification is critical in several time-sensitive applications like digital forensics and large-scale media classification.
Traditional approaches that rely on heavy-duty deep learning models fall short due to high inference times over vast input datasets.
We present a method using lightweight supervised learning models, combined with a TF-IDF feature extraction-based tokenization method.
- Score: 7.130525292849283
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Rapid document classification is critical in several time-sensitive applications like digital forensics and large-scale media classification. Traditional approaches that rely on heavy-duty deep learning models fall short because of their high inference times over vast input datasets and the computational resources needed to analyze whole documents. In this paper, we present a method that uses lightweight supervised learning models, combined with a TF-IDF feature extraction-based tokenization method, to classify documents accurately and efficiently based solely on their file names, which substantially reduces inference time. This approach can distinguish ambiguous file names from indicative file names through confidence scores and through a negative class representing ambiguous file names. Our results indicate that file name classifiers can process more than 80% of the in-scope data with 96.7% accuracy when tested on a dataset with a large portion of out-of-scope data with respect to the training dataset, while being 442.43x faster than more complex models such as DiT. Our method offers a crucial solution for efficiently processing vast datasets in critical scenarios, enabling fast, more reliable document classification.
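The idea in the abstract (TF-IDF features over file name tokens, a lightweight classifier, and a confidence score that defers ambiguous names) can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenization rule, the nearest-centroid classifier, the 0.5 threshold, and the toy file names below are all assumptions made for illustration, and the paper's negative class for ambiguous names is not modeled here.

```python
import math
import os
import re
from collections import Counter, defaultdict


def tokenize(file_name):
    """Drop the extension and split the stem into lowercase alphabetic tokens
    (a hypothetical tokenization rule, not the one from the paper)."""
    stem = os.path.splitext(file_name)[0]
    return re.findall(r"[a-z]+", stem.lower())


class FileNameClassifier:
    """Nearest-centroid classifier over TF-IDF vectors of file name tokens."""

    def fit(self, names, labels):
        docs = [tokenize(n) for n in names]
        doc_freq = Counter(tok for doc in docs for tok in set(doc))
        # Smoothed inverse document frequency for every training token.
        self.idf = {
            tok: math.log((1 + len(docs)) / (1 + df)) + 1
            for tok, df in doc_freq.items()
        }
        sums = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            sums[label].update(self._vector(doc))
        # One unit-length centroid per class.
        self.centroids = {lab: self._normalize(vec) for lab, vec in sums.items()}
        return self

    def _vector(self, tokens):
        # Tokens never seen in training are ignored.
        tf = Counter(t for t in tokens if t in self.idf)
        return self._normalize({t: c * self.idf[t] for t, c in tf.items()})

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {t: v / norm for t, v in vec.items()} if norm else {}

    def predict(self, name, threshold=0.5):
        """Return (label, confidence); label is None when the best cosine
        similarity falls below the (assumed) threshold."""
        vec = self._vector(tokenize(name))
        scores = {
            lab: sum(w * cen.get(t, 0.0) for t, w in vec.items())
            for lab, cen in self.centroids.items()
        }
        label, score = max(scores.items(), key=lambda kv: kv[1])
        return (label, score) if score >= threshold else (None, score)


# Toy training data, invented for this example.
names = ["invoice_2023_q1.pdf", "acme_invoice_final.pdf",
         "resume_jane_doe.docx", "john_smith_resume.pdf"]
labels = ["invoice", "invoice", "resume", "resume"]
clf = FileNameClassifier().fit(names, labels)

print(clf.predict("invoice_march.pdf"))  # indicative name: classified
print(clf.predict("scan0001.pdf"))       # ambiguous name: deferred (label None)
```

In a setup like this, file names that fall below the threshold would be passed on to a slower content-based model, so the expensive model only sees the names that carry no usable signal.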
Related papers
- M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning).
It introduces open words from WordNet to extend the words forming the prompt texts beyond the closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z)
- FastClass: A Time-Efficient Approach to Weakly-Supervised Text Classification [14.918600168973564]
This paper proposes FastClass, an efficient weakly-supervised classification approach.
It uses dense text representation to retrieve class-relevant documents from an external unlabeled corpus.
Experiments show that the proposed approach frequently outperforms keyword-driven models in terms of classification accuracy and often enjoys orders-of-magnitude faster training speed.
arXiv Detail & Related papers (2022-12-11T13:43:22Z)
- Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval on Predefined Topics [0.6767885381740952]
We introduce a method that learns jointly embedded document and word vectors solely from the unlabeled document dataset.
The proposed method requires almost no text preprocessing but is simultaneously effective at retrieving relevant documents with high probability.
For easy replication of our approach, we make the developed Lbl2Vec code publicly available as a ready-to-use tool under the 3-Clause BSD license.
arXiv Detail & Related papers (2022-10-12T08:57:01Z)
- Efficient Classification of Long Documents Using Transformers [13.927622630633344]
We evaluate the relative efficacy measured against various baselines and diverse datasets.
Results show that more complex models often fail to outperform simple baselines and yield inconsistent performance across datasets.
arXiv Detail & Related papers (2022-03-21T18:36:18Z)
- Trustable Co-label Learning from Multiple Noisy Annotators [68.59187658490804]
Supervised deep learning depends on massive accurately annotated examples.
A typical alternative is learning from multiple noisy annotators.
This paper proposes a data-efficient approach called Trustable Co-label Learning (TCL).
arXiv Detail & Related papers (2022-03-08T16:57:00Z)
- Improving Probabilistic Models in Text Classification via Active Learning [0.0]
We propose a fast new model for text classification that combines information from both labeled and unlabeled data with an active learning component.
We show that by introducing information about the structure of unlabeled data and iteratively labeling uncertain documents, our model improves performance.
arXiv Detail & Related papers (2022-02-05T20:09:26Z)
- Out-of-Category Document Identification Using Target-Category Names as Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
- LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z)
- Web Document Categorization Using Naive Bayes Classifier and Latent Semantic Analysis [0.7310043452300736]
The rapid growth of web documents necessitates efficient techniques for classifying documents on the web.
We propose a method for web document classification that uses LSA to increase the similarity of documents in the same class and improve classification precision.
Experimental results show that this preprocessing can improve the accuracy and speed of Naive Bayes.
arXiv Detail & Related papers (2020-06-02T15:35:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.