Comparative Study of Long Document Classification
- URL: http://arxiv.org/abs/2111.00702v1
- Date: Mon, 1 Nov 2021 04:51:51 GMT
- Title: Comparative Study of Long Document Classification
- Authors: Vedangi Wagh, Snehal Khandve, Isha Joshi, Apurva Wani, Geetanjali Kale, Raviraj Joshi
- Abstract summary: We revisit long document classification using standard machine learning approaches.
We benchmark approaches ranging from simple Naive Bayes to complex BERT on six standard text classification datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The amount of information stored in the form of documents on the internet has
been increasing rapidly. It has therefore become necessary to organize and
maintain these documents in an optimal manner. Text classification algorithms
study the complex relationships between words in a text and try to interpret
the semantics of the document. These algorithms have evolved significantly in
the past few years. There has been a lot of progress from simple machine
learning algorithms to transformer-based architectures. However, existing
literature has analyzed different approaches on different data sets thus making
it difficult to compare the performance of machine learning algorithms. In this
work, we revisit long document classification using standard machine learning
approaches. We benchmark approaches ranging from simple Naive Bayes to complex
BERT on six standard text classification datasets. We present an exhaustive
comparison of different algorithms on a range of long document datasets. We
reiterate that long document classification is a simpler task, and even basic
algorithms perform competitively with BERT-based approaches on most of the
datasets. The BERT-based models perform consistently well across all the
datasets and can safely be used as a default for document classification when
computation cost is not a concern. Among the shallow models, we suggest the
raw BiLSTM + Max architecture, which performs decently across all the
datasets. An even simpler GloVe + Attention bag-of-words model can be used
for simpler use cases. The importance of sophisticated models is clearly
visible on the IMDB sentiment dataset, which is a comparatively harder task.
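The "BiLSTM + Max" idea the abstract recommends, running a bidirectional LSTM over the tokens and taking an element-wise max over time to obtain a fixed-size document vector, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hidden states are random stand-ins for real BiLSTM outputs, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-token hidden states from a bidirectional LSTM:
# T tokens, hidden size H per direction -> (T, 2H) after concatenation.
T, H, n_classes = 50, 64, 4
hidden_states = rng.standard_normal((T, 2 * H))

# "BiLSTM + Max": element-wise max over the time axis yields a
# fixed-size document vector regardless of document length.
doc_vector = hidden_states.max(axis=0)  # shape (2H,)

# A linear classifier on top of the pooled vector.
W = rng.standard_normal((2 * H, n_classes))
b = np.zeros(n_classes)
logits = doc_vector @ W + b
predicted_class = int(np.argmax(logits))
```

Because the max is taken per dimension over all time steps, the document representation has a constant size, which is what makes this pooling attractive for variable-length long documents.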
Related papers
- A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z) - Document Provenance and Authentication through Authorship Classification [5.2545206693029884]
We propose an ensemble-based text-processing framework for the classification of single and multi-authored documents.
The proposed framework incorporates several state-of-the-art text classification algorithms.
The framework is evaluated on a large-scale benchmark dataset.
arXiv Detail & Related papers (2023-03-02T12:26:03Z) - Document-Level Abstractive Summarization [0.0]
We study how efficient Transformer techniques can be used to improve the automatic summarization of very long texts.
We propose a novel retrieval-enhanced approach which reduces the cost of generating a summary of the entire document by processing smaller chunks.
arXiv Detail & Related papers (2022-12-06T14:39:09Z) - Contextualization for the Organization of Text Documents Streams [0.0]
We present several experiments with some stream analysis methods to explore streams of text documents.
We use only dynamic algorithms to explore, analyze, and organize the flux of text documents.
arXiv Detail & Related papers (2022-05-30T22:25:40Z) - Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that does not force any structure on the search space: using all n-grams in a passage as its possible identifiers.
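The core idea, treating every n-gram occurring in a passage as a candidate identifier for it, can be sketched as follows. This is a minimal illustration of the identifier set only, not the paper's retrieval implementation; the function name and parameters are hypothetical.

```python
def ngram_identifiers(tokens, max_n=3):
    """All n-grams (n = 1..max_n) of a token sequence, each a
    possible identifier for the passage containing them."""
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams

passage = "long document classification is a simpler task".split()
ids = ngram_identifiers(passage, max_n=2)
```

Any n-gram an autoregressive model generates can then be matched against these sets to point back to the passages that contain it, with no imposed hierarchy over the search space.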
arXiv Detail & Related papers (2022-04-22T10:45:01Z) - Efficient Classification of Long Documents Using Transformers [13.927622630633344]
We evaluate the relative efficacy measured against various baselines and diverse datasets.
Results show that more complex models often fail to outperform simple baselines and yield inconsistent performance across datasets.
arXiv Detail & Related papers (2022-03-21T18:36:18Z) - Hierarchical Neural Network Approaches for Long Document Classification [3.6700088931938835]
We employ pre-trained Universal Sentence Encoder (USE) and Bidirectional Encoder Representations from Transformers (BERT) in a hierarchical setup to capture better representations efficiently.
Our proposed models are conceptually simple where we divide the input data into chunks and then pass this through base models of BERT and USE.
We show that USE + CNN/LSTM performs better than its stand-alone baseline, whereas BERT + CNN/LSTM performs on par with its stand-alone counterpart.
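The chunking scheme these hierarchical models describe can be sketched as follows. Here `encode_chunk` is a hypothetical stand-in for a pre-trained encoder such as BERT or USE, and mean pooling stands in for the CNN/LSTM aggregator; chunk size and dimensions are illustrative assumptions.

```python
import numpy as np

def chunk_tokens(tokens, chunk_size=512):
    """Split a long token sequence into fixed-size chunks."""
    return [tokens[i:i + chunk_size]
            for i in range(0, len(tokens), chunk_size)]

def encode_chunk(chunk, dim=32):
    """Stand-in for a pre-trained chunk encoder (BERT/USE would
    return a fixed-size vector per chunk; here we fake one)."""
    local = np.random.default_rng(abs(hash(tuple(chunk))) % (2**32))
    return local.standard_normal(dim)

tokens = [f"tok{i}" for i in range(1300)]    # a "long document"
chunks = chunk_tokens(tokens)                # 3 chunks of <= 512 tokens
chunk_vectors = np.stack([encode_chunk(c) for c in chunks])  # (3, 32)

# The second stage (CNN/LSTM in the paper) aggregates chunk vectors;
# mean pooling is the simplest stand-in for that aggregator.
doc_vector = chunk_vectors.mean(axis=0)
```

Splitting first and aggregating chunk vectors afterwards is what lets a fixed-input encoder like BERT (512-token limit) handle documents far longer than its maximum input length.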
arXiv Detail & Related papers (2022-01-18T07:17:40Z) - Document-Level Text Simplification: Dataset, Criteria and Baseline [75.58761130635824]
We define and investigate a new task of document-level text simplification.
Based on Wikipedia dumps, we first construct a large-scale dataset named D-Wikipedia.
We propose a new automatic evaluation metric called D-SARI that is more suitable for the document-level simplification task.
arXiv Detail & Related papers (2021-10-11T08:15:31Z) - Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and the results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z) - Pre-training Tasks for Embedding-based Large-scale Retrieval [68.01167604281578]
We consider the large-scale query-document retrieval problem.
Given a query (e.g., a question), return the set of relevant documents from a large document corpus.
We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks.
arXiv Detail & Related papers (2020-02-10T16:44:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.