Improving the Efficiency of Long Document Classification using Sentence Ranking Approach
- URL: http://arxiv.org/abs/2506.07248v2
- Date: Sun, 22 Jun 2025 07:31:53 GMT
- Title: Improving the Efficiency of Long Document Classification using Sentence Ranking Approach
- Authors: Prathamesh Kokate, Mitali Sarnaik, Manavi Khopade, Raviraj Joshi
- Abstract summary: We propose a TF-IDF-based sentence ranking method that improves efficiency by selecting the most informative content. Our approach explores fixed-count and percentage-based sentence selection, along with an enhanced scoring strategy combining normalized TF-IDF scores and sentence length. We achieve near-identical classification accuracy with just a 0.33 percent drop compared to the full-context baseline, while reducing input size by over 50 percent and inference latency by 43 percent.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long document classification poses challenges due to the computational limitations of transformer-based models, particularly BERT, which are constrained by fixed input lengths and quadratic attention complexity. Moreover, using the full document for classification is often redundant, as only a subset of sentences typically carries the necessary information. To address this, we propose a TF-IDF-based sentence ranking method that improves efficiency by selecting the most informative content. Our approach explores fixed-count and percentage-based sentence selection, along with an enhanced scoring strategy combining normalized TF-IDF scores and sentence length. Evaluated on the MahaNews LDC dataset of long Marathi news articles, the method consistently outperforms baselines such as first, last, and random sentence selection. With MahaBERT-v2, we achieve near-identical classification accuracy with just a 0.33 percent drop compared to the full-context baseline, while reducing input size by over 50 percent and inference latency by 43 percent. This demonstrates that significant context reduction is possible without sacrificing performance, making the method practical for real-world long document classification tasks.
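The abstract outlines the scoring and selection pipeline in enough detail to sketch it. Below is a minimal, hedged illustration of that idea: score each sentence by a combination of its normalized TF-IDF weight and its normalized length, then keep either a fixed number or a fixed percentage of sentences before classification. The weighting factor `alpha`, the per-document TF-IDF fitting, and the use of scikit-learn are assumptions for illustration, not the authors' exact implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def rank_sentences(sentences, top_k=None, top_pct=None, alpha=0.7):
    """Keep the highest-scoring sentences, preserving document order."""
    # TF-IDF fitted over the sentences of one document (an assumption; the
    # vectorizer could equally be fitted over the whole corpus).
    matrix = TfidfVectorizer().fit_transform(sentences)

    # Normalized TF-IDF score per sentence: sum of term weights, scaled to [0, 1].
    tfidf_scores = np.asarray(matrix.sum(axis=1)).ravel()
    tfidf_scores /= tfidf_scores.max() + 1e-9

    # Normalized sentence length in tokens.
    lengths = np.array([len(s.split()) for s in sentences], dtype=float)
    lengths /= lengths.max() + 1e-9

    # Enhanced score: weighted combination of normalized TF-IDF and length
    # (the value of alpha here is illustrative, not taken from the paper).
    scores = alpha * tfidf_scores + (1 - alpha) * lengths

    # Fixed-count (top_k) or percentage-based (top_pct) selection.
    k = top_k if top_k is not None else max(1, int(len(sentences) * top_pct))
    keep = sorted(np.argsort(scores)[::-1][:k])
    return [sentences[i] for i in keep]

# The reduced text would then be fed to a BERT-style classifier
# (MahaBERT-v2 in the paper) in place of the full document.
reduced = rank_sentences(
    ["sentence one ...", "sentence two ...", "sentence three ..."],
    top_pct=0.5,
)
```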
Related papers
- Document-Level Tabular Numerical Cross-Checking: A Coarse-to-Fine Approach [27.581678327762003]
Numerical consistency across tables in disclosure documents is critical for ensuring accuracy, maintaining credibility, and avoiding reputational and economic risks. This paper introduces CoFiTCheck, a novel framework that addresses these challenges through two sequential stages: embedding-based filtering and discriminative classification. CoFiTCheck significantly outperforms previous methods while maintaining practical efficiency.
arXiv Detail & Related papers (2025-06-16T10:17:21Z)
- Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection [52.716143424856185]
We propose LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection. LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors. Our method also outperforms greedy search in attribution efficiency, being 1.6 times faster.
arXiv Detail & Related papers (2025-04-01T06:58:15Z)
- Enhanced Retrieval of Long Documents: Leveraging Fine-Grained Block Representations with Large Language Models [24.02950598944251]
We introduce a novel, fine-grained approach aimed at enhancing the accuracy of relevance scoring for long documents. Our methodology first segments a long document into blocks, each of which is embedded using an LLM. We aggregate the query-block relevance scores through a weighted sum, yielding a comprehensive score for the query against the entire document.
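A rough sketch of the block-level aggregation just described, assuming cosine similarity between a query embedding and block embeddings and a softmax-style weighted sum; the function name and the temperature parameter are hypothetical, not from the paper.

```python
import numpy as np

def score_document(query_vec, block_vecs, temperature=0.1):
    """Aggregate per-block relevance into a single query-document score."""
    block_vecs = np.asarray(block_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)

    # Cosine similarity between the query embedding and each block embedding.
    sims = block_vecs @ query_vec / (
        np.linalg.norm(block_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )

    # Weighted sum: softmax weights let highly relevant blocks dominate.
    weights = np.exp(sims / temperature)
    weights /= weights.sum()
    return float(weights @ sims)
```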
arXiv Detail & Related papers (2025-01-28T16:03:52Z)
- Low-Resource Fast Text Classification Based on Intra-Class and Inter-Class Distance Calculation [1.0291559330120414]
We propose a low-resource and fast text classification model called LFTC. Our approach begins by constructing a compressor list for each class to fully mine the regularity information within intra-class data. We evaluate LFTC on 9 publicly available benchmark datasets, and the results demonstrate significant improvements in performance and processing time.
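A loose illustration of compressor-based classification in the spirit of LFTC's per-class compressors, using gzip and normalized compression distance as stand-ins; the paper's actual compressor-list construction and distance calculation may differ.

```python
import gzip

def compressed_len(text: str) -> int:
    return len(gzip.compress(text.encode("utf-8")))

def classify(doc: str, class_texts: dict) -> str:
    """Assign the class whose reference text compresses best together with doc."""
    best_cls, best_dist = None, float("inf")
    c_doc = compressed_len(doc)
    for cls, ref in class_texts.items():
        c_ref = compressed_len(ref)
        c_joint = compressed_len(ref + " " + doc)
        # Normalized compression distance: small when the document shares
        # regularities (vocabulary, phrasing) with the class's intra-class data.
        ncd = (c_joint - min(c_ref, c_doc)) / max(c_ref, c_doc)
        if ncd < best_dist:
            best_cls, best_dist = cls, ncd
    return best_cls
```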
arXiv Detail & Related papers (2024-12-13T07:22:13Z)
- Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed.
We use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value.
At inference, the query is compared with these centroids to identify the important keys; we then compute exact attention using only those keys from the fixed context, thereby reducing bandwidth and computational costs.
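A toy sketch of the centroid-lookup-then-exact-attention flow described above; the cluster count, the top-cluster cutoff, and the use of scikit-learn's KMeans are assumptions for illustration, not the paper's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_centroids(fixed_keys, n_clusters=32):
    """Offline: cluster the fixed-context keys by semantic similarity."""
    km = KMeans(n_clusters=n_clusters, n_init="auto").fit(fixed_keys)
    return km.cluster_centers_, km.labels_

def squeezed_attention(query, fixed_keys, fixed_values, centroids, labels, top_clusters=4):
    """Online: exact attention restricted to keys from the most relevant clusters."""
    # Compare the query with the cluster centroids to pick promising clusters.
    cluster_scores = centroids @ query
    keep = np.argsort(cluster_scores)[::-1][:top_clusters]
    mask = np.isin(labels, keep)

    # Exact softmax attention over the surviving keys only.
    keys, values = fixed_keys[mask], fixed_values[mask]
    logits = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values
```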
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
- ChuLo: Chunk-Level Key Information Representation for Long Document Processing [11.29459225491404]
ChuLo is a novel chunk representation method for long document understanding. Our approach minimizes information loss and improves the efficiency of Transformer-based models.
arXiv Detail & Related papers (2024-10-14T22:06:54Z)
- Zero-Shot Listwise Document Reranking with a Large Language Model [58.64141622176841]
We propose Listwise Reranker with a Large Language Model (LRL), which achieves strong reranking effectiveness without using any task-specific training data.
Experiments on three TREC web search datasets demonstrate that LRL not only outperforms zero-shot pointwise methods when reranking first-stage retrieval results, but can also act as a final-stage reranker.
arXiv Detail & Related papers (2023-05-03T14:45:34Z)
- Distant finetuning with discourse relations for stance classification [55.131676584455306]
We propose a new method to extract data with silver labels from raw text to finetune a model for stance classification.
We also propose a 3-stage training framework where the noise level in the data used for finetuning decreases across stages.
Our approach ranks 1st among 26 competing teams in the stance classification track of the NLPCC 2021 shared task Argumentative Text Understanding for AI Debater.
arXiv Detail & Related papers (2022-04-27T04:24:35Z)
- Factual Error Correction for Abstractive Summaries Using Entity Retrieval [57.01193722520597]
We propose RFEC, an efficient factual error correction system based on an entity-retrieval post-editing process.
RFEC retrieves the evidence sentences from the original document by comparing the sentences with the target summary.
Next, RFEC detects the entity-level errors in the summaries by considering the evidence sentences and substitutes the wrong entities with the accurate entities from the evidence sentences.
arXiv Detail & Related papers (2022-04-18T11:35:02Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidence in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document, where the top level captures long-range dependencies.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- Automating Document Classification with Distant Supervision to Increase the Efficiency of Systematic Reviews [18.33687903724145]
Well-done systematic reviews are expensive, time-demanding, and labor-intensive.
We propose an automatic document classification approach to significantly reduce the effort in reviewing documents.
arXiv Detail & Related papers (2020-12-09T22:45:40Z)
- Semantic Sensitive TF-IDF to Determine Word Relevance in Documents [0.0]
We propose STF-IDF, a novel semantic method based on TF-IDF, for scoring word importance of informal documents in a corpus.
Our method decreased the TF-IDF mean error rate by 50%, reaching a mean error of 13.7%, as opposed to 27.2% for the original TF-IDF.
arXiv Detail & Related papers (2020-01-06T00:23:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.