Accelerating Text Mining Using Domain-Specific Stop Word Lists
- URL: http://arxiv.org/abs/2012.02294v1
- Date: Wed, 18 Nov 2020 17:42:32 GMT
- Title: Accelerating Text Mining Using Domain-Specific Stop Word Lists
- Authors: Farah Alshanik, Amy Apon, Alexander Herzog, Ilya Safro, Justin
Sybrandt
- Abstract summary: We present a novel approach for the automatic extraction of domain-specific words called the hyperplane-based approach.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
- Score: 57.76576681191192
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text preprocessing is an essential step in text mining. Removing words that
can negatively impact the quality of prediction algorithms or are not
informative enough is a crucial storage-saving technique in text indexing and
results in improved computational efficiency. Typically, a generic stop word
list is applied to a dataset regardless of the domain. However, the words that
are common, and hence uninformative, differ from one domain to another.
Eliminating domain-specific common words in a corpus
reduces the dimensionality of the feature space, and improves the performance
of text mining tasks. In this paper, we present a novel mathematical approach
for the automatic extraction of domain-specific words called the
hyperplane-based approach. This new approach relies on a low-dimensional
representation of words in a vector space and their distance from a
hyperplane. The hyperplane-based approach can significantly reduce text
dimensionality by eliminating irrelevant features. We compare the
hyperplane-based approach with other feature selection methods, namely χ2
(chi-square) and mutual information. An experimental study is performed on
three different datasets and five classification algorithms, and we measure
the dimensionality
reduction and the increase in the classification performance. Results indicate
that the hyperplane-based approach can reduce the dimensionality of the corpus
by 90% and outperforms mutual information. The computational time needed to
identify the domain-specific words is significantly lower than that of mutual
information.
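The abstract above does not spell out the construction, so the following is a minimal sketch of one plausible reading of the hyperplane-based approach: score each word vector by its distance to a separating hyperplane (here, a scikit-learn LinearSVC decision boundary over placeholder embeddings) and flag the words nearest the hyperplane, which are the least discriminative, as candidate domain-specific stop words. The embeddings, labels, and cutoff k are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Placeholder 50-d vectors standing in for word embeddings trained on the
# corpus; real usage would load word2vec/GloVe vectors for the vocabulary.
vocab = [f"word_{i}" for i in range(200)]
X = rng.normal(size=(200, 50))
# Placeholder labels, e.g., 1 = word frequent in domain documents, 0 = not.
y = (rng.random(200) > 0.5).astype(int)

svm = LinearSVC(C=1.0).fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]

# Distance of each word vector to the hyperplane w·x + b = 0.
dist = np.abs(X @ w + b) / np.linalg.norm(w)

# Words nearest the hyperplane are the least discriminative under this
# reading and become candidate domain-specific stop words.
k = 20
stop_candidates = [vocab[i] for i in np.argsort(dist)[:k]]
print(stop_candidates)
```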
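The χ2 and mutual-information baselines the abstract compares against are available directly in scikit-learn; the toy corpus and labels below are placeholders for the paper's three datasets, not data from the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2, mutual_info_classif

docs = ["stent artery cardiac", "cardiac patient stent",
        "court ruling appeal", "appeal court verdict"]
labels = [0, 0, 1, 1]

vec = CountVectorizer()
X = vec.fit_transform(docs)       # document-term count matrix
terms = vec.get_feature_names_out()

chi2_scores, _ = chi2(X, labels)
mi_scores = mutual_info_classif(X, labels, discrete_features=True)

# Low-scoring terms under either criterion are candidates for removal.
print([terms[i] for i in np.argsort(chi2_scores)])
print([terms[i] for i in np.argsort(mi_scores)])
```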
Related papers
- LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D segmentation is a vision-language task that segments all points of the object specified by a sentence query from a 3D point cloud.
We propose a novel Referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z)
- Word Embedding Dimension Reduction via Weakly-Supervised Feature Selection [34.217661429283666]
As the vocabulary grows, the vector space's dimension increases, which can lead to a vast model size.
This paper explores word embedding dimension reduction.
We propose an efficient and effective weakly-supervised feature selection method named WordFS.
arXiv Detail & Related papers (2024-07-17T06:36:09Z)
- Lightweight Conceptual Dictionary Learning for Text Classification Using Information Compression [15.460141768587663]
We propose a lightweight supervised dictionary learning framework for text classification based on data compression and representation.
We evaluate our algorithm's information-theoretic performance using information bottleneck principles and introduce the information plane area rank (IPAR) as a novel metric to quantify that performance.
arXiv Detail & Related papers (2024-04-28T10:11:52Z)
- Unsupervised Domain Adaptation for Sparse Retrieval by Filling Vocabulary and Word Frequency Gaps [12.573927420408365]
IR models using a pretrained language model significantly outperform lexical approaches like BM25.
This paper proposes an unsupervised domain adaptation method by filling vocabulary and word-frequency gaps.
We show that our method outperforms the present state-of-the-art domain adaptation method.
arXiv Detail & Related papers (2022-11-08T03:58:26Z)
- Word Embeddings and Validity Indexes in Fuzzy Clustering [5.063728016437489]
This paper presents a fuzzy-based analysis of various vector representations of words, i.e., word embeddings.
We use two popular fuzzy clustering algorithms on count-based word embeddings, with different methods and dimensionality.
We evaluate the experimental results with various clustering validity indexes to compare the accuracy of different algorithm variants with different embeddings.
arXiv Detail & Related papers (2022-04-26T18:08:19Z)
- Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success and semantics-preservation rates while changing the smallest number of words compared with existing methods.
arXiv Detail & Related papers (2021-08-23T09:05:18Z)
- Group-Sparse Matrix Factorization for Transfer Learning of Word Embeddings [31.849734024331283]
We propose an intuitive estimator that exploits structure via a group-sparse penalty to efficiently transfer-learn domain-specific word embeddings.
We prove that all local minima identified by our non-convex objective function are statistically indistinguishable from the minimum under standard regularization conditions.
arXiv Detail & Related papers (2021-04-18T18:19:03Z)
- PointFlow: Flowing Semantics Through Points for Aerial Image Segmentation [96.76882806139251]
We propose a point-wise affinity propagation module based on the Feature Pyramid Network (FPN) framework, named PointFlow.
Rather than dense affinity learning, a sparse affinity map is generated upon selected points between the adjacent features.
Experimental results on three different aerial segmentation datasets suggest that the proposed method is more effective and efficient than state-of-the-art general semantic segmentation methods.
arXiv Detail & Related papers (2021-03-11T09:42:32Z)
- Text Information Aggregation with Centrality Attention [86.91922440508576]
We propose a new way of obtaining aggregation weights, called eigen-centrality self-attention.
We build a fully-connected graph over all the words in a sentence, then compute the eigen-centrality of each word as its attention score (see the sketch after this list).
arXiv Detail & Related papers (2020-11-16T13:08:48Z)
- Affinity Space Adaptation for Semantic Segmentation Across Domains [57.31113934195595]
In this paper, we address the problem of unsupervised domain adaptation (UDA) in semantic segmentation.
Motivated by the fact that the source and target domains have invariant semantic structures, we propose to exploit such invariance across domains.
We develop two affinity space adaptation strategies: affinity space cleaning and adversarial affinity space alignment.
arXiv Detail & Related papers (2020-09-26T10:28:11Z)
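As referenced in the centrality-attention entry above, here is a minimal sketch of the eigen-centrality idea: treat the principal eigenvector of a non-negative word-similarity graph as per-word attention weights, computed by power iteration. The similarity function, normalization, and iteration count are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.normal(size=(6, 32))  # placeholder representations for 6 words

# Non-negative edge weights for a fully connected word graph.
A = np.exp(H @ H.T / np.sqrt(H.shape[1]))
np.fill_diagonal(A, 0.0)

# Power iteration: the principal eigenvector of A is the eigen-centrality.
c = np.ones(A.shape[0])
for _ in range(100):
    c = A @ c
    c /= np.linalg.norm(c)

attn = c / c.sum()       # normalize centralities into attention weights
sentence_vec = attn @ H  # centrality-weighted aggregation of word vectors
print(attn, sentence_vec.shape)
```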
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.