Text classification with word embedding regularization and soft
similarity measure
- URL: http://arxiv.org/abs/2003.05019v1
- Date: Tue, 10 Mar 2020 22:07:34 GMT
- Title: Text classification with word embedding regularization and soft
similarity measure
- Authors: Vít Novotný, Eniafe Festus Ayetiran, Michal Štefánik, and Petr Sojka
- Abstract summary: Two word embedding regularization techniques were shown to reduce storage and memory costs, and to improve training speed, document processing speed, and task performance.
We show 39% average $k$NN test error reduction with regularized word embeddings compared to non-regularized word embeddings.
We also show that the SCM with regularized word embeddings significantly outperforms the WMD on text classification and is over 10,000 times faster.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since the seminal work of Mikolov et al., word embeddings have become the
preferred word representations for many natural language processing tasks.
Document similarity measures extracted from word embeddings, such as the soft
cosine measure (SCM) and the Word Mover's Distance (WMD), were reported to
achieve state-of-the-art performance on semantic text similarity and text
classification.
Despite the strong performance of the WMD on text classification and semantic
text similarity, its super-cubic average time complexity makes it impractical. The
SCM has quadratic worst-case time complexity, but its performance on text
classification has never been compared with the WMD. Recently, two word
embedding regularization techniques were shown to reduce storage and memory
costs, and to improve training speed, document processing speed, and task
performance on word analogy, word similarity, and semantic text similarity.
However, the effect of these techniques on text classification has not yet been
studied.
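The WMD discussed above is an optimal-transport problem: mass from one document's normalized bag-of-words (nBOW) distribution is moved to the other's, at a cost given by distances between word embeddings. The following is a minimal sketch of that formulation using SciPy's linear-programming solver with toy embeddings, not the authors' implementation or an efficient solver:

```python
import numpy as np
from scipy.optimize import linprog

def wmd(a, b, C):
    """Word Mover's Distance between nBOW distributions a (m,) and b (n,),
    with word-to-word travel cost matrix C (m, n), solved as a transport LP."""
    m, n = C.shape
    A_eq = []
    for i in range(m):                       # outgoing flow of word i equals a[i]
        row = np.zeros((m, n)); row[i, :] = 1.0; A_eq.append(row.ravel())
    for j in range(n):                       # incoming flow to word j equals b[j]
        col = np.zeros((m, n)); col[:, j] = 1.0; A_eq.append(col.ravel())
    res = linprog(C.ravel(), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([a, b]), bounds=(0, None), method="highs")
    return res.fun

# Toy embeddings for a three-word vocabulary; costs are Euclidean distances.
emb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
C = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
a = np.array([0.5, 0.25, 0.25])
print(round(wmd(a, a, C), 6))  # → 0.0: identical documents need no mass moved
```

The number of LP variables grows with the product of the two vocabularies, which is why exact WMD is so much slower than the SCM in practice.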
In our work, we investigate the individual and joint effect of the two word
embedding regularization techniques on the document processing speed and the
task performance of the SCM and the WMD on text classification. For evaluation,
we use the $k$NN classifier and six standard datasets: BBCSPORT, TWITTER,
OHSUMED, REUTERS-21578, AMAZON, and 20NEWS.
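The $k$NN evaluation described above amounts to a majority vote over the training documents most similar to each test document. A toy illustration with a precomputed similarity matrix (invented data, not the paper's exact setup):

```python
import numpy as np

def knn_predict(sim, train_labels, k=3):
    """Label each query row of a (queries x train) similarity matrix by
    majority vote over its k most similar training documents."""
    preds = []
    for row in sim:
        top = np.argsort(row)[::-1][:k]               # k nearest neighbours
        votes, counts = np.unique(train_labels[top], return_counts=True)
        preds.append(votes[np.argmax(counts)])
    return np.array(preds)

# Toy similarity matrix: query 0 resembles class "A" docs, query 1 class "B".
sim = np.array([[0.9, 0.8, 0.1, 0.2],
                [0.1, 0.2, 0.95, 0.7]])
labels = np.array(["A", "A", "B", "B"])
print(knn_predict(sim, labels, k=3))  # → ['A' 'B']
```

In the paper's setting, the similarity (or distance) matrix would come from the SCM or the WMD rather than being given directly.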
We show 39% average $k$NN test error reduction with regularized word
embeddings compared to non-regularized word embeddings. We describe a practical
procedure for deriving such regularized embeddings through Cholesky
factorization. We also show that the SCM with regularized word embeddings
significantly outperforms the WMD on text classification and is over 10,000
times faster.
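The Cholesky-based procedure mentioned above can be illustrated as follows: if the word-similarity matrix $S$ is positive definite and factorized as $S = LL^\top$, then $x^\top S y = (L^\top x) \cdot (L^\top y)$, so the SCM reduces to an ordinary cosine similarity over transformed document vectors. A toy NumPy sketch under that assumption (random embeddings, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                                  # toy vocabulary size, embedding dim
E = rng.normal(size=(V, d))
E /= np.linalg.norm(E, axis=1, keepdims=True)

# Word-similarity matrix from embedding dot products; a small ridge keeps it
# positive definite so that the Cholesky factorization below succeeds.
S = E @ E.T + 1e-6 * np.eye(V)

def scm(x, y, S):
    """Soft cosine measure between bag-of-words document vectors x and y."""
    return (x @ S @ y) / np.sqrt((x @ S @ x) * (y @ S @ y))

# Cholesky trick: with S = L L^T, x^T S y = (L^T x) . (L^T y), so the SCM
# becomes a plain cosine similarity over documents transformed by L^T.
L = np.linalg.cholesky(S)
def scm_via_cholesky(x, y, L):
    xt, yt = L.T @ x, L.T @ y
    return (xt @ yt) / (np.linalg.norm(xt) * np.linalg.norm(yt))

x = np.array([1.0, 0.0, 2.0, 0.0, 1.0])
y = np.array([0.0, 1.0, 1.0, 1.0, 0.0])
print(np.isclose(scm(x, y, S), scm_via_cholesky(x, y, L)))  # → True
```

Precomputing $L^\top x$ for every document turns each SCM query into a dense dot product, which helps explain the large speed advantage over the WMD.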
Related papers
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis.
Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z)
- Integrating Bidirectional Long Short-Term Memory with Subword Embedding for Authorship Attribution [2.3429306644730854]
A variety of word-based stylistic markers have been successfully used in deep learning methods to deal with the intrinsic problem of authorship attribution.
The proposed method was experimentally evaluated against numerous state-of-the-art methods on the public corpora CCAT50, IMDb62, Blog50, and Twitter50.
arXiv Detail & Related papers (2023-06-26T11:35:47Z)
- SynWMD: Syntax-aware Word Mover's Distance for Sentence Similarity Evaluation [36.5590780726458]
Word Mover's Distance (WMD) computes the distance between words and models text similarity with the moving cost between words in two text sequences.
In this work, an improved WMD method using the syntactic parse tree, called Syntax-aware Word Mover's Distance (SynWMD), is proposed to address these shortcomings.
arXiv Detail & Related papers (2022-06-20T22:30:07Z)
- Many-Class Text Classification with Matching [65.74328417321738]
We formulate Text Classification as a Matching problem between the text and the labels, and propose a simple yet effective framework named TCM.
Compared with previous text classification approaches, TCM takes advantage of the fine-grained semantic information of the classification labels.
arXiv Detail & Related papers (2022-05-23T15:51:19Z)
- FastKASSIM: A Fast Tree Kernel-Based Syntactic Similarity Metric [48.66580267438049]
We present FastKASSIM, a metric for utterance- and document-level syntactic similarity.
It pairs and averages the most similar dependency parse trees between a pair of documents based on tree kernels.
It runs up to 5.2 times faster than our baseline method over the documents in the r/ChangeMyView corpus.
arXiv Detail & Related papers (2022-03-15T22:33:26Z)
- Divide and Conquer: Text Semantic Matching with Disentangled Keywords and Intents [19.035917264711664]
We propose a training strategy for text semantic matching by disentangling keywords from intents.
Our approach can be easily combined with pre-trained language models (PLM) without influencing their inference efficiency.
arXiv Detail & Related papers (2022-03-06T07:48:24Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success rates and semantics rates by changing the smallest number of words compared with existing methods.
arXiv Detail & Related papers (2021-08-23T09:05:18Z)
- EDS-MEMBED: Multi-sense embeddings based on enhanced distributional semantic structures via a graph walk over word senses [0.0]
We leverage the rich semantic structures in WordNet to enhance the quality of multi-sense embeddings.
We derive new distributional semantic similarity measures for M-SE from prior ones.
We report evaluation results on 11 benchmark datasets involving WSD and Word Similarity tasks.
arXiv Detail & Related papers (2021-02-27T14:36:55Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach for the automatic extraction of domain-specific words called the hyperplane-based approach.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including the generated summaries) and is not responsible for any consequences of its use.