Word Embedding Dimension Reduction via Weakly-Supervised Feature Selection
- URL: http://arxiv.org/abs/2407.12342v2
- Date: Mon, 4 Nov 2024 09:52:25 GMT
- Title: Word Embedding Dimension Reduction via Weakly-Supervised Feature Selection
- Authors: Jintang Xue, Yun-Cheng Wang, Chengwei Wei, C.-C. Jay Kuo
- Abstract summary: As the vocabulary grows, the vector space's dimension increases, which can lead to a vast model size.
This paper explores word embedding dimension reduction.
We propose an efficient and effective weakly-supervised feature selection method named WordFS.
- Score: 34.217661429283666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a fundamental task in natural language processing, word embedding converts each word into a representation in a vector space. A challenge with word embedding is that as the vocabulary grows, the vector space's dimension increases, which can lead to a vast model size. Storing and processing word vectors are resource-demanding, especially for mobile edge-device applications. This paper explores word embedding dimension reduction. To balance computational costs and performance, we propose an efficient and effective weakly-supervised feature selection method named WordFS. It has two variants, each utilizing novel criteria for feature selection. Experiments on various tasks (e.g., word and sentence similarity and binary and multi-class classification) indicate that the proposed WordFS model outperforms other dimension reduction methods at lower computational costs. We have released the code for reproducibility along with the paper.
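The abstract does not spell out the two WordFS selection criteria, so the sketch below only illustrates the general shape of weakly-supervised feature selection over embedding dimensions: score each dimension against a small set of weakly labelled word pairs and keep the top-k. The scoring rule, the pair source, and all names here are illustrative assumptions, not the WordFS algorithm itself.

```python
import numpy as np

def select_dimensions(emb, pairs, labels, k):
    """Rank embedding dimensions by how well each one separates similar
    from dissimilar word pairs, then keep the top-k dimensions.

    emb    : (vocab_size, d) word-embedding matrix
    pairs  : list of (i, j) row indices of weakly labelled word pairs
    labels : 1 (similar) / 0 (dissimilar) for each pair
    """
    labels = np.asarray(labels, dtype=float)
    scores = np.empty(emb.shape[1])
    for dim in range(emb.shape[1]):
        # per-dimension agreement between the two words of each pair
        agree = np.array([emb[i, dim] * emb[j, dim] for i, j in pairs])
        # score the dimension by correlation with the weak similarity labels
        scores[dim] = abs(np.corrcoef(agree, labels)[0, 1])
    keep = np.argsort(scores)[::-1][:k]
    return emb[:, keep], keep

# e.g. reduce 300-d vectors to 64-d using 1,000 weakly labelled pairs
rng = np.random.default_rng(0)
emb = rng.normal(size=(5000, 300))
pairs = [tuple(rng.integers(0, 5000, size=2)) for _ in range(1000)]
labels = rng.integers(0, 2, size=1000)
reduced, kept = select_dimensions(emb, pairs, labels, k=64)
print(reduced.shape)  # (5000, 64)
```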
Related papers
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g., document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z) - Frequency-aware Dimension Selection for Static Word Embedding by Mixed
- Frequency-aware Dimension Selection for Static Word Embedding by Mixed Product Distance [22.374525706652207]
This paper proposes a metric (Mixed Product Distance, MPD) to select a proper dimension for word embedding algorithms without training any word embedding.
Experiments on both context-unavailable and context-available tasks demonstrate the better efficiency-performance trade-off of our MPD-based dimension selection method over baselines.
arXiv Detail & Related papers (2023-05-13T02:53:37Z) - Searching for Discriminative Words in Multidimensional Continuous
Feature Space [0.0]
We propose a novel method to extract discriminative keywords from documents.
We show how different discriminative metrics influence the overall results.
We conclude that word feature vectors can substantially improve the topical inference of documents' meaning.
arXiv Detail & Related papers (2022-11-26T18:05:11Z) - Exploring Dimensionality Reduction Techniques in Multilingual
Transformers [64.78260098263489]
This paper gives a comprehensive account of the impact of dimensionality reduction techniques on the performance of state-of-the-art multilingual Siamese Transformers.
It shows that it is possible to achieve an average reduction in the number of dimensions of $91.58\% \pm 2.59\%$ and $54.65\% \pm 32.20\%$, respectively.
arXiv Detail & Related papers (2022-04-18T17:20:55Z) - Deriving Word Vectors from Contextualized Language Models using
- Deriving Word Vectors from Contextualized Language Models using Topic-Aware Mention Selection [46.97185212695267]
We propose a method for learning word representations that follows the basic strategy of summarizing the sentence contexts in which a word is mentioned.
We take advantage of contextualized language models (CLMs) rather than bags of word vectors to encode contexts.
We show that this simple strategy leads to high-quality word vectors, which are more predictive of semantic properties than word embeddings and existing CLM-based strategies.
arXiv Detail & Related papers (2021-06-15T08:02:42Z) - Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach for the automatic extraction of domain-specific words called the hyperplane-based approach.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z) - Robust and Consistent Estimation of Word Embedding for Bangla Language
- Robust and Consistent Estimation of Word Embedding for Bangla Language by fine-tuning Word2Vec Model [1.2691047660244335]
We analyze the word2vec model for learning word vectors and present the most effective word embedding for the Bangla language.
We cluster the word vectors to examine the relational similarity of words for intrinsic evaluation, and use different word embeddings as features of news articles for extrinsic evaluation.
arXiv Detail & Related papers (2020-10-26T08:00:48Z) - All Word Embeddings from One Embedding [23.643059189673473]
- All Word Embeddings from One Embedding [23.643059189673473]
In neural network-based models for natural language processing, the largest part of the parameters often consists of word embeddings.
In this study, to reduce the total number of parameters, the embeddings for all words are represented by transforming a shared embedding.
The proposed method, ALONE, constructs the embedding of a word by modifying the shared embedding with a filter vector, which is word-specific but non-trainable.
arXiv Detail & Related papers (2020-04-25T07:38:08Z) - Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies [60.285091454321055]
- Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies [60.285091454321055]
We design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix.
On text classification, language modeling, and movie recommendation benchmarks, we show that ANT is particularly suitable for large vocabulary sizes.
arXiv Detail & Related papers (2020-03-18T13:07:51Z)