Searching for Discriminative Words in Multidimensional Continuous Feature Space
- URL: http://arxiv.org/abs/2211.14631v1
- Date: Sat, 26 Nov 2022 18:05:11 GMT
- Title: Searching for Discriminative Words in Multidimensional Continuous Feature Space
- Authors: Marius Sajgalik and Michal Barla and Maria Bielikova
- Abstract summary: We propose a novel method to extract discriminative keywords from documents.
We show how different discriminative metrics influence the overall results.
We conclude that word feature vectors can substantially improve the topical inference of documents' meaning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Word feature vectors have been proven to improve many NLP tasks. With recent advances in unsupervised learning of these feature vectors, it became possible to train them on much more data, which also improved the quality of the learned features. Since such models learn the joint probability of latent word features, they can be trained without any prior knowledge of the target task. We aim to evaluate the universal applicability of feature vectors, which has already been shown to hold for many standard NLP tasks like part-of-speech tagging or syntactic parsing. In our case, we want to understand the topical focus of text documents and design an efficient representation suitable for discriminating between different topics. This discriminativeness can be evaluated adequately on the text categorisation task. We propose a novel method to extract discriminative keywords from documents. We utilise word feature vectors to better capture the relations between words and to recover latent topics that are discussed in the text without being mentioned directly, but inferred logically. We also present a simple way to calculate document feature vectors from the extracted discriminative words. We evaluate our method on the four most popular datasets for text categorisation and show how different discriminative metrics influence the overall results. We demonstrate the effectiveness of our approach by achieving state-of-the-art results on the text categorisation task using only a small number of extracted keywords. We show that word feature vectors can substantially improve the topical inference of documents' meaning, and we conclude that distributed word representations can be used to build higher levels of abstraction, as we demonstrate by building feature vectors of documents.
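The abstract describes the pipeline only in prose. Below is a minimal sketch of the idea, under assumptions not stated above: pretrained word vectors (e.g. word2vec/GloVe-style) are available, and a TF-IDF-style score stands in as a hypothetical placeholder for the discriminative metrics the paper actually compares.

```python
# Hedged sketch (not the authors' exact method): score words with a
# discriminative metric, keep the top-k keywords per document, and average
# their pretrained word vectors into a document feature vector.
from collections import Counter
import math
import numpy as np

def top_discriminative_words(tokens, doc_freq, n_docs, k=10):
    """Rank words by a TF-IDF-style score (a stand-in for the paper's metrics)."""
    tf = Counter(tokens)
    scores = {w: c * math.log(n_docs / (1 + doc_freq.get(w, 0))) for w, c in tf.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

def document_vector(keywords, word_vectors, dim=300):
    """Average the feature vectors of the extracted keywords."""
    vecs = [word_vectors[w] for w in keywords if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Usage (assumes `word_vectors` maps words to numpy arrays, `docs` is a list of
# token lists and `labels` holds each document's category):
# doc_freq = Counter(w for d in docs for w in set(d))
# X = np.stack([document_vector(top_discriminative_words(d, doc_freq, len(docs)),
#                               word_vectors) for d in docs])
# from sklearn.linear_model import LogisticRegression
# clf = LogisticRegression(max_iter=1000).fit(X, labels)
```

Document vectors built this way can be fed to any standard classifier, which is how a text categorisation evaluation in the spirit of the abstract could be run.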
Related papers
- Word Embedding Dimension Reduction via Weakly-Supervised Feature Selection [34.217661429283666]
As the vocabulary grows, the vector space's dimension increases, which can lead to a vast model size.
This paper explores word embedding dimension reduction.
We propose an efficient and effective weakly-supervised feature selection method named WordFS.
arXiv Detail & Related papers (2024-07-17T06:36:09Z)
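WordFS itself is not described in detail above; the sketch below shows one generic form of weakly-supervised feature selection over embedding dimensions, scoring each dimension with an ANOVA F-test against coarse word labels. The scoring criterion and helper names are assumptions for illustration, not the paper's actual procedure.

```python
# Hedged sketch of weakly-supervised dimension selection for word embeddings
# (illustrative criterion, not WordFS's actual method).
from sklearn.feature_selection import SelectKBest, f_classif

def select_dimensions(labelled_vectors, weak_labels, k=100):
    """Keep the k embedding dimensions that best separate weakly labelled words.

    labelled_vectors -- (n_words, dim) vectors of words carrying weak labels
    weak_labels      -- (n_words,) coarse class ids from weak supervision
    """
    selector = SelectKBest(score_func=f_classif, k=k).fit(labelled_vectors, weak_labels)
    return selector.get_support(indices=True)  # indices of retained dimensions

# Usage: project the full embedding matrix onto the selected dimensions.
# keep = select_dimensions(labelled_vecs, labels, k=100)
# reduced_embeddings = full_embedding_matrix[:, keep]
```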
- Span-Aggregatable, Contextualized Word Embeddings for Effective Phrase Mining [0.22499166814992438]
We show that when target phrases reside inside noisy context, representing the full sentence with a single dense vector is not sufficient for effective phrase retrieval.
We show that aggregating contextualized word embeddings over candidate spans is much more effective for phrase mining, yet requires considerable compute to obtain useful span representations.
arXiv Detail & Related papers (2024-05-12T12:08:05Z)
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
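To make the granularity point concrete, the hedged sketch below indexes a corpus at a finer unit than whole passages and retrieves by cosine similarity. Sentences stand in for the paper's propositions (which are produced by a dedicated model), and `embed` is an assumed text encoder returning a 1-D numpy vector.

```python
# Hedged sketch: index fine-grained units instead of whole passages and retrieve
# them by cosine similarity. Sentence splitting is a proxy for proposition
# extraction; `embed` is an assumed sentence encoder.
import re
import numpy as np

def build_index(passages, embed):
    units, vectors = [], []
    for pid, passage in enumerate(passages):
        for sent in re.split(r"(?<=[.!?])\s+", passage.strip()):
            if sent:
                units.append((pid, sent))
                vectors.append(embed(sent))
    return units, np.stack(vectors)

def search(query, units, vectors, embed, top_k=5):
    q = embed(query)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    return [units[i] for i in np.argsort(-sims)[:top_k]]  # (passage id, unit) pairs
```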
- Tsetlin Machine Embedding: Representing Words Using Logical Expressions [10.825099126920028]
We introduce a Tsetlin Machine-based autoencoder that learns logical clauses in a self-supervised manner.
The clauses consist of contextual words like "black," "cup," and "hot" that define other words like "coffee."
We evaluate our embedding approach on several intrinsic and extrinsic benchmarks, outperforming GloVe on six classification tasks.
arXiv Detail & Related papers (2023-01-02T15:02:45Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the truth of the hypothesis follows from the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
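The element-wise Manhattan distance feature mentioned above is simple to sketch. Assuming some sentence encoder `encode` (not specified here) produces fixed-size vectors for the text and the hypothesis, the feature is the per-dimension absolute difference, fed to a standard classifier.

```python
# Hedged sketch: element-wise Manhattan distance feature for entailment
# classification; `encode` is an assumed sentence encoder.
import numpy as np

def manhattan_feature(text_vec, hyp_vec):
    """Per-dimension absolute difference between text and hypothesis vectors."""
    return np.abs(text_vec - hyp_vec)

# Usage (assumes `pairs` holds (text, hypothesis) strings and `labels` the
# entailment labels):
# X = np.stack([manhattan_feature(encode(t), encode(h)) for t, h in pairs])
# from sklearn.linear_model import LogisticRegression
# clf = LogisticRegression(max_iter=1000).fit(X, labels)
```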
- Deriving Word Vectors from Contextualized Language Models using Topic-Aware Mention Selection [46.97185212695267]
We propose a method for learning word representations that follows the basic strategy of deriving a word's vector from contextualized representations of selected mentions of that word.
We take advantage of contextualized language models (CLMs) rather than bags of word vectors to encode contexts.
We show that this simple strategy leads to high-quality word vectors, which are more predictive of semantic properties than word embeddings and existing CLM-based strategies.
arXiv Detail & Related papers (2021-06-15T08:02:42Z)
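The underlying strategy of deriving a static word vector from a contextualized language model can be sketched as below. The uniform mention sampling is an assumption standing in for the paper's topic-aware mention selection, and `contextual_vector` is an assumed helper (e.g. wrapping a transformer encoder) that returns the contextual embedding of one word occurrence.

```python
# Hedged sketch: build a static word vector by averaging contextualized
# representations of sampled mentions (uniform sampling here, not the paper's
# topic-aware selection). `contextual_vector(tokens, position)` is assumed.
import random
import numpy as np

def word_vector_from_mentions(word, corpus, contextual_vector, max_mentions=50):
    mentions = [(toks, i) for toks in corpus
                for i, tok in enumerate(toks) if tok == word]
    if not mentions:
        return None
    sample = random.sample(mentions, min(max_mentions, len(mentions)))
    return np.mean([contextual_vector(toks, i) for toks, i in sample], axis=0)
```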
- EDS-MEMBED: Multi-sense embeddings based on enhanced distributional semantic structures via a graph walk over word senses [0.0]
We leverage the rich semantic structures in WordNet to enhance the quality of multi-sense embeddings (M-SE).
We derive new distributional semantic similarity measures for M-SE from prior ones.
We report evaluation results on 11 benchmark datasets involving WSD and Word Similarity tasks.
arXiv Detail & Related papers (2021-02-27T14:36:55Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach, called the hyperplane-based approach, for the automatic extraction of domain-specific stop words.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
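One plausible reading of the hyperplane-based approach is sketched below as an assumption, not the paper's exact procedure: fit a linear classifier on bag-of-words features and treat the words whose weights keep them closest to the separating hyperplane (i.e. the least discriminative ones) as candidate domain-specific stop words.

```python
# Hedged sketch (an interpretation, not necessarily the paper's exact method):
# words closest to the separating hyperplane are the least discriminative and
# become candidate domain-specific stop words.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def hyperplane_stop_words(texts, labels, drop_fraction=0.9):
    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)
    clf = LinearSVC().fit(X, labels)
    importance = np.abs(clf.coef_).max(axis=0)   # proxy for distance from hyperplane
    vocab = np.array(vec.get_feature_names_out())
    order = np.argsort(importance)               # least discriminative words first
    return set(vocab[order[:int(drop_fraction * len(vocab))]])
```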
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
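The dimension-selection idea behind intrinsic probing can be illustrated with a greedy forward search, sketched here under the assumption that token representations come annotated with some linguistic property; the greedy criterion and logistic-regression probe are simplifications, not the paper's exact method.

```python
# Hedged sketch of intrinsic probing via dimension selection: greedily pick the
# representation dimensions that let a small probe predict a linguistic property.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def greedy_dimension_selection(reps, labels, budget=10):
    selected, remaining = [], list(range(reps.shape[1]))
    for _ in range(budget):
        best_dim, best_score = None, -np.inf
        for d in remaining:
            probe = LogisticRegression(max_iter=1000)
            score = cross_val_score(probe, reps[:, selected + [d]], labels, cv=3).mean()
            if score > best_score:
                best_dim, best_score = d, score
        selected.append(best_dim)
        remaining.remove(best_dim)
    return selected  # dimensions that most strongly encode the probed property
```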
- A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.