Generating Word and Document Embeddings for Sentiment Analysis
- URL: http://arxiv.org/abs/2001.01269v2
- Date: Mon, 7 Dec 2020 18:33:45 GMT
- Title: Generating Word and Document Embeddings for Sentiment Analysis
- Authors: Cem Rıfkı Aydın, Tunga Güngör, Ali Erkan
- Abstract summary: In this paper, we combine contextual and supervised information with the general semantic representations of words occurring in the dictionary.
We induce domain-specific sentimental vectors for two corpora, which are the movie domain and the Twitter datasets in Turkish.
The results show that our approaches are cross-domain and portable to other languages.
- Score: 0.36525095710982913
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentiments of words differ from one corpus to another. Inducing general
sentiment lexicons for languages and using them cannot, in general, produce
meaningful results for different domains. In this paper, we combine contextual
and supervised information with the general semantic representations of words
occurring in the dictionary. Contexts of words help us capture the
domain-specific information and supervised scores of words are indicative of
the polarities of those words. When we combine supervised features of words
with the features extracted from their dictionary definitions, we observe an
increase in the success rates. We try out the combinations of contextual,
supervised, and dictionary-based approaches, and generate original vectors. We
also combine the word2vec approach with hand-crafted features. We induce
domain-specific sentimental vectors for two corpora, which are the movie domain
and the Twitter datasets in Turkish. When we thereafter generate document
vectors and employ the support vector machines method utilising those vectors,
our approaches perform better than the baseline studies for Turkish by a
significant margin. We evaluated our models on two English corpora as well, and
these also outperformed the word2vec approach. This shows that our approaches are
cross-domain and portable to other languages.
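As a rough illustration of the pipeline the abstract describes (word vectors combined with a supervised polarity score, then aggregated into document vectors), a minimal Python sketch follows. The toy embeddings, the smoothed log-ratio polarity score, the function names, and the use of averaging as the aggregation step are all assumptions made for illustration, not the authors' exact method. In the paper's setting the resulting document vectors would then be fed to a support vector machine.

```python
import math
from collections import Counter


def polarity_scores(docs, labels, smoothing=1.0):
    """Supervised score per word: log ratio of smoothed counts in
    positive vs. negative documents (an assumed, illustrative form)."""
    pos, neg = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (pos if label == 1 else neg).update(tokens)
    vocab = set(pos) | set(neg)
    return {
        w: math.log((pos[w] + smoothing) / (neg[w] + smoothing))
        for w in vocab
    }


def document_vector(tokens, embeddings, scores):
    """Average the [embedding ; polarity score] vectors of known words."""
    rows = [
        embeddings[w] + [scores.get(w, 0.0)]
        for w in tokens
        if w in embeddings
    ]
    if not rows:
        # No known words: return a zero vector of the right size.
        dim = len(next(iter(embeddings.values()))) + 1
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]


# Hypothetical toy data: two labelled reviews and 2-d word vectors.
train_docs = [["great", "film"], ["boring", "film"]]
train_labels = [1, 0]
toy_embeddings = {
    "great": [0.9, 0.1],
    "boring": [-0.8, 0.2],
    "film": [0.0, 0.5],
}

scores = polarity_scores(train_docs, train_labels)
doc_vec = document_vector(["great", "film"], toy_embeddings, scores)
```

Here a word that is equally frequent in both classes ("film") receives a score of zero, so only the class-specific words shift the last component of the document vector.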
Related papers
- Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
(2023-05-22T17:58:04Z)
- Tsetlin Machine Embedding: Representing Words Using Logical Expressions [10.825099126920028]
We introduce a Tsetlin Machine-based autoencoder that learns logical clauses self-supervised.
The clauses consist of contextual words like "black," "cup," and "hot" to define other words like "coffee."
We evaluate our embedding approach on several intrinsic and extrinsic benchmarks, outperforming GloVe on six classification tasks.
(2023-01-02T15:02:45Z)
- Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora [54.757845511368814]
The problem of comparing two bodies of text and searching for words that differ in their usage arises often in digital humanities and computational social science.
This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large.
We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word.
(2021-12-28T23:46:00Z)
- Deriving Word Vectors from Contextualized Language Models using Topic-Aware Mention Selection [46.97185212695267]
We propose a method for learning word representations that follows this basic strategy.
We take advantage of contextualized language models (CLMs) rather than bags of word vectors to encode contexts.
We show that this simple strategy leads to high-quality word vectors, which are more predictive of semantic properties than word embeddings and existing CLM-based strategies.
(2021-06-15T08:02:42Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
(2021-05-28T19:44:24Z)
- WOVe: Incorporating Word Order in GloVe Word Embeddings [0.0]
Defining a word as a vector makes it easy for machine learning algorithms to understand a text and extract information from it.
Word vector representations have been used in many applications, such as word synonyms, word analogy, syntactic parsing, and many others.
(2021-05-18T15:28:20Z)
- Robust and Consistent Estimation of Word Embedding for Bangla Language by fine-tuning Word2Vec Model [1.2691047660244335]
We analyze the word2vec model for learning word vectors and present the most effective word embedding for the Bangla language.
We cluster the word vectors to examine the relational similarity of words for intrinsic evaluation, and also use different word embeddings as features of news articles for extrinsic evaluation.
(2020-10-26T08:00:48Z)
- Principal Word Vectors [5.64434321651888]
We generalize principal component analysis for embedding words into a vector space.
We show that the spread and the discriminability of the principal word vectors are higher than those of other word embedding methods.
(2020-07-09T08:29:57Z)
- Word Rotator's Distance [50.67809662270474]
A key principle in assessing textual similarity is measuring the degree of semantic overlap between two texts by considering word alignment.
We show that the norm of word vectors is a good proxy for word importance, and their angle is a good proxy for word similarity.
We propose a method that first decouples word vectors into their norm and direction, and then computes alignment-based similarity.
(2020-04-30T17:48:42Z)
- Lexical Sememe Prediction using Dictionary Definitions by Capturing Local Semantic Correspondence [94.79912471702782]
Sememes, defined as the minimum semantic units of human languages, have been proven useful in many NLP tasks.
We propose a Sememe Correspondence Pooling (SCorP) model, which is able to capture this kind of matching to predict sememes.
We evaluate our model and baseline methods on HowNet, a well-known sememe knowledge base, and find that our model achieves state-of-the-art performance.
(2020-01-16T17:30:36Z)