Clustering Word Embeddings with Self-Organizing Maps. Application on
LaRoSeDa -- A Large Romanian Sentiment Data Set
- URL: http://arxiv.org/abs/2101.04197v1
- Date: Mon, 11 Jan 2021 21:19:22 GMT
- Title: Clustering Word Embeddings with Self-Organizing Maps. Application on
LaRoSeDa -- A Large Romanian Sentiment Data Set
- Authors: Anca Maria Tache, Mihaela Gaman, Radu Tudor Ionescu
- Abstract summary: Romanian is one of the understudied languages in computational linguistics.
We introduce LaRoSeDa, a large Romanian sentiment data set.
We replace the k-means clustering algorithm with self-organizing maps.
- Score: 15.877673959068455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Romanian is one of the understudied languages in computational linguistics,
with few resources available for the development of natural language processing
tools. In this paper, we introduce LaRoSeDa, a Large Romanian Sentiment Data
Set, which is composed of 15,000 positive and negative reviews collected from
one of the largest Romanian e-commerce platforms. We employ two sentiment
classification methods as baselines for our new data set, one based on
low-level features (character n-grams) and one based on high-level features
(bag-of-word-embeddings generated by clustering word embeddings with k-means).
As an additional contribution, we replace the k-means clustering algorithm with
self-organizing maps (SOMs), obtaining better results because the resulting
clusters of word embeddings follow a distribution closer to Zipf's law, which
is known to govern natural language. We also demonstrate the generalization
capacity of using SOMs for the clustering of word embeddings on another
recently-introduced Romanian data set, for text categorization by topic.
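The pipeline described in the abstract can be sketched in a few dozen lines. The following is a minimal illustration, not the authors' code: it trains a tiny self-organizing map on toy random "word embeddings" (the paper uses pretrained Romanian word vectors), represents a document as a normalized histogram of cluster assignments (the bag-of-word-embeddings feature), and prints the sorted cluster sizes, whose rank-frequency shape is what the paper compares against Zipf's law. All names and dimensions here are illustrative assumptions.

```python
# Minimal sketch (toy data, not the authors' implementation) of:
# (1) training a small self-organizing map (SOM) on word embeddings,
# (2) building bag-of-word-embeddings document features from SOM clusters.
import math
import random

random.seed(0)

DIM = 8      # toy embedding dimensionality (real word vectors are e.g. 300-d)
GRID = 4     # SOM grid side; GRID * GRID cluster units
EPOCHS = 30

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(vectors, grid=GRID, epochs=EPOCHS):
    # One weight vector per grid cell, initialized randomly.
    units = [[random.uniform(-0.5, 0.5) for _ in range(DIM)]
             for _ in range(grid * grid)]
    for t in range(epochs):
        lr = 0.5 * (1 - t / epochs)                      # decaying learning rate
        radius = max(1.0, grid / 2 * (1 - t / epochs))   # shrinking neighborhood
        for v in vectors:
            # Best-matching unit (BMU) for this embedding.
            bmu = min(range(len(units)), key=lambda i: dist2(units[i], v))
            bx, by = divmod(bmu, grid)
            for i, u in enumerate(units):
                ux, uy = divmod(i, grid)
                d2 = (ux - bx) ** 2 + (uy - by) ** 2
                if d2 <= radius ** 2:
                    # Pull the unit (and its grid neighbors) toward the input.
                    h = lr * math.exp(-d2 / (2 * radius ** 2))
                    for k in range(DIM):
                        u[k] += h * (v[k] - u[k])
    return units

def bag_of_embeddings(doc_vectors, units):
    # Document feature: normalized histogram of BMU assignments.
    hist = [0.0] * len(units)
    for v in doc_vectors:
        bmu = min(range(len(units)), key=lambda i: dist2(units[i], v))
        hist[bmu] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

# Toy vocabulary of 200 random "word embeddings".
vocab = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]
units = train_som(vocab)

# Cluster sizes sorted descending: the paper's argument is that SOMs yield
# sizes closer to a Zipf-like rank-frequency distribution than k-means does.
sizes = [0] * len(units)
for v in vocab:
    sizes[min(range(len(units)), key=lambda i: dist2(units[i], v))] += 1
print(sorted(sizes, reverse=True))

doc = random.sample(vocab, 20)     # a "document" of 20 word vectors
features = bag_of_embeddings(doc, units)
print(len(features))               # 16-dimensional document feature
```

The resulting histograms can be fed to any standard classifier (the paper uses them as high-level features for sentiment classification); swapping `train_som` for k-means centroids recovers the baseline representation.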
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer [50.40191599304911]
We introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer).
In this paper, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language.
We show that, although our proposed framework is competitive with weak baselines on MoSECroT, it fails to match the results of some strong baselines.
arXiv Detail & Related papers (2024-01-09T21:09:07Z)
- A Process for Topic Modelling Via Word Embeddings [0.0]
This work combines algorithms based on word embeddings, dimensionality reduction, and clustering.
The objective is to obtain topics from a set of unclassified texts.
arXiv Detail & Related papers (2023-10-06T15:10:35Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost transfer learning method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Isomorphic Cross-lingual Embeddings for Low-Resource Languages [1.5076964620370268]
Cross-Lingual Word Embeddings (CLWEs) are a key component to transfer linguistic information learnt from higher-resource settings into lower-resource ones.
We introduce a framework to learn CLWEs, without assuming isometry, for low-resource pairs via joint exploitation of a related higher-resource language.
We show consistent gains over current methods in both quality and degree of isomorphism, as measured by bilingual lexicon induction (BLI) and eigenvalue similarity respectively.
arXiv Detail & Related papers (2022-03-28T10:39:07Z)
- Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- The RELX Dataset and Matching the Multilingual Blanks for Cross-Lingual Relation Classification [0.0]
Current approaches for relation classification are mainly focused on the English language.
We propose two cross-lingual relation classification models: a baseline model based on Multilingual BERT and a new multilingual pretraining setup.
For evaluation, we introduce a new public benchmark dataset for cross-lingual relation classification in English, French, German, Spanish, and Turkish.
arXiv Detail & Related papers (2020-10-19T11:08:16Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
- Low resource language dataset creation, curation and classification: Setswana and Sepedi -- Extended Abstract [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We propose baselines for classification, and investigate an approach on data augmentation better suited to low-resourced languages.
arXiv Detail & Related papers (2020-03-30T18:03:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides (including all of the above) and is not responsible for any consequences of its use.