Comparing Performance of Different Linguistically-Backed Word Embeddings for Cyberbullying Detection
- URL: http://arxiv.org/abs/2206.01950v1
- Date: Sat, 4 Jun 2022 09:11:41 GMT
- Title: Comparing Performance of Different Linguistically-Backed Word Embeddings for Cyberbullying Detection
- Authors: Juuso Eronen, Michal Ptaszynski and Fumito Masui
- Abstract summary: In most cases, word embeddings are learned only from raw tokens or, in some cases, lemmas.
We propose to preserve morphological, syntactic, and other types of linguistic information by combining it with the raw tokens or lemmas.
- Score: 3.029434408969759
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In most cases, word embeddings are learned only from raw tokens or, in some
cases, lemmas. This includes pre-trained language models like BERT. To
investigate the potential of capturing deeper relations between lexical items
and structures, and to filter out redundant information, we propose to
preserve morphological, syntactic, and other types of linguistic information
by combining it with the raw tokens or lemmas. This means, for example,
including part-of-speech or dependency information within the lexical
features used. The word embeddings can then be trained on these combinations
instead of on raw tokens alone. This method could later be applied to the
pre-training of large language models as well, potentially enhancing their
performance. This would aid in tackling problems that are more sophisticated
from the point of view of linguistic representation, such as the detection of
cyberbullying.
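For illustration, here is a minimal sketch of the token+tag combination idea. It assumes spaCy for tagging and gensim's Word2Vec for training; both tools, and all names in the snippet, are stand-ins chosen for this example, not the authors' actual pipeline.

```python
import spacy
from gensim.models import Word2Vec

# Load a small English pipeline that provides lemmas and POS tags.
nlp = spacy.load("en_core_web_sm")

def to_combined_tokens(text: str) -> list[str]:
    """Concatenate each lemma with its part-of-speech tag, e.g. 'run' ->
    'run_VERB', so the vocabulary separates the verb 'run' from the noun."""
    return [f"{tok.lemma_}_{tok.pos_}" for tok in nlp(text) if not tok.is_space]

corpus = [
    "They run a small shop.",
    "She went for a run this morning.",
]
sentences = [to_combined_tokens(s) for s in corpus]

# Train the embeddings on the lemma+POS combinations instead of raw tokens.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)
print(model.wv.most_similar("run_VERB"))
```

The same pattern extends to dependency labels (tok.dep_) or other annotations; the only change is what gets concatenated into the lexical feature.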
Related papers
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods such as typos and word order shuffling, which resonate with human cognitive patterns and allow perturbations to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Lexical Complexity Prediction: An Overview [13.224233182417636]
The occurrence of unknown words in texts significantly hinders reading comprehension.
Computational modelling has been applied to identify complex words in texts and replace them with simpler alternatives.
We present an overview of computational approaches to lexical complexity prediction focusing on the work carried out on English data.
arXiv Detail & Related papers (2023-03-08T19:35:08Z)
- Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models [12.0190584907439]
We propose a new method to exploit word structure and integrate lexical semantics into character representations of pre-trained models.
We show that our approach achieves superior performance over the basic pre-trained models BERT, BERT-wwm and ERNIE on different Chinese NLP tasks.
arXiv Detail & Related papers (2022-07-13T02:28:08Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual sentence encoders for the amount of cross-lingual lexical knowledge stored in their parameters and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- More Romanian word embeddings from the RETEROM project [0.0]
"word embeddings" are automatically learned vector representations of words.
We plan to develop an openaccess large library of ready-to-use word embeddings sets.
arXiv Detail & Related papers (2021-11-21T06:05:12Z)
- On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z)
- Human-in-the-Loop Refinement of Word Embeddings [0.0]
We propose a system that incorporates an adaptation of word embedding post-processing, which we call "interactive refitting".
Our approach allows a human to identify and address potential quality issues with word embeddings interactively.
It also allows for better insight into what effect word embeddings, and refinements to word embeddings, have on machine learning pipelines.
arXiv Detail & Related papers (2021-10-06T16:10:32Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embeddings for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace the language embedding.
XLP projects the word embeddings into a language-specific semantic space; the projected embeddings are then fed into the Transformer model (see the sketch after this list).
Experiments show that XLP can significantly boost model performance on a wide range of multilingual benchmark datasets.
arXiv Detail & Related papers (2021-02-16T18:47:10Z)
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
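For context, the following is a hypothetical PyTorch sketch of the XLP-style language projection referenced above: each language gets its own projection matrix applied to shared word embeddings, in place of an additive language embedding. All class names, shapes, and initialization choices are assumptions made for this illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class XLPEmbedding(nn.Module):
    """Illustrative stand-in for Cross-lingual Language Projection (XLP)."""

    def __init__(self, vocab_size: int, n_languages: int, d_model: int):
        super().__init__()
        # Word embeddings shared across all languages.
        self.word_emb = nn.Embedding(vocab_size, d_model)
        # One projection matrix per language (identity init is an assumption).
        self.lang_proj = nn.Parameter(
            torch.eye(d_model).repeat(n_languages, 1, 1))

    def forward(self, token_ids: torch.Tensor, lang_id: int) -> torch.Tensor:
        x = self.word_emb(token_ids)        # (batch, seq_len, d_model)
        return x @ self.lang_proj[lang_id]  # project into the language's space

# Toy usage: the projected embeddings would then be fed into a Transformer.
emb = XLPEmbedding(vocab_size=30000, n_languages=4, d_model=256)
tokens = torch.randint(0, 30000, (2, 16))
print(emb(tokens, lang_id=1).shape)  # torch.Size([2, 16, 256])
```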