More Romanian word embeddings from the RETEROM project
- URL: http://arxiv.org/abs/2111.10750v1
- Date: Sun, 21 Nov 2021 06:05:12 GMT
- Title: More Romanian word embeddings from the RETEROM project
- Authors: Vasile Păiș, Dan Tufiș
- Abstract summary: "Word embeddings" are automatically learned vector representations of words.
We plan to develop an open-access large library of ready-to-use word embedding sets.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automatically learned vector representations of words, also known as "word
embeddings", are becoming a basic building block for more and more natural
language processing algorithms. There are different ways and tools for
constructing word embeddings. Most approaches rely on raw texts, with word
occurrences and/or letter n-grams as the construction items. More elaborate
research uses additional linguistic features extracted after text
preprocessing. Morphology is clearly served by vector representations
constructed from raw texts and letter n-grams. Studies of syntax and
semantics may profit more from vector representations constructed with
additional features such as lemma, part-of-speech, and syntactic or semantic dependants
associated with each word. One of the key objectives of the ReTeRom project is
the development of advanced technologies for Romanian natural language
processing, including morphological, syntactic and semantic analysis of text.
As such, we plan to develop a large, open-access library of ready-to-use word
embedding sets, each set characterized by different parameters: the features
used (wordforms, letter n-grams, lemmas, POSes, etc.), vector lengths,
window/context size, and frequency thresholds. To this end, the previously
created sets of word embeddings (based on word occurrences) on the CoRoLa
corpus (Păiș and Tufiș, 2018) are being, and will be further, augmented
with new representations learned from the same corpus using specific
features such as lemmas and parts of speech. Furthermore, in order to better
understand and explore the vectors, graphical representations will be made
available through customized interfaces.
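
For illustration, here is a minimal sketch of how such a library of embedding sets could be built: the same corpus is processed once per feature type (raw wordforms vs. lemma_POS tokens) and once per parameter combination (vector length, window size, frequency threshold). This is not the ReTeRom pipeline; the gensim library, the file names, the lemma_POS token format, and the query word are all illustrative assumptions.

```python
# Sketch: building several word-embedding sets from one corpus by varying
# the input features and the training parameters. File names, the
# "lemma_POS" annotation format, and the query tokens are hypothetical.
from gensim.models import Word2Vec

def read_corpus(path):
    """Yield one whitespace-tokenized sentence per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:
                yield tokens

# One embedding set per (feature type, vector length, window, threshold).
configs = [
    ("corola_wordforms.txt", "limba", 300, 5, 5),       # raw word occurrences
    ("corola_lemma_pos.txt", "limba_NOUN", 300, 5, 5),  # tokens like "merge_VERB"
]

for path, query, size, window, min_count in configs:
    model = Word2Vec(
        sentences=list(read_corpus(path)),
        vector_size=size,      # vector length
        window=window,         # context size
        min_count=min_count,   # frequency threshold
        workers=4,
    )
    model.wv.save(f"{path}.{size}d.kv")  # one ready-to-use embedding set
    if query in model.wv:                # explore the resulting vector space
        print(query, model.wv.most_similar(query, topn=5))
```

Each saved KeyedVectors file corresponds to one "ready-to-use set" in the sense of the abstract; varying the tuples in `configs` yields the different parameterizations described above.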
Related papers
- From Characters to Words: Hierarchical Pre-trained Language Model for
Open-vocabulary Language Understanding [22.390804161191635]
Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens.
This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes.
We introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach.
arXiv Detail & Related papers (2023-05-23T23:22:20Z)
- RWEN-TTS: Relation-aware Word Encoding Network for Natural
Text-to-Speech Synthesis [3.591224588041813]
A huge number of text-to-speech (TTS) models produce human-like speech.
The Relation-aware Word Encoding Network (RWEN) effectively incorporates syntactic and semantic information through two modules.
Experimental results show substantial improvements compared to previous works.
arXiv Detail & Related papers (2022-12-15T16:17:03Z)
- Comparing Performance of Different Linguistically-Backed Word Embeddings
for Cyberbullying Detection [3.029434408969759]
Word embeddings are usually learned only from raw tokens or, in some cases, lemmas.
We propose to preserve the morphological, syntactic and other types of linguistic information by combining them with the raw tokens or lemmas.
arXiv Detail & Related papers (2022-06-04T09:11:41Z)
- Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore utilising higher-order neighbourhood information to encode rich structures in constituent trees.
Experiments on the English and Chinese portions of OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z)
- Modelling the semantics of text in complex document layouts using graph
transformer networks [0.0]
We propose a model that approximates the human reading pattern of a document and outputs a unique semantic representation for every text span.
We base our architecture on a graph representation of the structured text, and we demonstrate that not only can we retrieve semantically similar information across documents but also that the embedding space we generate captures useful semantic information.
arXiv Detail & Related papers (2022-02-18T11:49:06Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- Between words and characters: A Brief History of Open-Vocabulary
Modeling and Tokenization in NLP [22.772546707304766]
We show how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated.
We conclude that there is not, and likely never will be, a single silver-bullet solution for all applications.
arXiv Detail & Related papers (2021-12-20T13:04:18Z)
- Deriving Word Vectors from Contextualized Language Models using
Topic-Aware Mention Selection [46.97185212695267]
We propose a method for learning word representations that follows this basic strategy.
We take advantage of contextualized language models (CLMs) rather than bags of word vectors to encode contexts.
We show that this simple strategy leads to high-quality word vectors, which are more predictive of semantic properties than word embeddings and existing CLM-based strategies.
arXiv Detail & Related papers (2021-06-15T08:02:42Z)
- Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)
- Unsupervised Distillation of Syntactic Information from Contextualized
Word Representations [62.230491683411536]
We tackle the task of unsupervised disentanglement between semantics and structure in neural language representations.
To this end, we automatically generate groups of sentences which are structurally similar but semantically different.
We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics.
arXiv Detail & Related papers (2020-10-11T15:13:18Z)
- A Comparative Study on Structural and Semantic Properties of Sentence
Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)