StyloMetrix: An Open-Source Multilingual Tool for Representing
Stylometric Vectors
- URL: http://arxiv.org/abs/2309.12810v1
- Date: Fri, 22 Sep 2023 11:53:47 GMT
- Title: StyloMetrix: An Open-Source Multilingual Tool for Representing
Stylometric Vectors
- Authors: Inez Okulska, Daria Stetsenko, Anna Kołos, Agnieszka Karlińska, Kinga Głąbińska, Adam Nowakowski
- Abstract summary: This work aims to provide an overview of the open-source multilingual tool called StyloMetrix.
It offers stylometric text representations that cover various aspects of grammar, syntax and lexicon.
StyloMetrix covers four languages: Polish as the primary language, English, Ukrainian and Russian.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This work provides an overview of the open-source multilingual tool
called StyloMetrix. It offers stylometric text representations that cover
various aspects of grammar, syntax, and lexicon. StyloMetrix covers four
languages: Polish as the primary language, English, Ukrainian, and Russian. The
normalized output of each feature can serve as a fruitful source of input for
machine learning models and as a valuable addition to the embedding layer of any
deep learning algorithm. We strive to provide a concise but exhaustive overview
of the application of the StyloMetrix vectors, as well as to explain the sets of
developed linguistic features. The experiments have shown promising results in
supervised content classification with simple algorithms such as the Random
Forest Classifier, Voting Classifier, Logistic Regression, and others. The deep
learning assessments have demonstrated the usefulness of the StyloMetrix vectors
in enhancing an embedding layer extracted from Transformer architectures.
StyloMetrix has proven to be a strong resource for machine learning and deep
learning algorithms across different classification tasks.
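To make the described pipeline concrete, here is a minimal sketch of supervised classification on StyloMetrix vectors with scikit-learn. The stylo_metrix interface (the StyloMetrix class and its transform method) follows the package's published usage but is an assumption here and should be verified against the installed version; the corpus and labels are placeholders.

```python
# A minimal sketch: StyloMetrix feature vectors -> Random Forest Classifier.
# Assumption: stylo_metrix exposes StyloMetrix(lang) with a transform()
# method returning a per-text table of normalized metrics.
import stylo_metrix as sm
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

texts = ["first document ...", "second document ..."]  # placeholder corpus
labels = [0, 1]                                        # placeholder labels

stylo = sm.StyloMetrix("en")         # one of: pl, en, uk, ru
vectors = stylo.transform(texts)     # normalized stylometric metrics per text
X = vectors.select_dtypes("number")  # keep only the numeric metric columns

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25)
clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

The same numeric vectors can also be concatenated with a Transformer's pooled embedding before the final classification layer, which is the deep learning use the abstract refers to.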
Related papers
- Comparative Analysis of Multilingual Text Classification & Identification through Deep Learning and Embedding Visualization [0.0]
The study employs LangDetect, LangId, FastText, and Sentence Transformer on a dataset encompassing 17 languages.
The FastText multi-layer perceptron model achieved remarkable accuracy, precision, recall, and F1 score, outperforming the Sentence Transformer model.
arXiv Detail & Related papers (2023-12-06T12:03:27Z)
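A minimal sketch of the kind of comparison described above: the same inputs run through two off-the-shelf language identifiers. It assumes langdetect and fasttext are installed and that fastText's published LID model file (lid.176.ftz) has been downloaded locally.

```python
# Sketch: language identification with langdetect vs. fastText LID.
from langdetect import detect
import fasttext

samples = ["This is English.", "To jest język polski.", "Це українська мова."]

for text in samples:
    print(detect(text))                   # e.g. 'en', 'pl', 'uk'

lid = fasttext.load_model("lid.176.ftz")  # download this model file first
labels, probs = lid.predict(samples[0])
print(labels[0], round(float(probs[0]), 3))  # e.g. '__label__en' 0.99
```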
- Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways of specifying novel categories: language descriptions, image exemplars, or a combination of the two.
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
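The paper's detector is not reproduced here, but the first of its three ways of specifying categories (free-form language descriptions) can be sketched with an off-the-shelf CLIP model scoring a cropped region proposal; the checkpoint choice and the file path are illustrative assumptions, not the paper's method.

```python
# Sketch: score an image region against natural-language category
# descriptions with CLIP (a stand-in for the paper's own classifiers).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

region = Image.open("region_crop.jpg")  # placeholder: a cropped region proposal
descriptions = ["a photo of a zebra", "a photo of a horse", "a photo of a dog"]

inputs = processor(text=descriptions, images=region,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
print(dict(zip(descriptions, probs.tolist())))
```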
- The Grammar and Syntax Based Corpus Analysis Tool For The Ukrainian Language [0.0]
StyloMetrix is a tool for analyzing grammatical, stylistic, and syntactic patterns in English, Spanish, German, and other languages.
We describe the StyloMetrix pipeline and provide some experiments with this tool for the text classification task.
We also describe our package's main limitations and the evaluation procedure for the metrics.
arXiv Detail & Related papers (2023-05-22T22:52:47Z)
- GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation [76.7772833556714]
We introduce GENIUS: a conditional text generation model using sketches as input.
GENIUS is pre-trained on a large-scale textual corpus with a novel reconstruction-from-sketch objective.
We show that GENIUS can be used as a strong and ready-to-use data augmentation tool for various natural language processing (NLP) tasks.
arXiv Detail & Related papers (2022-11-18T16:39:45Z)
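A hedged sketch of how such a sketch-to-text model might be called for augmentation; the checkpoint id and the mask-separated sketch format are assumptions based on the paper's description, so check the authors' release for the exact conventions.

```python
# Sketch (assumptions flagged): generate augmented text from a keyword sketch.
from transformers import pipeline

# "beyond/genius-large" is an assumed checkpoint id, not verified here.
genius = pipeline("text2text-generation", model="beyond/genius-large")
sketch = "<mask> stylometric features <mask> text classification <mask>"
print(genius(sketch, num_beams=4, max_length=64)[0]["generated_text"])
```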
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
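The core idea of a token-free model can be shown in a few lines: inputs map to raw UTF-8 bytes rather than a learned subword vocabulary, so any language or script fits in the same fixed id space. This sketch only illustrates the input encoding, not Charformer's learned downsampling.

```python
# Sketch: token-free (byte-level) input encoding, no vocabulary needed.
def byte_encode(text: str, max_len: int = 32) -> list[int]:
    ids = list(text.encode("utf-8"))[:max_len]  # every id is in 0..255
    return ids + [0] * (max_len - len(ids))     # zero-pad to a fixed length

print(byte_encode("hello"))
print(byte_encode("привіт"))  # Cyrillic needs no special handling
```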
- Latin writing styles analysis with Machine Learning: New approach to old questions [0.0]
In the Middle Ages, texts were learned by heart and passed down orally from generation to generation.
Given this distinctive mode of composition in Latin literature, we can search for and indicate probable patterns linking specific narrative texts to familiar sources.
arXiv Detail & Related papers (2021-09-01T20:21:45Z)
- Cross-lingual Text Classification with Heterogeneous Graph Neural Network [2.6936806968297913]
Cross-lingual text classification aims at training a classifier on the source language and transferring the knowledge to target languages.
Recent multilingual pretrained language models (mPLM) achieve impressive results in cross-lingual classification tasks.
We propose a simple yet effective method to incorporate heterogeneous information within and across languages for cross-lingual text classification.
arXiv Detail & Related papers (2021-05-24T12:45:42Z)
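The train-on-source, predict-on-target setup can be sketched with a plain multilingual-encoder baseline (not the paper's heterogeneous GNN); the sentence-transformers checkpoint is a common public model, and the two-example datasets are placeholders.

```python
# Sketch: zero-shot cross-lingual classification baseline.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

en_texts, en_labels = ["great film", "terrible film"], [1, 0]  # source: English
de_texts = ["toller Film", "schrecklicher Film"]               # target: German

clf = LogisticRegression().fit(encoder.encode(en_texts), en_labels)
print(clf.predict(encoder.encode(de_texts)))  # expect [1, 0]
```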
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
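The standard Hugging Face plumbing that script-adaptation methods build on can be shown directly: add tokens for the unseen script and grow the embedding matrix. The paper's contribution is making the subsequent adaptation data-efficient, which this sketch does not cover; the example characters are illustrative.

```python
# Sketch: extend a multilingual model's vocabulary for an unseen script.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

new_tokens = ["ⴰ", "ⵣ", "ⵡ"]  # e.g. Tifinagh characters (illustrative)
added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # new rows start randomly initialized
print(f"added {added} tokens; vocabulary size is now {len(tokenizer)}")
```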
- Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores universal representation learning, i.e., the embedding of different levels of linguistic units in a uniform vector space.
We present our approach to constructing analogy datasets in terms of words, phrases, and sentences.
We empirically verify that well-pretrained Transformer models, combined with appropriate training settings, can effectively yield universal representations.
arXiv Detail & Related papers (2020-09-10T03:53:18Z)
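The word-level analogy test that such datasets generalize can be written out with toy vectors; in a good universal space, the same offset arithmetic should also hold for phrase- and sentence-level embeddings.

```python
# Sketch: analogy via vector offsets: king - man + woman ≈ queen.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vec = {  # toy 3-d embeddings standing in for real model outputs
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}
predicted = vec["king"] - vec["man"] + vec["woman"]
print(cosine(predicted, vec["queen"]))  # ≈ 1.0 in this toy space
```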
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
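The analysis style can be sketched with classical CCA from scikit-learn (the paper uses singular vector CCA, a related variant); the two synthetic "views" stand in for typological features and learned language vectors.

```python
# Sketch: measuring shared structure between two views of the same languages.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_langs = 50
view_a = rng.normal(size=(n_langs, 8))             # e.g. typological features
noise = 0.1 * rng.normal(size=(n_langs, 6))
view_b = view_a @ rng.normal(size=(8, 6)) + noise  # e.g. learned language vectors

cca = CCA(n_components=2).fit(view_a, view_b)
a_c, b_c = cca.transform(view_a, view_b)
print([float(np.corrcoef(a_c[:, i], b_c[:, i])[0, 1]) for i in range(2)])
# high canonical correlations indicate shared cross-view structure
```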