The Grammar and Syntax Based Corpus Analysis Tool For The Ukrainian
Language
- URL: http://arxiv.org/abs/2305.13530v1
- Date: Mon, 22 May 2023 22:52:47 GMT
- Title: The Grammar and Syntax Based Corpus Analysis Tool For The Ukrainian
Language
- Authors: Daria Stetsenko and Inez Okulska
- Abstract summary: StyloMetrix is a tool for analyzing grammatical, stylistic, and syntactic patterns, developed initially for Polish and later extended to English and, recently, Ukrainian.
We describe the StyloMetrix pipeline and provide some experiments with this tool for the text classification task.
We also describe our package's main limitations and the metrics' evaluation procedure.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper provides an overview of StyloMetrix, a text mining tool
developed initially for the Polish language, later extended to English, and
recently to Ukrainian. StyloMetrix is built upon various metrics crafted
manually by computational linguists and researchers from literary studies to
analyze grammatical, stylistic, and syntactic patterns. The idea of
constructing a statistical evaluation of syntactic and grammatical features is
straightforward and familiar for languages such as English, Spanish, and
German; it is yet to be developed for low-resource languages like
Ukrainian. We describe the StyloMetrix pipeline and provide some experiments
with this tool for the text classification task. We also describe our package's
main limitations and the metrics' evaluation procedure.
Related papers
- LiMe: a Latin Corpus of Late Medieval Criminal Sentences [39.26357402982764]
We present the LiMe dataset, a corpus of 325 documents extracted from a series of medieval manuscripts called Libri sententiarum potestatis Mediolani.
arXiv Detail & Related papers (2024-04-19T12:06:28Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- StyloMetrix: An Open-Source Multilingual Tool for Representing Stylometric Vectors [0.0]
This work aims to provide an overview of the open-source multilanguage tool called StyloMetrix.
It offers stylometric text representations that cover various aspects of grammar, syntax and lexicon.
StyloMetrix covers four languages: Polish as the primary language, English, Ukrainian and Russian.
arXiv Detail & Related papers (2023-09-22T11:53:47Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- NILC-Metrix: assessing the complexity of written and spoken language in Brazilian Portuguese [0.32622301272834514]
This paper presents and makes publicly available the NILC-Metrix, a computational system comprising 200 metrics proposed in studies on discourse.
The metrics in NILC-Metrix were developed during the last 13 years, starting in 2008 with Coh-Metrix-Port, a tool developed within the scope of the PorSimples project.
arXiv Detail & Related papers (2021-12-17T16:51:00Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- The Grammar of Emergent Languages [19.17358904009426]
We show that UGI techniques are appropriate to analyse emergent languages.
We then study if the languages that emerge in a typical referential game setup exhibit syntactic structure.
Our experiments demonstrate that a certain message length and vocabulary size are required for structure to emerge.
arXiv Detail & Related papers (2020-10-05T15:06:27Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)