RusLICA: A Russian-Language Platform for Automated Linguistic Inquiry and Category Analysis
- URL: http://arxiv.org/abs/2601.20275v1
- Date: Wed, 28 Jan 2026 05:43:40 GMT
- Title: RusLICA: A Russian-Language Platform for Automated Linguistic Inquiry and Category Analysis
- Authors: Elina Sigdel, Anastasia Panfilova
- Abstract summary: The paper describes the process of mapping lemmas to 42 psycholinguistic categories and the implementation of the analyzer as part of the RusLICA web service.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Defining psycholinguistic characteristics in written texts is a task gaining increasing attention from researchers. One of the most widely used tools in the field is Linguistic Inquiry and Word Count (LIWC), which was originally developed to analyze English texts and has since been translated into multiple languages. Our approach adapts the LIWC methodology to the Russian language, taking into account its grammatical and cultural specificities. The suggested approach comprises 96 categories, integrating syntactic, morphological, lexical, and general statistical features, as well as predictions obtained from pre-trained language models (LMs) for text analysis. Rather than directly translating existing thesauri, we built the dictionary specifically for the Russian language based on content from several lexicographic resources, semantic dictionaries, and corpora. The paper describes the process of mapping lemmas to 42 psycholinguistic categories and the implementation of the analyzer as part of the RusLICA web service.
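The lemma-to-category mapping mentioned in the abstract can be illustrated with a minimal LIWC-style counting sketch. This is not the RusLICA implementation: the toy lexicon, the category names, and the `naive_lemmas` helper are all hypothetical, and a real pipeline would use a proper Russian lemmatizer instead of lowercased word tokens.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: lemma -> set of category labels.
# The real RusLICA dictionary and its 42 categories are not reproduced here.
TOY_LEXICON = {
    "радость": {"positive_emotion"},
    "грусть": {"negative_emotion"},
    "я": {"first_person"},
    "мы": {"first_person"},
}


def naive_lemmas(text):
    """Placeholder for lemmatization: lowercased word tokens only."""
    return re.findall(r"\w+", text.lower())


def category_shares(lemmas, lexicon):
    """Return, per category, the share of tokens whose lemma maps to it."""
    counts = Counter()
    for lemma in lemmas:
        for category in lexicon.get(lemma, ()):
            counts[category] += 1
    total = len(lemmas)
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


shares = category_shares(naive_lemmas("Я чувствую радость, мы вместе"), TOY_LEXICON)
print(shares)
```

Reporting shares rather than raw counts keeps texts of different lengths comparable, which is the usual convention for dictionary-based analyzers of this kind.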
Related papers
- Towards Corpus-Grounded Agentic LLMs for Multilingual Grammatical Analysis [0.5545791216381869]
We explore how agentic large language models (LLMs) can streamline the systematic analysis of annotated corpora. We introduce an agentic framework for corpus-grounded grammatical analysis that integrates concepts such as natural-language task interpretation. We test the system on multilingual grammatical tasks inspired by the World Atlas of Language Structures (WALS).
arXiv Detail & Related papers (2025-11-28T21:27:58Z) - Predicate-Argument Structure Divergences in Chinese and English Parallel Sentences and their Impact on Language Transfer [6.834698677197089]
Cross-lingual Natural Language Processing offers practical solutions in low-resource settings. Linguistic divergences hinder language transfer, especially among typologically distant languages. We present an analysis of predicate-argument structures in parallel Chinese and English sentences.
arXiv Detail & Related papers (2025-11-12T22:55:29Z) - From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars [0.17205738196786996]
We introduce a set of benchmarks to evaluate how well models can extract and classify information from linguistic grammars. The benchmarks encompass linguistic descriptions for 248 languages across language families, focusing on typological features from WALS and Grambank. This set of benchmarks offers the first comprehensive evaluation of language models' in-context ability to accurately interpret and extract linguistic features.
arXiv Detail & Related papers (2024-11-23T14:47:10Z) - A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z) - The Grammar and Syntax Based Corpus Analysis Tool For The Ukrainian Language [0.0]
The StyloMetrix is a tool to analyze grammatical, stylistic, and syntactic patterns in English, Spanish, German, and others.
We describe the StyloMetrix pipeline and provide some experiments with this tool for the text classification task.
We also describe our package's main limitations and the metrics' evaluation procedure.
arXiv Detail & Related papers (2023-05-22T22:52:47Z) - CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark [5.258267224004844]
We introduce an advanced Russian general language understanding evaluation benchmark -- RussianGLUE.
For the first time, a benchmark of nine tasks, collected and organized analogically to the SuperGLUE methodology, was developed from scratch for the Russian language.
arXiv Detail & Related papers (2020-10-29T20:31:39Z) - Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z) - Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)