PUCP-Metrix: A Comprehensive Open-Source Repository of Linguistic Metrics for Spanish
- URL: http://arxiv.org/abs/2511.17402v1
- Date: Fri, 21 Nov 2025 17:03:00 GMT
- Title: PUCP-Metrix: A Comprehensive Open-Source Repository of Linguistic Metrics for Spanish
- Authors: Javier Alonso Villegas Luis, Marco Antonio Sobrevilla Cabezudo,
- Abstract summary: PUCP-Metrix is an open-source repository of 182 linguistic metrics spanning lexical diversity, syntactic and semantic complexity, cohesion, psycholinguistics, and readability. We evaluate its usefulness on Automated Readability Assessment and Machine-Generated Text Detection, showing competitive performance compared to an existing repository and strong neural baselines.
- Score: 0.7329092363953698
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Linguistic features remain essential for interpretability and tasks involving style, structure, and readability, but existing Spanish tools offer limited coverage. We present PUCP-Metrix, an open-source repository of 182 linguistic metrics spanning lexical diversity, syntactic and semantic complexity, cohesion, psycholinguistics, and readability. PUCP-Metrix enables fine-grained, interpretable text analysis. We evaluate its usefulness on Automated Readability Assessment and Machine-Generated Text Detection, showing competitive performance compared to an existing repository and strong neural baselines. PUCP-Metrix offers a comprehensive, extensible resource for Spanish, supporting diverse NLP applications.
Related papers
- dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model [24.35392364602848]
dots.ocr is a Vision-Language Model that learns three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench.
arXiv Detail & Related papers (2025-12-02T07:42:38Z)
- Multimodal Evaluation of Russian-language Architectures [88.00147763684451]
We introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities. Mera Multi provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages.
arXiv Detail & Related papers (2025-11-19T15:43:53Z)
- A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics [2.943391000885789]
This paper presents a novel scalable and fully automated methodology to extract bilingual parallel corpora from newspaper articles. We validate our approach by building parallel data corpus for two different language combinations and demonstrate the value of this dataset through a downstream task of machine translation.
arXiv Detail & Related papers (2025-10-15T06:57:23Z)
- Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review [2.377892000761193]
This review provides a comprehensive overview of available parallel corpora for Indic languages. We critically examine the challenges faced in corpus creation, including linguistic diversity, script variation, and data scarcity. We outline future directions, including leveraging cross-lingual transfer learning, expanding multilingual datasets, and integrating multimodal resources to enhance translation quality.
arXiv Detail & Related papers (2025-03-02T21:22:53Z)
- Behind Closed Words: Creating and Investigating the forePLay Annotated Dataset for Polish Erotic Discourse [0.0]
We present forePLay, a novel Polish language dataset for erotic content detection. This dataset features over 24k annotated sentences with a multidimensional taxonomy encompassing ambiguity, violence, and social unacceptability dimensions.
arXiv Detail & Related papers (2024-12-23T12:58:18Z)
- DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models [76.79929883963275]
DIALIGHT is a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems.
It features a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level.
Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses.
arXiv Detail & Related papers (2024-01-04T11:27:48Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- MultiAzterTest: a Multilingual Analyzer on Multiple Levels of Language for Readability Assessment [0.0]
MultiAzterTest is an open source NLP tool that analyzes texts on over 125 measures of cohesion, language, and readability for English, Spanish and Basque.
Using cross-lingual features, MultiAzterTest also obtains competitive results above all in a complex vs simple distinction.
arXiv Detail & Related papers (2021-09-10T13:34:52Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)