Related papers: A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments

A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments

URL: http://arxiv.org/abs/2602.11795v1
Date: Thu, 12 Feb 2026 10:19:50 GMT
Title: A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments
Authors: Anne-Marie Lutgen, Alistair Plum, Christoph Purschke,
Abstract summary: This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists.<n>The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity.<n>Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research.
Score: 2.4384521157164345
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure does not strictly require manual annotation, but does produce transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in ''noisy'' or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.

Related papers

An Interpretable Deep Learning Approach for Morphological Script Type Analysis [15.142597136864618]
We propose an interpretable deep learning-based approach to morphological script type analysis. More precisely, we adapt a deep instance segmentation method to learn comparable character prototypes. We demonstrate our approach by applying it to the Textualis Formata script type and its two subtypes formalized by A. Derolez: Northern and Southern Textualis
arXiv Detail & Related papers (2024-08-20T19:15:06Z)
Modeling Orthographic Variation in Occitan's Dialects [3.038642416291856]
Large multilingual models minimize the need for spelling normalization during pre-processing. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
arXiv Detail & Related papers (2024-04-30T07:33:51Z)
Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models. We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
Multilingual Few-Shot Learning via Language Model Retrieval [18.465566186549072]
Transformer-based language models have achieved remarkable success in few-shot in-context learning. We conduct a study of retrieving semantically similar few-shot samples and using them as the context. We evaluate the proposed method on five natural language understanding datasets related to intent detection, question classification, sentiment analysis, and topic classification.
arXiv Detail & Related papers (2023-06-19T14:27:21Z)
Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings. Our model operates on parallel data in $N$ languages. We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in Low-Resource English Varieties [3.3536302616846734]
We present a human-in-the-loop approach to generate and filter effective contrast sets via corpus-guided edits. We show that our approach improves feature detection for both Indian English and African American English, demonstrate how it can assist linguistic research, and release our fine-tuned models for use by other researchers.
arXiv Detail & Related papers (2022-09-15T21:19:31Z)
Contextualized language models for semantic change detection: lessons learned [4.436724861363513]
We present a qualitative analysis of the outputs of contextualized embedding-based methods for detecting diachronic semantic change. Our findings show that contextualized methods can often predict high change scores for words which are not undergoing any real diachronic semantic shift. Our conclusion is that pre-trained contextualized language models are prone to confound changes in lexicographic senses and changes in contextual variance.
arXiv Detail & Related papers (2022-08-31T23:35:24Z)
Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context. Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.<n>We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements. We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
Morphologically Aware Word-Level Translation [82.59379608647147]
We propose a novel morphologically aware probability model for bilingual lexicon induction. Our model exploits the basic linguistic intuition that the lexeme is the key lexical unit of meaning.
arXiv Detail & Related papers (2020-11-15T17:54:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.