Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers
- URL: http://arxiv.org/abs/2402.17914v2
- Date: Sat, 23 Mar 2024 20:21:04 GMT
- Title: Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers
- Authors: Roy Xie, Orevaoghene Ahia, Yulia Tsvetkov, Antonios Anastasopoulos
- Abstract summary: We present a novel approach to extract distinguishing lexical features of dialects by utilizing interpretable dialect classifiers.
We experimentally demonstrate that our method successfully identifies key language-specific lexical features that contribute to dialectal variations.
- Score: 43.756851270091516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identifying linguistic differences between dialects of a language often requires expert knowledge and meticulous human analysis. This is largely due to the complexity and nuance involved in studying various dialects. We present a novel approach to extract distinguishing lexical features of dialects by utilizing interpretable dialect classifiers, even in the absence of human experts. We explore both post-hoc and intrinsic approaches to interpretability, conduct experiments on Mandarin, Italian, and Low Saxon, and experimentally demonstrate that our method successfully identifies key language-specific lexical features that contribute to dialectal variations.
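As a rough illustration of the intrinsic-interpretability side of this idea (not the paper's actual models, languages, or data), the sketch below trains a simple linear bag-of-words dialect classifier and reads distinguishing lexical features directly off its weights; the dialect names and example sentences are invented for the example.
```python
# Minimal sketch, assuming scikit-learn is available: an intrinsically
# interpretable (linear) dialect classifier whose coefficients expose the
# lexical items that separate two hypothetical dialects.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus: invented sentences from two made-up dialects of one language.
texts = [
    "wee bairn went doon the road",        # dialect_A
    "the bairn is awfy wee",               # dialect_A
    "the small child went down the road",  # dialect_B
    "the child is very small",             # dialect_B
]
labels = ["dialect_A", "dialect_A", "dialect_B", "dialect_B"]

# Bag-of-words features keep the mapping word -> weight transparent.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

# Intrinsic interpretability: the largest-magnitude coefficients point to
# the lexical features the classifier relies on. With binary labels,
# coef_[0] is the weight vector for the second class (dialect_B), so
# positive weights push toward dialect_B and negative ones toward dialect_A.
vocab = np.array(vectorizer.get_feature_names_out())
coefs = clf.coef_[0]
top_b = vocab[np.argsort(coefs)[-5:]]  # strongest dialect_B markers
top_a = vocab[np.argsort(coefs)[:5]]   # strongest dialect_A markers
print("dialect_A markers:", top_a)
print("dialect_B markers:", top_b)
```
A post-hoc analogue of the same extraction step would instead apply a feature-attribution method (e.g., occlusion or LIME) to a black-box neural classifier and aggregate token-level attributions over a corpus to surface the same kind of dialect-specific lexical features.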
Related papers
- Disentangling Dialect from Social Bias via Multitask Learning to Improve Fairness [16.746758715820324]
We present a multitask learning approach that models dialect language as an auxiliary task to incorporate syntactic and lexical variations.
In our experiments with African-American English dialect, we provide empirical evidence that complementing common learning approaches with dialect modeling improves their fairness.
Results suggest that multitask learning achieves state-of-the-art performance and helps to detect properties of biased language more reliably.
arXiv Detail & Related papers (2024-06-14T12:39:39Z)
- Explainability of machine learning approaches in forensic linguistics: a case study in geolinguistic authorship profiling [46.58131072375399]
We explore the explainability of machine learning approaches considering the forensic context.
We focus on variety classification as a means of geolinguistic profiling of unknown texts based on social media data from the German-speaking area.
We find that the extracted lexical features are indeed representative of their respective varieties and note that the trained models also rely on place names for classifications.
arXiv Detail & Related papers (2024-04-29T08:52:52Z)
- BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [56.93604813379634]
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels.
We propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels.
We highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
arXiv Detail & Related papers (2023-06-02T12:54:38Z)
- Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in Low-Resource English Varieties [3.3536302616846734]
We present a human-in-the-loop approach to generate and filter effective contrast sets via corpus-guided edits.
We show that our approach improves feature detection for both Indian English and African American English, demonstrate how it can assist linguistic research, and release our fine-tuned models for use by other researchers.
arXiv Detail & Related papers (2022-09-15T21:19:31Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- Morphological Disambiguation from Stemming Data [1.2183405753834562]
Kinyarwanda, a morphologically rich language, currently lacks tools for automated morphological analysis.
We learn to morphologically disambiguate Kinyarwanda verbal forms from a new stemming dataset collected through crowd-sourcing.
Our experiments reveal that inflectional properties of stems and morpheme association rules are the most discriminative features for disambiguation.
arXiv Detail & Related papers (2020-11-11T01:44:09Z)
- Rediscovering the Slavic Continuum in Representations Emerging from Neural Models of Spoken Language Identification [16.369477141866405]
We present a neural model for Slavic language identification in speech signals.
We analyze its emergent representations to investigate whether they reflect objective measures of language relatedness.
arXiv Detail & Related papers (2020-10-22T18:18:19Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.