Stability of Syntactic Dialect Classification Over Space and Time
- URL: http://arxiv.org/abs/2209.04958v1
- Date: Sun, 11 Sep 2022 23:14:59 GMT
- Title: Stability of Syntactic Dialect Classification Over Space and Time
- Authors: Jonathan Dunn and Sidney Wong
- Abstract summary: This paper constructs a test set for 12 dialects of English that spans three years at monthly intervals with a fixed spatial distribution across 1,120 cities.
The decay rate of classification performance for each dialect over time allows us to identify regions undergoing syntactic change.
And the distribution of classification accuracy within dialect regions allows us to identify the degree to which the grammar of a dialect is internally heterogeneous.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper analyses the degree to which dialect classifiers based on
syntactic representations remain stable over space and time. While previous
work has shown that the combination of grammar induction and geospatial text
classification produces robust dialect models, we do not know what influence
both changing grammars and changing populations have on dialect models. This
paper constructs a test set for 12 dialects of English that spans three years
at monthly intervals with a fixed spatial distribution across 1,120 cities.
Syntactic representations are formulated within the usage-based Construction
Grammar paradigm (CxG). The decay rate of classification performance for each
dialect over time allows us to identify regions undergoing syntactic change.
And the distribution of classification accuracy within dialect regions allows
us to identify the degree to which the grammar of a dialect is internally
heterogeneous. The main contribution of this paper is to show that a rigorous
evaluation of dialect classification models can be used to find both variation
over space and change over time.
Related papers
- Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum [25.732397636695882]
We measure speech-to-text performance on Italian dialects, and empirically observe a geographical performance disparity.
This disparity correlates substantially (-0.5) with linguistic similarity to the highest performing dialect variety.
We additionally leverage geostatistical methods to predict zero-shot performance at unseen sites, and find the incorporation of geographical information to substantially improve prediction performance.
arXiv Detail & Related papers (2024-10-18T16:39:42Z)
- Modeling Orthographic Variation in Occitan's Dialects [3.038642416291856]
Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
arXiv Detail & Related papers (2024-04-30T07:33:51Z)
- Explainability of machine learning approaches in forensic linguistics: a case study in geolinguistic authorship profiling [46.58131072375399]
We explore the explainability of machine learning approaches considering the forensic context.
We focus on variety classification as a means of geolinguistic profiling of unknown texts based on social media data from the German-speaking area.
We find that the extracted lexical features are indeed representative of their respective varieties and note that the trained models also rely on place names for classifications.
arXiv Detail & Related papers (2024-04-29T08:52:52Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Syntactic Variation Across the Grammar: Modelling a Complex Adaptive System [0.76146285961466]
We model dialectal variation across 49 local populations of English speakers in 16 countries.
Results show that an important part of syntactic variation consists of interactions between different parts of the grammar.
New Zealand English could be more similar to Australian English in phrasal verbs but at the same time more similar to UK English in dative phrases.
arXiv Detail & Related papers (2023-09-21T08:14:34Z)
- DADA: Dialect Adaptation via Dynamic Aggregation of Linguistic Rules [64.93179829965072]
DADA is a modular approach to imbue SAE-trained models with multi-dialectal robustness.
We show that DADA is effective for both single-task and instruction-finetuned language models.
arXiv Detail & Related papers (2023-05-22T18:43:31Z)
- Variation and Instability in Dialect-Based Embedding Spaces [0.0]
This paper measures variation in embedding spaces which have been trained on different regional varieties of English.
Experiments confirm that embedding spaces are significantly influenced by the dialect represented in the training data.
arXiv Detail & Related papers (2023-03-27T07:53:23Z)
- CCPrefix: Counterfactual Contrastive Prefix-Tuning for Many-Class Classification [57.62886091828512]
We propose a brand-new prefix-tuning method, Counterfactual Contrastive Prefix-tuning (CCPrefix) for many-class classification.
Basically, an instance-dependent soft prefix, derived from fact-counterfactual pairs in the label space, is leveraged to complement the language verbalizers in many-class classification.
arXiv Detail & Related papers (2022-11-11T03:45:59Z)
- Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning [92.07643510310766]
Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We empirically find that existing temporal grounding models fail to generalize to queries with novel combinations of seen words.
We propose a variational cross-graph reasoning framework that explicitly decomposes video and language into multiple structured hierarchies.
arXiv Detail & Related papers (2022-03-24T12:55:23Z)
- Finding Variants for Construction-Based Dialectometry: A Corpus-Based Approach to Regional CxGs [0.0]
This paper develops a construction-based dialectometry capable of identifying previously unknown constructions.
It offers a method for measuring the aggregate similarity between regional CxGs without limiting in advance the set of constructions subject to variation.
arXiv Detail & Related papers (2021-04-03T02:52:14Z)
- Lexical semantic change for Ancient Greek and Latin [61.69697586178796]
Associating a word's correct meaning in its historical context is a central challenge in diachronic research.
We build on a recent computational approach to semantic change based on a dynamic Bayesian mixture model.
We provide a systematic comparison of dynamic Bayesian mixture models for semantic change with state-of-the-art embedding-based models.
arXiv Detail & Related papers (2021-01-22T12:04:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.