Variation and Instability in Dialect-Based Embedding Spaces
- URL: http://arxiv.org/abs/2303.14963v1
- Date: Mon, 27 Mar 2023 07:53:23 GMT
- Title: Variation and Instability in Dialect-Based Embedding Spaces
- Authors: Jonathan Dunn
- Abstract summary: This paper measures variation in embedding spaces which have been trained on different regional varieties of English.
Experiments confirm that embedding spaces are significantly influenced by the dialect represented in the training data.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper measures variation in embedding spaces that have been trained on
different regional varieties of English while controlling for instability in
the embeddings. While previous work has shown that it is possible to
distinguish between similar varieties of a language, this paper investigates
two follow-up questions. First, does the variety represented in the
training data systematically influence the resulting embedding space after
training? This paper shows that differences in embeddings across varieties are
significantly higher than baseline instability. Second, is such dialect-based
variation spread equally throughout the lexicon? This paper shows that specific
parts of the lexicon are particularly subject to variation. Taken together,
these experiments confirm that embedding spaces are significantly influenced by
the dialect represented in the training data. This finding implies that there
is semantic variation across dialects, in addition to previously studied
lexical and syntactic variation.
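The paper's first experiment can be pictured with a minimal sketch. The vectors, noise scales, and the `procrustes_align` helper below are all illustrative assumptions, not the paper's actual data or pipeline: two embedding matrices are aligned with an orthogonal Procrustes rotation, and per-word cosine distances across "dialects" are compared against the distances produced by mere retraining noise.

```python
import numpy as np

def procrustes_align(A, B):
    """Orthogonal rotation W minimizing ||A @ W - B|| (Frobenius norm)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def per_word_distance(A, B):
    """Per-row cosine distance after rotating A onto B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    A = A @ procrustes_align(A, B)
    return 1.0 - np.sum(A * B, axis=1)

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 50))                       # shared structure
run_a = base + rng.normal(scale=0.01, size=base.shape)   # retraining noise only
run_b = base + rng.normal(scale=0.01, size=base.shape)
dialect = base + rng.normal(scale=0.3, size=base.shape)  # simulated dialect shift

instability = per_word_distance(run_a, run_b).mean()  # baseline instability
variation = per_word_distance(run_a, dialect).mean()  # cross-dialect difference
print(variation > instability)  # → True
```

The point of the baseline is that two training runs on identical data already disagree somewhat; dialect-based variation only matters if it exceeds that floor, as it does here by construction.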
Related papers
- Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum
We measure speech-to-text performance on Italian dialects, and empirically observe a geographical performance disparity.
This disparity correlates substantially (-0.5) with linguistic similarity to the highest performing dialect variety.
We additionally leverage geostatistical methods to predict zero-shot performance at unseen sites, and find the incorporation of geographical information to substantially improve prediction performance.
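The reported relationship can be made concrete with a toy computation. The numbers below are invented for illustration, not the paper's measurements: a Pearson correlation between each dialect's linguistic similarity to the best-performing variety and its word error rate.

```python
import numpy as np

# Hypothetical values, not the paper's data: similarity of each dialect
# to the best-performing variety, and that dialect's word error rate.
similarity = np.array([1.00, 0.85, 0.70, 0.60, 0.45, 0.30])
wer = np.array([0.10, 0.14, 0.18, 0.22, 0.30, 0.35])

# Higher similarity goes with lower error, so r comes out negative,
# matching the direction of the correlation the paper reports.
r = np.corrcoef(similarity, wer)[0, 1]
print(r < 0)  # → True
```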
arXiv Detail & Related papers (2024-10-18T16:39:42Z)
- Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus
We present a dataset of 19th century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags.
We find indications that the "dialect effect" produced by intentional orthographic variation employs multiple linguistic channels.
arXiv Detail & Related papers (2024-10-03T16:58:21Z)
- The Lou Dataset -- Exploring the Impact of Gender-Fair Language in German Text Classification
Gender-fair language fosters inclusion by addressing all genders or using neutral forms.
Gender-fair language substantially impacts predictions by flipping labels, reducing certainty, and altering attention patterns.
While we offer initial insights on the effect on German text classification, the findings likely apply to other languages.
arXiv Detail & Related papers (2024-09-26T15:08:17Z)
- Modeling Orthographic Variation in Occitan's Dialects
Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
arXiv Detail & Related papers (2024-04-30T07:33:51Z)
- Cross-Linguistic Syntactic Difference in Multilingual BERT: How Good is It and How Does It Affect Transfer?
Multilingual BERT (mBERT) has demonstrated considerable cross-lingual syntactic ability.
We investigate the distributions of grammatical relations induced from mBERT in the context of 24 typologically different languages.
arXiv Detail & Related papers (2022-12-21T09:44:08Z)
- Stability of Syntactic Dialect Classification Over Space and Time
This paper constructs a test set for 12 dialects of English that spans three years at monthly intervals with a fixed spatial distribution across 1,120 cities.
The decay rate of classification performance for each dialect over time allows us to identify regions undergoing syntactic change.
The distribution of classification accuracy within dialect regions allows us to identify the degree to which the grammar of a dialect is internally heterogeneous.
arXiv Detail & Related papers (2022-09-11T23:14:59Z)
- Contextualized language models for semantic change detection: lessons learned
We present a qualitative analysis of the outputs of contextualized embedding-based methods for detecting diachronic semantic change.
Our findings show that contextualized methods can often predict high change scores for words which are not undergoing any real diachronic semantic shift.
Our conclusion is that pre-trained contextualized language models are prone to confound changes in lexicographic senses and changes in contextual variance.
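The variance confound described above can be illustrated with a toy sketch. The vectors below are synthetic stand-ins, not real contextualized embeddings, and the average-pairwise-distance (APD) scorer is one common change measure, not necessarily the one these authors evaluated: a word with high contextual spread receives a large change score even though nothing shifts diachronically.

```python
import numpy as np

def apd(X, Y):
    """Average pairwise cosine distance between two sets of token embeddings."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - (X @ Y.T).mean()

rng = np.random.default_rng(1)
center = rng.normal(size=16)  # a fixed sense centroid: no real change occurs

def sample(scale, n=200):
    """Token embeddings scattered around the same centroid in both periods."""
    return center + rng.normal(scale=scale, size=(n, 16))

# Neither word undergoes semantic change between period 1 and period 2,
# but the high-variance word still receives a much larger change score.
score_low = apd(sample(0.1), sample(0.1))
score_high = apd(sample(2.0), sample(2.0))
print(score_high > score_low)  # → True
```

This is exactly the failure mode the summary names: contextual variance inflates the score without any change in lexicographic sense.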
arXiv Detail & Related papers (2022-08-31T23:35:24Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
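A sketch of the detection step in the spirit of the "fake it" idea: the matrices and `vocab` list below are invented, and the two spaces are assumed to be already aligned by whatever alignment method is in use. A known artificial shift is injected for one word, giving a free correctness check when words are ranked by cosine distance.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["bank", "cell", "mouse", "stream", "tablet"]  # hypothetical vocabulary
E1 = rng.normal(size=(5, 32))                          # embedding space, period 1
E2 = E1 + rng.normal(scale=0.05, size=E1.shape)        # mostly stable period 2

# Inject a synthetic ("fake") shift for one word: since we planted it,
# we can verify that the detector recovers it, with no manual labels.
E2[2] = rng.normal(size=32)

def cosine_distance(a, b):
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Detection reduces to ranking words by distance between the aligned spaces.
scores = [cosine_distance(E1[i], E2[i]) for i in range(len(vocab))]
print(vocab[int(np.argmax(scores))])  # → mouse
```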
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
- A Matter of Framing: The Impact of Linguistic Formalism on Probing Results
Deep pre-trained contextualized encoders like BERT (Devlin et al.) demonstrate remarkable performance on a range of downstream tasks.
Recent research in probing investigates the linguistic knowledge implicitly learned by these models during pre-training.
Can the choice of formalism affect probing results?
We find linguistically meaningful differences in the encoding of semantic role- and proto-role information by BERT depending on the formalism.
arXiv Detail & Related papers (2020-04-30T17:45:16Z)
- Translation Artifacts in Cross-lingual Transfer Learning
We show that machine translation can introduce subtle artifacts that have a notable impact in existing cross-lingual models.
In natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them.
We also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
arXiv Detail & Related papers (2020-04-09T17:54:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.