Finding Variants for Construction-Based Dialectometry: A Corpus-Based
Approach to Regional CxGs
- URL: http://arxiv.org/abs/2104.01299v1
- Date: Sat, 3 Apr 2021 02:52:14 GMT
- Title: Finding Variants for Construction-Based Dialectometry: A Corpus-Based
Approach to Regional CxGs
- Authors: Jonathan Dunn
- Abstract summary: This paper develops a construction-based dialectometry capable of identifying previously unknown constructions.
It offers a method for measuring the aggregate similarity between regional CxGs without limiting in advance the set of constructions subject to variation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper develops a construction-based dialectometry capable of identifying
previously unknown constructions and measuring the degree to which a given
construction is subject to regional variation. The central idea is to learn a
grammar of constructions (a CxG) using construction grammar induction and then
to use these constructions as features for dialectometry. This offers a method
for measuring the aggregate similarity between regional CxGs without limiting
in advance the set of constructions subject to variation. The learned CxG is
evaluated on how well it describes held-out test corpora while dialectometry is
evaluated on how well it can model regional varieties of English. The method is
tested using two distinct datasets: First, the International Corpus of English
representing eight outer circle varieties; Second, a web-crawled corpus
representing five inner circle varieties. Results show that the method (1)
produces a grammar with stable quality across sub-sets of a single corpus that
is (2) capable of distinguishing between regional varieties of English with a
high degree of accuracy, thus (3) supporting dialectometric methods for measuring
the similarity between varieties of English and (4) measuring the degree to
which each construction is subject to regional variation. This is important for
cognitive sociolinguistics because it operationalizes the idea that competition
between constructions is organized at the functional level so that
dialectometry needs to represent as much of the available functional space as
possible.
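As a rough illustration of the idea in the abstract (constructions used as features, then aggregate similarity measured between regional grammars), the following minimal sketch compares regions by the relative frequency of each construction. It is not the paper's implementation: the construction IDs, the counts, the region names, and the choice of cosine similarity are all assumptions made for illustration, and the grammar induction step is skipped entirely.

```python
import math
from collections import Counter


def relative_frequencies(counts, inventory):
    """Turn raw construction counts into a relative-frequency vector over a shared inventory."""
    total = sum(counts.get(cx, 0) for cx in inventory)
    return [counts.get(cx, 0) / total if total else 0.0 for cx in inventory]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def aggregate_similarity(region_counts):
    """Pairwise similarity between regions, using every observed construction as a feature."""
    # The inventory is not limited in advance: it is whatever constructions were observed.
    inventory = sorted(set().union(*region_counts.values()))
    vectors = {
        region: relative_frequencies(counts, inventory)
        for region, counts in region_counts.items()
    }
    regions = sorted(vectors)
    return {
        (a, b): cosine_similarity(vectors[a], vectors[b])
        for i, a in enumerate(regions)
        for b in regions[i + 1:]
    }


# Hypothetical toy data: construction IDs and counts are made up for illustration.
toy_counts = {
    "region_a": Counter({"cx1": 10, "cx2": 5}),
    "region_b": Counter({"cx1": 10, "cx2": 5}),
    "region_c": Counter({"cx3": 7}),
}
sims = aggregate_similarity(toy_counts)
for pair, value in sims.items():
    print(pair, round(value, 3))
```

In this toy setup, regions with identical construction profiles score close to 1 and regions with no shared constructions score 0. A real study would replace the toy counts with frequencies of constructions from a learned CxG.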
Related papers
- Validating and Exploring Large Geographic Corpora [0.76146285961466]
Three methods are used to improve the quality of sub-corpora representing specific language-country pairs like New Zealand English.
The evaluation shows that the validity of sub-corpora is improved with each stage of cleaning but that this improvement is unevenly distributed across languages and populations.
arXiv Detail & Related papers (2024-03-13T02:46:17Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - Cross-corpus Readability Compatibility Assessment for English Texts [6.225179315266989]
We propose a novel evaluation framework, Cross-corpus text Readability Compatibility Assessment.
The framework encompasses three key components: Corpus: CEFR, CLEC, CLOTH, NES, OSP, and RACE.
Our findings revealed that OSP stood out as significantly different from other datasets.
arXiv Detail & Related papers (2023-06-16T09:15:39Z) - The Better Your Syntax, the Better Your Semantics? Probing Pretrained
Language Models for the English Comparative Correlative [7.03497683558609]
Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasising the connection between syntax and semantics.
We present an investigation of their capability to classify and understand one of the most commonly studied constructions, the English comparative correlative (CC).
Our results show that all three investigated PLMs are able to recognise the structure of the CC but fail to use its meaning.
arXiv Detail & Related papers (2022-10-24T13:01:24Z) - Stability of Syntactic Dialect Classification Over Space and Time [0.0]
This paper constructs a test set for 12 dialects of English that spans three years at monthly intervals with a fixed spatial distribution across 1,120 cities.
The decay rate of classification performance for each dialect over time allows us to identify regions undergoing syntactic change.
And the distribution of classification accuracy within dialect regions allows us to identify the degree to which the grammar of a dialect is internally heterogeneous.
arXiv Detail & Related papers (2022-09-11T23:14:59Z) - Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
To our knowledge, this work presents the first systematic study of formality detection based on statistical, neural-based, and Transformer-based machine learning methods.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows that the Char BiLSTM model outperforms Transformer-based ones on the monolingual and multilingual formality classification tasks.
arXiv Detail & Related papers (2022-04-19T16:23:07Z) - A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - Global Syntactic Variation in Seven Languages: Towards a Computational
Dialectology [0.0]
We use Computational Construction Grammar to provide a replicable and falsifiable set of syntactic features.
We use global language mapping based on web-crawled and social media datasets to determine the selection of national varieties.
Results show that models for each language are able to robustly predict the region-of-origin of held-out samples better using Construction Grammars.
arXiv Detail & Related papers (2021-04-03T03:40:21Z) - Fake it Till You Make it: Self-Supervised Semantic Shifts for
Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z) - Constructing a Family Tree of Ten Indo-European Languages with
Delexicalized Cross-linguistic Transfer Patterns [57.86480614673034]
We formalize the delexicalized transfer as interpretable tree-to-string and tree-to-tree patterns.
This allows us to quantitatively probe cross-linguistic transfer and extend inquiries of Second Language Acquisition.
arXiv Detail & Related papers (2020-07-17T15:56:54Z) - Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)