Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in
Low-Resource English Varieties
- URL: http://arxiv.org/abs/2209.07611v1
- Date: Thu, 15 Sep 2022 21:19:31 GMT
- Title: Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in
Low-Resource English Varieties
- Authors: Tessa Masis, Anissa Neal, Lisa Green, Brendan O'Connor
- Abstract summary: We present a human-in-the-loop approach to generate and filter effective contrast sets via corpus-guided edits.
We show that our approach improves feature detection for both Indian English and African American English, demonstrate how it can assist linguistic research, and release our fine-tuned models for use by other researchers.
- Score: 3.3536302616846734
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The study of language variation examines how language varies between and
within different groups of speakers, shedding light on how we use language to
construct identities and how social contexts affect language use. A common
method is to identify instances of a certain linguistic feature - say, the zero
copula construction - in a corpus, and analyze the feature's distribution
across speakers, topics, and other variables, to either gain a qualitative
understanding of the feature's function or systematically measure variation. In
this paper, we explore the challenging task of automatic morphosyntactic
feature detection in low-resource English varieties. We present a
human-in-the-loop approach to generate and filter effective contrast sets via
corpus-guided edits. We show that our approach improves feature detection for
both Indian English and African American English, demonstrate how it can assist
linguistic research, and release our fine-tuned models for use by other
researchers.
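The abstract does not include implementation details, but its two core ingredients - treating feature detection as sentence-level classification with a fine-tuned model, and building contrast sets through small, corpus-guided edits that a human then filters - can be illustrated. The sketch below is a minimal, hypothetical example for the zero copula feature; the checkpoint path and the toy edit rule are placeholders, not the authors' released models or edit procedure.

```python
# Minimal sketch (not the authors' released code): zero-copula detection as
# binary sentence classification, plus a toy edit that creates a contrast
# pair by toggling the overt copula. The model path is hypothetical.
import re
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="path/to/fine-tuned-feature-detector",  # hypothetical checkpoint
)

def copula_contrast(sentence: str) -> str:
    """Toy edit rule: toggle an overt copula to build a minimal pair.

    'She is tall' -> 'She tall' (zero copula); 'He going home' -> 'He is going home'.
    In the paper, edits are guided by corpus examples and filtered by humans.
    """
    if re.search(r"\b(is|are)\b", sentence):
        return re.sub(r"\s\b(is|are)\b", "", sentence, count=1)
    # crude copula insertion after the first word; purely illustrative
    return re.sub(r"^(\w+)\s", r"\1 is ", sentence, count=1)

original = "He is going home"
edited = copula_contrast(original)      # "He going home"
for s in (original, edited):
    print(s, detector(s))               # predicted label + confidence per variant
```

Cases where the detector's predictions on the original and the edited variant disagree in unexpected ways are the kind of candidates a human annotator would inspect when filtering contrast sets.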
Related papers
- Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement [1.4335183427838039]
We take the approach of developing curated synthetic data on a large scale, with specific properties.
We use a new multiple-choice task and datasets, Blackbird Language Matrices, to focus on a specific grammatical structural phenomenon.
We show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences.
arXiv Detail & Related papers (2024-09-10T14:58:55Z)
- Assessing the Role of Lexical Semantics in Cross-lingual Transfer through Controlled Manipulations [15.194196775504613]
We analyze how differences between English and a target language influence the capacity to align the language with an English pretrained representation space.
We show that while properties such as the script or word order only have a limited impact on alignment quality, the degree of lexical matching between the two languages, which we define using a measure of translation entropy, greatly affects it.
arXiv Detail & Related papers (2024-08-14T14:59:20Z)
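The summary above defines lexical matching via "a measure of translation entropy" without spelling it out. A common way to operationalize this is the entropy of each source word's empirical distribution over aligned target translations, averaged over the vocabulary; the sketch below assumes that formulation and uses toy alignment counts, so the cited paper's exact definition may differ.

```python
# Illustrative translation-entropy computation under an assumed definition:
# per-source-word entropy of aligned target translations, averaged over the
# vocabulary. The alignment counts below are toy data.
import math
from collections import Counter

alignments = {                      # source word -> counts of aligned target words
    "bank": Counter({"banque": 7, "rive": 3}),
    "dog": Counter({"chien": 10}),
}

def translation_entropy(counts: Counter) -> float:
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

per_word = {w: translation_entropy(c) for w, c in alignments.items()}
avg_entropy = sum(per_word.values()) / len(per_word)
print(per_word)     # {'bank': 0.881..., 'dog': 0.0}
print(avg_entropy)  # lower average entropy = tighter lexical matching
```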
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
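As a rough illustration of what an information-theoretic query policy can look like (a hedged sketch, not the paper's model): keep a posterior over candidate phonotactic constraints and ask the informant about the form whose grammaticality judgment is expected to reduce posterior entropy the most. The hypothesis space and candidate forms below are toy examples.

```python
# Toy expected-information-gain query selection over candidate *CC constraints.
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

hypotheses = ["kp", "bn", "tk"]                 # each hypothesis bans one cluster
posterior = [1 / len(hypotheses)] * len(hypotheses)

def predicts_ok(hyp, form):
    return hyp not in form                      # grammatical iff banned cluster absent

def expected_info_gain(form):
    h_now = entropy(posterior)
    gain = 0.0
    for answer in (True, False):                # informant says grammatical / not
        agree = [p for h, p in zip(hypotheses, posterior)
                 if predicts_ok(h, form) == answer]
        p_answer = sum(agree)
        if p_answer == 0:
            continue
        gain += p_answer * (h_now - entropy([p / p_answer for p in agree]))
    return gain

candidates = ["akpa", "abna", "atka", "apa"]
print(max(candidates, key=expected_info_gain))  # most informative form to query
```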
- Multilingual Few-Shot Learning via Language Model Retrieval [18.465566186549072]
Transformer-based language models have achieved remarkable success in few-shot in-context learning.
We conduct a study of retrieving semantically similar few-shot samples and using them as the context.
We evaluate the proposed method on five natural language understanding datasets related to intent detection, question classification, sentiment analysis, and topic classification.
arXiv Detail & Related papers (2023-06-19T14:27:21Z)
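The retrieval idea in the few-shot paper above - embed a labeled pool, pick the examples most similar to the query, and use them as in-context demonstrations - can be sketched as follows. This is a generic illustration, not the paper's exact pipeline; the encoder checkpoint and the toy intent-detection pool are placeholder choices.

```python
# Generic sketch of retrieval-augmented few-shot prompting (toy data).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # example checkpoint

pool = [("how do I reset my password", "account"),
        ("what will the weather be tomorrow", "weather"),
        ("play some jazz music", "music")]
query = "I forgot my login credentials"

pool_vecs = encoder.encode([text for text, _ in pool])
q_vec = encoder.encode([query])[0]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

top = sorted(range(len(pool)), key=lambda i: cosine(pool_vecs[i], q_vec), reverse=True)[:2]
prompt = "\n".join(f"Text: {pool[i][0]}\nIntent: {pool[i][1]}" for i in top)
prompt += f"\nText: {query}\nIntent:"
print(prompt)   # context to feed the language model for in-context prediction
```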
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- Controlling Extra-Textual Attributes about Dialogue Participants: A Case Study of English-to-Polish Neural Machine Translation [4.348327991071386]
Machine translation models need to opt for a certain interpretation of textual context when translating from English to Polish.
We propose a case study where a wide range of approaches for controlling attributes in translation is employed.
The best model achieves an improvement of +5.81 chrF++/+6.03 BLEU, with other models achieving competitive performance.
arXiv Detail & Related papers (2022-05-10T08:45:39Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
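Bitext retrieval performance, used above as a task-based measure of cross-lingual alignment, can be illustrated with a small sketch: encode both sides of a parallel corpus and count how often each source sentence's nearest neighbour among the targets is its true translation. The encoder and the three-sentence toy bitext are placeholders, not the paper's setup.

```python
# Toy bitext retrieval as an alignment measure (not the paper's setup).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # any multilingual encoder

src = ["The cat sleeps.", "I like coffee.", "It is raining."]
tgt = ["Le chat dort.", "J'aime le café.", "Il pleut."]       # aligned translations

src_vecs = encoder.encode(src, normalize_embeddings=True)
tgt_vecs = encoder.encode(tgt, normalize_embeddings=True)

sims = src_vecs @ tgt_vecs.T                  # cosine similarity matrix
retrieved = sims.argmax(axis=1)               # nearest target for each source
accuracy = float(np.mean(retrieved == np.arange(len(src))))
print(accuracy)   # 1.0 means perfect retrieval on this toy bitext
```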
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
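Singular vector canonical correlation analysis (SVCCA), mentioned above, reduces each representation "view" with SVD and then measures how strongly the reduced views correlate via CCA. The sketch below runs the procedure on toy data; the views, dimensions, and component counts are illustrative assumptions, not the paper's experimental setup.

```python
# Illustrative SVCCA on two toy "views" of per-language representations.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_langs = 50
view_a = rng.normal(size=(n_langs, 32))                           # e.g. learned language vectors
view_b = view_a[:, :16] + 0.1 * rng.normal(size=(n_langs, 16))    # e.g. typology-derived features

def svd_reduce(x, k):
    u, s, _ = np.linalg.svd(x - x.mean(0), full_matrices=False)
    return u[:, :k] * s[:k]                   # keep top-k singular directions

a_red, b_red = svd_reduce(view_a, 8), svd_reduce(view_b, 8)
cca = CCA(n_components=4).fit(a_red, b_red)
a_c, b_c = cca.transform(a_red, b_red)
corrs = [np.corrcoef(a_c[:, i], b_c[:, i])[0, 1] for i in range(4)]
print(np.mean(corrs))   # high mean canonical correlation = shared information across views
```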
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models fitted to the word order of the source language might fail to handle target languages with different word orders.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)