TyDiP: A Dataset for Politeness Classification in Nine Typologically
Diverse Languages
- URL: http://arxiv.org/abs/2211.16496v1
- Date: Tue, 29 Nov 2022 18:58:15 GMT
- Authors: Anirudh Srinivasan, Eunsol Choi
- Abstract summary: We study politeness phenomena in nine typologically diverse languages.
We create TyDiP, a dataset containing three-way politeness annotations for 500 examples in each language.
- Score: 33.540256516320326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study politeness phenomena in nine typologically diverse languages.
Politeness is an important facet of communication and is sometimes argued to be
culture-specific, yet existing computational linguistic studies are limited to
English. We create TyDiP, a dataset containing three-way politeness annotations
for 500 examples in each language, totaling 4.5K examples. We evaluate how well
multilingual models can identify politeness levels -- they show a fairly robust
zero-shot transfer ability, yet fall significantly short of estimated human
accuracy. We further study mapping the English politeness strategy lexicon
into nine languages via automatic translation and lexicon induction, analyzing
whether each strategy's impact stays consistent across languages. Lastly, we
empirically study the complicated relationship between formality and politeness
through transfer experiments. We hope our dataset will support various research
questions and applications, from evaluating multilingual models to constructing
polite multilingual agents.
Related papers
- A Computational Model for the Assessment of Mutual Intelligibility Among
Closely Related Languages [1.5773159234875098]
Closely related languages show linguistic similarities that allow speakers of one language to understand speakers of another language without having actively learned it.
Mutual intelligibility varies in degree and is typically tested in psycholinguistic experiments.
We propose a computer-assisted method using the Linear Discriminative Learner to approximate the cognitive processes by which humans learn languages.
Date: 2024-02-05T11:32:13Z
- Languages You Know Influence Those You Learn: Impact of Language
Characteristics on Multi-Lingual Text-to-Text Transfer [4.554080966463776]
Multi-lingual language models (LM) have been remarkably successful in enabling natural language tasks in low-resource languages.
We try to better understand how such models, specifically mT5, transfer *any* linguistic and semantic knowledge across languages.
A key finding of this work is that similarity of syntax, morphology and phonology are good predictors of cross-lingual transfer.
Date: 2022-12-04T07:22:21Z
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
Date: 2021-09-01T09:32:06Z
- Are pre-trained text representations useful for multilingual and
multi-dimensional language proficiency modeling? [6.294759639481189]
This paper describes our experiments and observations about the role of pre-trained and fine-tuned multilingual embeddings in performing multi-dimensional, multilingual language proficiency classification.
Our results indicate that while fine-tuned embeddings are useful for multilingual proficiency modeling, none of the features achieve consistently best performance for all dimensions of language proficiency.
Date: 2021-02-25T16:23:52Z
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
Date: 2020-05-02T04:34:37Z
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
Date: 2020-05-01T12:22:33Z
- Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
Date: 2020-04-30T16:25:39Z
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
Date: 2020-04-09T19:50:32Z
- TyDi QA: A Benchmark for Information-Seeking Question Answering in
Typologically Diverse Languages [27.588857710802113]
TyDi QA is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs.
We present a quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena.
Date: 2020-03-10T21:11:53Z
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual
Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
Date: 2020-03-10T17:17:01Z