Comparing Styles across Languages
- URL: http://arxiv.org/abs/2310.07135v2
- Date: Tue, 5 Dec 2023 02:18:40 GMT
- Title: Comparing Styles across Languages
- Authors: Shreya Havaldar, Matthew Pressimone, Eric Wong, Lyle Ungar
- Abstract summary: We introduce an explanation framework to extract stylistic differences from multilingual LMs and compare styles across languages.
Our framework generates comprehensive style lexica in any language.
We apply this framework to compare politeness, creating the first holistic multilingual politeness dataset.
- Score: 12.585216712212437
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding how styles differ across languages is advantageous for training both humans and computers to generate culturally appropriate text. We introduce an explanation framework to extract stylistic differences from multilingual LMs and compare styles across languages. Our framework (1) generates comprehensive style lexica in any language and (2) consolidates feature importances from LMs into comparable lexical categories. We apply this framework to compare politeness, creating the first holistic multilingual politeness dataset and exploring how politeness varies across four languages. Our approach enables an effective evaluation of how distinct linguistic categories contribute to stylistic variations and provides interpretable insights into how people communicate differently around the world.
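As a hedged illustration of step (2) of the framework, consolidating token-level feature importances into comparable lexical categories: the sketch below assumes per-token importance scores are already available from some attribution method over the LM, and the category names and lexicon words are illustrative placeholders, not the paper's actual lexica.

```python
from collections import defaultdict

def consolidate_importances(tokens, importances, lexicon):
    """Aggregate token-level importance scores into lexical categories.

    tokens      : list[str]   - tokens of one input sentence
    importances : list[float] - attribution score per token (any method)
    lexicon     : dict[str, set[str]] - category name -> member words
    """
    totals = defaultdict(float)
    for tok, score in zip(tokens, importances):
        for category, words in lexicon.items():
            if tok.lower() in words:
                totals[category] += score
    return dict(totals)

# Toy politeness lexicon (hypothetical categories, not the paper's).
lexicon = {
    "gratitude": {"thanks", "thank", "grateful"},
    "hedges":    {"maybe", "perhaps", "possibly"},
}
tokens = ["thanks", ",", "maybe", "we", "could", "meet"]
importances = [0.9, 0.0, 0.4, 0.1, 0.2, 0.1]  # e.g., from gradient attributions
print(consolidate_importances(tokens, importances, lexicon))
# {'gratitude': 0.9, 'hedges': 0.4}
```

Aggregating importances into shared categories like these is what makes scores comparable across languages even when the surface vocabularies differ.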
Related papers
- Are Structural Concepts Universal in Transformer Language Models? Towards Interpretable Cross-Lingual Generalization [27.368684663279463]
We investigate the potential for explicitly aligning conceptual correspondence between languages to enhance cross-lingual generalization.
Using the syntactic aspect of language as a testbed, our analyses of 43 languages reveal a high degree of alignability.
We propose a meta-learning-based method to learn to align conceptual spaces of different languages.
arXiv Detail & Related papers (2023-10-19T14:50:51Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
Models perform significantly worse on all of these languages than on English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Multi-level Contrastive Learning for Cross-lingual Spoken Language Understanding [90.87454350016121]
We develop novel code-switching schemes to generate hard negative examples for contrastive learning at all levels.
We develop a label-aware joint model to leverage label semantics for cross-lingual knowledge transfer.
arXiv Detail & Related papers (2022-05-07T13:44:28Z)
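A rough sketch of the code-switching idea in the entry above: replacing some tokens with dictionary translations yields an utterance that is lexically close to the anchor but mixes languages, which can serve as a hard negative for contrastive learning. The bilingual dictionary and replacement scheme below are toy assumptions; the paper's schemes are not specified in this summary.

```python
import random

# Tiny toy English-French dictionary (hypothetical, for illustration only).
BILINGUAL_DICT = {"play": "jouer", "music": "musique", "weather": "météo"}

def code_switch(tokens, dictionary, p=0.5, seed=0):
    """Replace a random subset of translatable tokens with their
    translations, producing a mixed-language 'hard negative' that
    stays lexically close to the source utterance."""
    rng = random.Random(seed)
    return [
        dictionary[t] if t in dictionary and rng.random() < p else t
        for t in tokens
    ]

anchor = "play some music".split()
negative = code_switch(anchor, BILINGUAL_DICT, p=1.0)
print(negative)  # ['jouer', 'some', 'musique']
```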
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
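One plausible reading of the clustering step in the entry above, sketched with synthetic vectors: derive one representation per language from a multilingual model, then cluster languages so that each cluster forms a "representation sprachbund". The vectors and cluster count below are stand-in assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins for per-language representations; in the paper
# these come from a multilingual pre-trained model.
rng = np.random.default_rng(0)
languages = ["en", "de", "fr", "hi", "ur", "zh"]
lang_vecs = rng.normal(size=(len(languages), 768))

# Each cluster of languages is one "representation sprachbund".
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(lang_vecs)
for lang, label in zip(languages, labels):
    print(f"{lang} -> sprachbund {label}")
```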
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
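The summary above does not specify the contrastive task; as a hedged illustration, the sketch below shows a generic cross-lingual InfoNCE objective over parallel sentence embeddings, where each source sentence should match its own translation against in-batch negatives. This is a standard formulation, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def info_nce(src_emb, tgt_emb, temperature=0.05):
    """InfoNCE over a batch of parallel sentence embeddings: each
    source sentence is pushed toward its own translation (diagonal)
    and away from the other in-batch translations (negatives)."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / temperature     # (B, B) similarity matrix
    targets = torch.arange(src.size(0))    # diagonal entries = true pairs
    return F.cross_entropy(logits, targets)

# Synthetic embeddings standing in for encoder outputs on a batch of
# parallel (e.g., English/French) sentence pairs.
src = torch.randn(8, 256)
tgt = src + 0.1 * torch.randn(8, 256)  # translations land near their sources
print(info_nce(src, tgt).item())
```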
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
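A minimal sketch of singular vector CCA (SVCCA), as named in the entry above: denoise each view of the language representations with a truncated SVD, then measure alignment between the reduced views via CCA. The data here is synthetic; the views and dimensions are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def svcca(view_a, view_b, keep=10):
    """SVCCA: truncated-SVD reduction of each view, then the mean
    correlation across canonical components from CCA."""
    def reduce(x, k):
        x = x - x.mean(axis=0)
        u, s, _ = np.linalg.svd(x, full_matrices=False)
        return u[:, :k] * s[:k]
    a, b = reduce(view_a, keep), reduce(view_b, keep)
    a_c, b_c = CCA(n_components=keep).fit_transform(a, b)
    corrs = [np.corrcoef(a_c[:, i], b_c[:, i])[0, 1] for i in range(keep)]
    return float(np.mean(corrs))

# Synthetic "multi-view" vectors (stand-ins for, e.g., typological
# features vs. NMT-learned embeddings over the same languages).
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 50))
view_a = base + 0.1 * rng.normal(size=base.shape)
view_b = base @ rng.normal(size=(50, 50)) + 0.1 * rng.normal(size=(100, 50))
print(svcca(view_a, view_b, keep=10))
```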
- Identifying Distributional Perspective Differences from Colingual Groups [41.58939666949895]
A lack of mutual understanding among different groups about their perspectives on specific values or events may lead to uninformed decisions or biased opinions.
We study colingual groups and use language corpora as a proxy to identify their distributional perspectives.
We present a novel computational approach to learn shared understandings, and benchmark our method by building culturally-aware models for the English, Chinese, and Japanese languages.
arXiv Detail & Related papers (2020-04-10T08:13:07Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets, one for each of the C(12,2) = 66 language pairs.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
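For context, the standard intrinsic evaluation on a Multi-SimLex-style benchmark is the Spearman correlation between human similarity ratings and model cosine similarities over the concept pairs; the sketch below uses toy pairs, illustrative ratings, and random vectors as stand-ins.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(pairs, ratings, vectors):
    """Spearman correlation between human similarity ratings and the
    cosine similarity of word vectors for each concept pair."""
    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    scores = [cosine(vectors[w1], vectors[w2]) for w1, w2 in pairs]
    return spearmanr(scores, ratings).correlation

# Toy pairs with illustrative ratings and random stand-in vectors.
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["cat", "dog", "car", "truck"]}
pairs = [("cat", "dog"), ("car", "truck"), ("cat", "truck")]
ratings = [5.2, 4.8, 0.7]  # hypothetical human similarity judgments
print(evaluate_similarity(pairs, ratings, vectors))
```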
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and accepts no responsibility for any consequences of its use.