Let's Play Mono-Poly: BERT Can Reveal Words' Polysemy Level and
Partitionability into Senses
- URL: http://arxiv.org/abs/2104.14694v1
- Date: Thu, 29 Apr 2021 23:15:13 GMT
- Title: Let's Play Mono-Poly: BERT Can Reveal Words' Polysemy Level and
Partitionability into Senses
- Authors: Aina Garí Soler and Marianna Apidianaki
- Abstract summary: Pre-trained language models (LMs) encode rich information about linguistic structure but their knowledge about lexical polysemy remains unclear.
We propose a novel experimental setup for analysing this knowledge in LMs specifically trained for different languages.
We demonstrate that BERT-derived representations reflect words' polysemy level and their partitionability into senses.
- Score: 4.915907527975786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (LMs) encode rich information about linguistic
structure but their knowledge about lexical polysemy remains unclear. We
propose a novel experimental setup for analysing this knowledge in LMs
specifically trained for different languages (English, French, Spanish and
Greek) and in multilingual BERT. We perform our analysis on datasets carefully
designed to reflect different sense distributions, and control for parameters
that are highly correlated with polysemy such as frequency and grammatical
category. We demonstrate that BERT-derived representations reflect words'
polysemy level and their partitionability into senses. Polysemy-related
information is more clearly present in English BERT embeddings, but models in
other languages also manage to establish relevant distinctions between words at
different polysemy levels. Our results contribute to a better understanding of
the knowledge encoded in contextualised representations and open up new avenues
for multilingual lexical semantics research.
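The paper's full probing setup is more involved, but the underlying idea can be illustrated with a word's self-similarity: the average pairwise cosine similarity of its contextualised vectors across sentences, which tends to be lower for highly polysemous words. The sketch below is an illustration rather than the paper's exact protocol; it assumes bert-base-uncased and target words that tokenise to a single WordPiece.

```python
# Minimal sketch: approximate a word's polysemy level via the self-similarity
# of its BERT representations across contexts. Highly polysemous words tend
# to receive less similar vectors across sentences than monosemous ones.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def self_similarity(word: str, sentences: list[str]) -> float:
    """Mean pairwise cosine similarity of `word`'s contextual embeddings."""
    target_id = tokenizer.convert_tokens_to_ids(word)  # assumes one WordPiece
    vectors = []
    for sent in sentences:
        inputs = tokenizer(sent, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
        position = inputs["input_ids"][0].tolist().index(target_id)
        vectors.append(hidden[position])
    vecs = torch.stack(vectors)
    sims = torch.nn.functional.cosine_similarity(
        vecs.unsqueeze(1), vecs.unsqueeze(0), dim=-1
    )
    n = len(vectors)
    return ((sims.sum() - n) / (n * (n - 1))).item()  # off-diagonal mean

# A highly polysemous word should usually score lower than a monosemous one.
print(self_similarity("bank", [
    "She sat on the bank of the river.",
    "He deposited the cash at the bank.",
    "The plane went into a steep bank before landing.",
]))
print(self_similarity("oxygen", [
    "Oxygen is essential for respiration.",
    "The tank was filled with pure oxygen.",
    "Plants release oxygen during photosynthesis.",
]))
```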
Related papers
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Evaluating Contextualized Representations of (Spanish) Ambiguous Words: A New Lexical Resource and Empirical Analysis [2.2530496464901106]
We evaluate semantic representations of Spanish ambiguous nouns in context in a suite of Spanish-language monolingual and multilingual BERT-based models.
We find that various BERT-based LMs' contextualized semantic representations capture some variance in human judgments but fall short of the human benchmark.
arXiv Detail & Related papers (2024-06-20T18:58:11Z)
- Exploring Multilingual Concepts of Human Value in Large Language Models: Is Value Alignment Consistent, Transferable and Controllable across Languages? [34.38469832305664]
This paper focuses on human values-related concepts (i.e., value concepts) due to their significance for AI safety.
We first empirically confirm the presence of value concepts within LLMs in a multilingual format.
Further analysis on the cross-lingual characteristics of these concepts reveals 3 traits arising from language resource disparities.
arXiv Detail & Related papers (2024-02-28T07:18:39Z)
- It's not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT [54.84185432755821]
Multilingual BERT (mBERT) learns rich cross-lingual representations that allow for transfer across languages.
We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning.
arXiv Detail & Related papers (2020-10-16T09:49:32Z)
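The abstract above does not spell out the paper's two methods. As a hedged illustration of the kind of zero-shot capability involved, a generic nearest-neighbour translation baseline in mBERT's embedding space (not the paper's methods) might look like this:

```python
# Hypothetical nearest-neighbour word translation in mBERT space; a generic
# baseline, not the paper's two methods. Each word is encoded in isolation
# and represented by the mean of its subword vectors at one middle layer.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained(
    "bert-base-multilingual-cased", output_hidden_states=True
)
model.eval()

def embed(word: str, layer: int = 8) -> torch.Tensor:
    """Unit-normalised mean of `word`'s subword vectors at `layer`."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        states = model(**inputs).hidden_states[layer][0]
    vec = states[1:-1].mean(dim=0)  # drop [CLS] and [SEP]
    return vec / vec.norm()

english = ["dog", "house", "water"]
french = ["chien", "maison", "eau", "livre", "soleil"]
fr_matrix = torch.stack([embed(w) for w in french])

for word in english:
    scores = fr_matrix @ embed(word)  # cosine similarity of unit vectors
    print(word, "->", french[scores.argmax().item()])
```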
- Probing Pretrained Language Models for Lexical Semantics [76.73599166020307]
We present a systematic empirical analysis across six typologically diverse languages and five different lexical tasks.
Our results indicate patterns and best practices that hold universally, but also point to prominent variations across languages and tasks.
arXiv Detail & Related papers (2020-10-12T14:24:01Z)
- Finding Universal Grammatical Relations in Multilingual BERT [47.74015366712623]
We show that subspaces of mBERT representations recover syntactic tree distances in languages other than English.
We present an unsupervised analysis method that provides evidence that mBERT learns representations of syntactic dependency labels.
arXiv Detail & Related papers (2020-05-09T20:46:02Z)
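The entry above is in the structural-probe family (Hewitt and Manning, 2019): a linear map is trained so that squared distances between projected word vectors approximate gold parse-tree distances. A toy sketch of that objective, with random tensors standing in for real mBERT vectors and treebank distances, could be:

```python
# Toy sketch of a structural probe: learn a linear map B so that squared
# distances between projected word vectors approximate parse-tree distances.
# Random tensors stand in for real mBERT vectors and treebank distances.
import torch

def probe_loss(B: torch.Tensor, H: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """H: (n_words, dim) word vectors; D: (n_words, n_words) tree distances."""
    T = H @ B.T                             # project into the probe subspace
    diff = T.unsqueeze(1) - T.unsqueeze(0)  # all pairwise differences
    pred = (diff ** 2).sum(dim=-1)          # predicted squared distances
    return (pred - D).abs().mean()

dim, rank = 768, 64
B = (0.01 * torch.randn(rank, dim)).requires_grad_()
optimizer = torch.optim.Adam([B], lr=0.01)

# A 3-word chain tree: adjacent words at distance 1, the ends at distance 2.
H = torch.randn(3, dim)
D = torch.tensor([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]])
for _ in range(500):
    optimizer.zero_grad()
    loss = probe_loss(B, H, D)
    loss.backward()
    optimizer.step()
print(f"final probe loss: {probe_loss(B, H, D).item():.4f}")
```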
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis (SVCCA) to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
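SVCCA itself is a standard analysis tool: each view is first reduced with an SVD, and canonical correlation analysis is then run between the reduced views. A self-contained toy version, with synthetic data standing in for the paper's two sources of language representations, might be:

```python
# Minimal SVCCA sketch: reduce each view with an SVD, then compute canonical
# correlations between the reduced views (via QR + SVD). Synthetic data
# stands in for the paper's two sources of language representations.
import numpy as np

def svcca(X: np.ndarray, Y: np.ndarray, keep: int = 20) -> float:
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Step 1: keep each view's top singular directions.
    Ux, Sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, Sy, _ = np.linalg.svd(Y, full_matrices=False)
    Xr = Ux[:, :keep] * Sx[:keep]
    Yr = Uy[:, :keep] * Sy[:keep]
    # Step 2: canonical correlations = singular values of Qx^T Qy,
    # where Qx, Qy are orthonormal bases of the reduced views.
    Qx, _ = np.linalg.qr(Xr)
    Qy, _ = np.linalg.qr(Yr)
    return float(np.linalg.svd(Qx.T @ Qy, compute_uv=False).mean())

rng = np.random.default_rng(0)
shared = rng.normal(size=(100, 10))  # latent structure present in both views
X = np.hstack([shared, rng.normal(size=(100, 40))])
Y = np.hstack([shared @ rng.normal(size=(10, 10)), rng.normal(size=(100, 30))])
print(f"mean canonical correlation: {svcca(X, Y):.3f}")
```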
- A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
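Evaluation on a resource like Multi-SimLex typically reduces to one step: score each concept pair with a model and report the Spearman correlation against the human ratings. The sketch below uses made-up pairs and random stand-in vectors, not the actual dataset:

```python
# Sketch of a Multi-SimLex-style evaluation: score concept pairs with a model
# and report Spearman correlation against human ratings. The pairs and
# ratings below are made-up placeholders, not the actual dataset.
import numpy as np
from scipy.stats import spearmanr

pairs = [("car", "automobile", 5.8), ("car", "bicycle", 3.1),
         ("car", "cloud", 0.4), ("happy", "glad", 5.5), ("happy", "sad", 0.9)]

# Random vectors stand in for a real embedding model (fastText, BERT, ...).
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=50) for p in pairs for w in p[:2]}

def cosine(w1: str, w2: str) -> float:
    a, b = vocab[w1], vocab[w2]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

model_scores = [cosine(w1, w2) for w1, w2, _ in pairs]
human_scores = [rating for _, _, rating in pairs]
rho, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")
```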
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.