HIT at SemEval-2022 Task 2: Pre-trained Language Model for Idioms Detection
- URL: http://arxiv.org/abs/2204.06145v1
- Date: Wed, 13 Apr 2022 02:45:04 GMT
- Title: HIT at SemEval-2022 Task 2: Pre-trained Language Model for Idioms Detection
- Authors: Zheng Chu, Ziqing Yang, Yiming Cui, Zhigang Chen, Ming Liu
- Abstract summary: The same multi-word expression may have different meanings in different sentences.
These meanings fall into two categories: literal and idiomatic.
We use a pre-trained language model, which provides a context-aware sentence embedding.
- Score: 23.576133853110324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The same multi-word expression may have different meanings in different
sentences. These meanings fall mainly into two categories: literal and idiomatic.
Non-contextual methods perform poorly on this problem; contextual embeddings are
needed to correctly understand the idiomatic meaning of a multi-word expression.
We use a pre-trained language model, which provides a context-aware sentence
embedding, to detect whether a multi-word expression in a sentence is used
idiomatically.
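As a rough illustration of the approach the abstract describes, the sketch below pairs a sentence with its target expression, takes a context-aware embedding from a pre-trained encoder, and feeds it to a binary classifier. The model name, pooling choice, and classifier head are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch: contextual sentence embedding + binary idiomaticity classifier.
# Assumes a BERT-style encoder from Hugging Face; model choice, pooling, and
# training details are illustrative, not the HIT system's exact configuration.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumption: any multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

classifier = nn.Linear(encoder.config.hidden_size, 2)  # literal vs. idiomatic

def predict_idiomatic(sentence: str, mwe: str) -> torch.Tensor:
    # Pair the sentence with the target expression so the encoder sees both.
    inputs = tokenizer(sentence, mwe, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Use the [CLS] vector as a context-aware sentence embedding.
    cls_embedding = outputs.last_hidden_state[:, 0]
    return classifier(cls_embedding).softmax(dim=-1)  # P(literal), P(idiomatic)

probs = predict_idiomatic("He finally kicked the bucket.", "kicked the bucket")
print(probs)  # untrained head: probabilities are meaningless until fine-tuned
```

In practice the classifier head (and optionally the encoder) would be fine-tuned on labelled literal/idiomatic examples before the probabilities mean anything.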
Related papers
- Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models [88.07940818022468]
We take an initial step toward measuring the role of shared semantics among subwords in encoder-only multilingual language models (mLMs).
We form "semantic tokens" by merging semantically similar subwords and their embeddings.
Inspection of the grouped subwords shows that they exhibit a wide range of semantic similarities; a toy sketch of the merging step follows this entry.
arXiv Detail & Related papers (2024-11-07T08:38:32Z)
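The "semantic token" construction above can be mimicked in a few lines: merge subwords whose embeddings are close in cosine space, then average each group. The greedy grouping rule, threshold, and toy vectors below are assumptions, not the paper's procedure.

```python
# Toy sketch of "semantic tokens": greedily merge subwords whose embeddings
# are close in cosine similarity, then average the merged embeddings.
# The threshold, greedy strategy, and toy vectors are illustrative assumptions.
import numpy as np

subwords = ["haus", "house", "maison", "cat"]
emb = np.array([
    [0.90, 0.10, 0.00],   # toy vectors; a real mLM would supply these
    [0.85, 0.15, 0.00],
    [0.88, 0.12, 0.05],
    [0.00, 0.20, 0.95],
])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

THRESHOLD = 0.95
groups: list[list[int]] = []
for i in range(len(subwords)):
    for g in groups:
        # Join a group only if similar to every member already in it.
        if all(emb[i] @ emb[j] >= THRESHOLD for j in g):
            g.append(i)
            break
    else:
        groups.append([i])

semantic_tokens = {
    "+".join(subwords[j] for j in g): emb[g].mean(axis=0) for g in groups
}
print(list(semantic_tokens))  # ['haus+house+maison', 'cat']
```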
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve text into multiple concepts for multilingual semantic matching, freeing the model from its reliance on NER models.
We conduct comprehensive experiments on the English datasets QQP and MRPC, and the Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- Detecting Unseen Multiword Expressions in American Sign Language [1.2691047660244332]
We tested two systems that apply GloVe word embeddings to predict whether or not given lexemes compose a multiword expression.
The results suggest that word embeddings carry information that can detect non-compositionality with decent accuracy; a toy compositionality score is sketched after this entry.
arXiv Detail & Related papers (2023-09-30T00:54:59Z)
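One common way to realise the idea above with static embeddings, sketched here with toy stand-ins for real GloVe vectors: score a phrase by how far its observed embedding drifts from the composition of its parts. The averaging rule and threshold are illustrative assumptions, not the paper's tested systems.

```python
# Sketch: flag a multiword expression as non-compositional when its observed
# embedding diverges from the average of its component word embeddings.
# Toy vectors stand in for GloVe; the threshold is an illustrative assumption.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

glove = {  # stand-ins for real GloVe vectors
    "kick":        np.array([0.7, 0.1, 0.1]),
    "bucket":      np.array([0.1, 0.8, 0.1]),
    "kick_bucket": np.array([0.1, 0.1, 0.9]),  # phrase vector learned separately
}

parts_avg = (glove["kick"] + glove["bucket"]) / 2
score = cosine(glove["kick_bucket"], parts_avg)
print(f"compositionality score: {score:.2f}")  # ~0.31 for these toy vectors
# A low score (e.g. < 0.5) suggests an idiomatic, non-compositional expression.
```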
- UAlberta at SemEval 2022 Task 2: Leveraging Glosses and Translations for Multilingual Idiomaticity Detection [4.66831886752751]
We describe the University of Alberta systems for the SemEval-2022 Task 2 on multilingual idiomaticity detection.
Under the assumption that idiomatic expressions are noncompositional, our first method integrates information on the meanings of the individual words of an expression into a binary classifier.
Our second method translates an expression in context, and uses a lexical knowledge base to determine if the translation is literal.
arXiv Detail & Related papers (2022-05-27T16:35:00Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of the meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
The results suggest that, in the presence of ambiguity, speakers compensate by making contexts more informative; the entropy measure is sketched after this entry.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
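The entropy measure above translates directly into code. The sense probabilities below are hand-picked toy numbers; the paper estimates such distributions from corpus data.

```python
# Sketch: lexical ambiguity as the entropy of a word's meaning distribution,
# H(M|w) = -sum_m p(m|w) * log2 p(m|w). The sense probabilities here are made
# up; the paper estimates them from data rather than stipulating them.
import math

def meaning_entropy(sense_probs: list[float]) -> float:
    assert abs(sum(sense_probs) - 1.0) < 1e-9
    return -sum(p * math.log2(p) for p in sense_probs if p > 0)

# "bank": financial institution vs. river bank vs. rarer senses (toy numbers)
print(meaning_entropy([0.6, 0.3, 0.1]))  # ~1.30 bits: fairly ambiguous
print(meaning_entropy([0.98, 0.02]))     # ~0.14 bits: nearly unambiguous
```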
- MICE: Mining Idioms with Contextual Embeddings [0.0]
Idiomatic expressions can be problematic for natural language processing applications.
We present an approach that uses contextual embeddings for detecting them.
We show that deep neural networks using both embeddings perform much better than existing approaches.
arXiv Detail & Related papers (2020-08-13T08:56:40Z)
- EPIE Dataset: A Corpus For Possible Idiomatic Expressions [11.891511657648941]
We present our English Possible Idiomatic Expressions (EPIE) corpus containing 25,206 sentences labelled with lexical instances of 717 idiomatic expressions.
We also demonstrate the utility of our dataset by using it to train a sequence labelling module and testing it on three independent datasets, achieving high accuracy, precision, and recall; the usual sequence-labelling framing is sketched after this entry.
arXiv Detail & Related papers (2020-06-16T19:43:30Z)
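Sequence labelling over idiom corpora like EPIE is conventionally framed as BIO tagging. The sketch below shows that framing plus a span decoder; the tag names and example are standard conventions, not necessarily the exact EPIE setup.

```python
# Sketch: idiom-span detection as BIO sequence labelling, the usual framing
# for corpora like EPIE. Tags and example are illustrative conventions only.
tokens = ["He", "kicked",  "the",     "bucket",  "yesterday"]
tags   = ["O",  "B-IDIOM", "I-IDIOM", "I-IDIOM", "O"]

def extract_spans(tokens: list[str], tags: list[str]) -> list[str]:
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new span starts
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag.startswith("I-") and current:
            current.append(tok)           # the current span continues
        else:
            if current:                   # the current span (if any) ends
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

print(extract_spans(tokens, tags))  # ['kicked the bucket']
```

A trained tagger (e.g. a BiLSTM-CRF or transformer token classifier) would predict the tag sequence; the decoder above then recovers the labelled spans.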
- SLAM-Inspired Simultaneous Contextualization and Interpreting for Incremental Conversation Sentences [0.0]
We propose a method to dynamically estimate the context and interpretations of polysemous words in sequential sentences.
By using the SCAIN algorithm, we can sequentially optimize the interdependence between context and word interpretation while obtaining new interpretations online.
arXiv Detail & Related papers (2020-05-29T16:40:27Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
- Word Sense Disambiguation for 158 Languages using Word Embeddings Only [80.79437083582643]
Disambiguation of word senses in context is easy for humans, but a major challenge for automatic approaches.
We present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory.
We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings; a toy version of the induction step follows this entry.
arXiv Detail & Related papers (2020-03-14T14:50:04Z)
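A stripped-down version of inducing a sense inventory from a static embedding space: cluster a target word's nearest neighbours so that each connected cluster acts as one induced sense. The toy vectors, similarity threshold, and union-find clustering below are assumptions standing in for the paper's graph-based induction.

```python
# Sketch: induce word senses by clustering a target word's nearest-neighbour
# graph; each connected component acts as one sense. Toy vectors replace a
# real fastText model, and the similarity threshold is an assumption.
import numpy as np

neighbours = ["money", "loan", "deposit", "river", "shore"]  # neighbours of "bank"
vecs = np.array([
    [0.90, 0.10], [0.85, 0.20], [0.80, 0.15],  # financial cluster (toy)
    [0.10, 0.90], [0.15, 0.85],                # river cluster (toy)
])
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sim = vecs @ vecs.T
THRESHOLD = 0.9

# Union-find over edges whose similarity exceeds the threshold.
parent = list(range(len(neighbours)))
def find(i: int) -> int:
    while parent[i] != i:
        i = parent[i]
    return i

for i in range(len(neighbours)):
    for j in range(i + 1, len(neighbours)):
        if sim[i, j] >= THRESHOLD:
            parent[find(j)] = find(i)

senses: dict[int, list[str]] = {}
for i, word in enumerate(neighbours):
    senses.setdefault(find(i), []).append(word)
print(list(senses.values()))  # [['money', 'loan', 'deposit'], ['river', 'shore']]
```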
This list is automatically generated from the titles and abstracts of the papers on this site.