Potential Idiomatic Expression (PIE)-English: Corpus for Classes of
Idioms
- URL: http://arxiv.org/abs/2105.03280v1
- Date: Sun, 25 Apr 2021 13:05:29 GMT
- Title: Potential Idiomatic Expression (PIE)-English: Corpus for Classes of
Idioms
- Authors: Tosin P. Adewumi, Saleha Javed, Roshanak Vadoodi, Aparajita Tripathy,
Konstantina Nikolaidou, Foteini Liwicki and Marcus Liwicki
- Abstract summary: This is the first dataset with classes of idioms beyond the literal and the general idioms classification.
This dataset contains over 20,100 samples with almost 1,200 cases of idioms (with their meanings) from 10 classes (or senses).
- Score: 1.6111818380407035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a fairly large Potential Idiomatic Expression (PIE) dataset for
Natural Language Processing (NLP) in English. The challenges that NLP systems
face with regard to tasks such as Machine Translation (MT), word sense
disambiguation (WSD) and information retrieval make it imperative to have a
labelled idioms dataset with classes, such as the one in this work. To the best of
the authors' knowledge, this is the first idioms corpus with classes of idioms
beyond the literal and the general idioms classification. In particular, the
following classes are labelled in the dataset: metaphor, simile, euphemism,
parallelism, personification, oxymoron, paradox, hyperbole, irony and literal.
Many past efforts have been limited in the corpus size and classes of samples
but this dataset contains over 20,100 samples with almost 1,200 cases of idioms
(with their meanings) from 10 classes (or senses). The corpus may also be
extended by researchers to meet specific needs. The corpus has part of speech
(PoS) tagging from the NLTK library. Classification experiments performed on
the corpus to obtain a baseline and comparison among three common models,
including the BERT model, give good results. We also make publicly available
the corpus and the relevant code for working with it for NLP tasks.
Related papers
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
- Sinhala-English Parallel Word Dictionary Dataset [0.554780083433538]
We introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages.
arXiv Detail & Related papers (2023-08-04T10:21:35Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Nonparametric Masked Language Modeling [113.71921977520864]
Existing language models (LMs) predict tokens with a softmax over a finite vocabulary.
We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus.
NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval.
arXiv Detail & Related papers (2022-12-02T18:10:42Z)
- DICTDIS: Dictionary Constrained Disambiguation for Improved NMT [50.888881348723295]
We present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries.
We demonstrate the utility of DictDis via extensive experiments on English-Hindi and English-German sentences in a variety of domains, including regulatory, finance, and engineering.
arXiv Detail & Related papers (2022-10-13T13:04:16Z)
- Vector Representations of Idioms in Conversational Systems [1.6507910904669727]
We utilize the Potential Idiomatic Expression (PIE)-English idioms corpus for the two tasks that we investigate.
We achieve a state-of-the-art (SoTA) result of 98% macro F1 score on the classification task by using the SoTA T5 model.
The results show that the model trained on the idiom corpus generates more fitting responses to prompts containing idioms 71.9% of the time.
arXiv Detail & Related papers (2022-05-07T14:50:05Z)
- Cross-Lingual Phrase Retrieval [49.919180978902915]
Cross-lingual retrieval aims to retrieve relevant text across languages.
Current methods typically achieve cross-lingual retrieval by learning language-agnostic text representations in word or sentence level.
We propose XPR, a cross-lingual phrase retriever that extracts phrase representations from unlabeled example sentences.
arXiv Detail & Related papers (2022-04-19T13:35:50Z)
- Cross-lingual Transfer for Text Classification with Dictionary-based Heterogeneous Graph [10.64488240379972]
In cross-lingual text classification, it is required that task-specific training data in high-resource source languages are available.
Collecting such training data can be infeasible because of the labeling cost, task characteristics, and privacy concerns.
This paper proposes an alternative solution that uses only task-independent word embeddings of high-resource languages and bilingual dictionaries.
arXiv Detail & Related papers (2021-09-09T16:40:40Z)
- CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data [13.224233182417636]
This paper presents the first English dataset for continuous lexical complexity prediction.
We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts.
arXiv Detail & Related papers (2020-03-16T03:54:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.