Turkish Delights: a Dataset on Turkish Euphemisms
- URL: http://arxiv.org/abs/2407.13040v1
- Date: Wed, 17 Jul 2024 22:13:42 GMT
- Title: Turkish Delights: a Dataset on Turkish Euphemisms
- Authors: Hasan Can Biyik, Patrick Lee, Anna Feldman
- Abstract summary: This research extends the current computational work on potentially euphemistic terms (PETs) to Turkish.
We introduce the Turkish PET dataset, the first available of its kind in the field.
We provide both euphemistic and non-euphemistic examples of PETs in Turkish.
- Score: 1.7614751781649955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Euphemisms are a form of figurative language relatively understudied in natural language processing. This research extends the current computational work on potentially euphemistic terms (PETs) to Turkish. We introduce the Turkish PET dataset, the first available of its kind in the field. By creating a list of euphemisms in Turkish, collecting example contexts, and annotating them, we provide both euphemistic and non-euphemistic examples of PETs in Turkish. We describe the dataset and methodologies, and also experiment with transformer-based models on Turkish euphemism detection by using our dataset for binary classification. We compare performances across models using F1, accuracy, and precision as evaluation metrics.
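The abstract describes a standard binary-classification setup: fine-tune a transformer on euphemistic vs. non-euphemistic contexts and score it with F1, accuracy, and precision. The sketch below is a minimal illustration of that setup, not the authors' released code; the BERTurk checkpoint name "dbmdz/bert-base-turkish-cased" and the CSV files with "text" and "label" columns (1 = euphemistic, 0 = non-euphemistic) are assumptions.
```python
# Hedged sketch of binary euphemism detection with a Turkish BERT model.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score, precision_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "dbmdz/bert-base-turkish-cased"  # assumed BERTurk-style checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Assumed data layout: train.csv / test.csv with "text" and "label" columns.
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

def compute_metrics(eval_pred):
    # Report the three metrics named in the abstract: F1, accuracy, precision.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds),
            "accuracy": accuracy_score(labels, preds),
            "precision": precision_score(labels, preds)}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pet-tr", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # metrics on the held-out split
```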
Related papers
- Investigating Gender Bias in Turkish Language Models [3.100560442806189]
We investigate the significance of gender bias in Turkish language models.
We build upon existing bias evaluation frameworks and extend them to the Turkish language.
Specifically, we evaluate Turkish language models for their embedded ethnic bias toward Kurdish people.
arXiv Detail & Related papers (2024-04-17T20:24:41Z) - Cross-Lingual Learning vs. Low-Resource Fine-Tuning: A Case Study with Fact-Checking in Turkish [0.9217021281095907]
We introduce the FCTR dataset, consisting of 3238 real-world claims.
This dataset spans multiple domains and incorporates evidence collected from three Turkish fact-checking organizations.
arXiv Detail & Related papers (2024-03-01T09:57:46Z) - Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks [0.0]
We provide a Transformer-based model and a baseline benchmark for the Turkish language.
We successfully fine-tuned a Turkish BERT model, namely BERTurk, on many downstream tasks and evaluated it with the Turkish benchmark dataset.
arXiv Detail & Related papers (2024-01-30T19:27:04Z) - Semantic Change Detection for the Romanian Language [0.5202524136984541]
We analyze different strategies to create static and contextual word embedding models on real-world datasets.
We first evaluate both word embedding models on an English dataset (SEMEVAL-CCOHA) and then on a Romanian dataset.
The experimental results show that, depending on the corpus, the most important factors to consider are the choice of model and the distance to calculate a score for detecting semantic change.
arXiv Detail & Related papers (2023-08-23T13:37:02Z) - FEED PETs: Further Experimentation and Expansion on the Disambiguation of Potentially Euphemistic Terms [3.1648534725322666]
We present novel euphemism corpora in three different languages: Yoruba, Spanish, and Mandarin Chinese.
We perform euphemism disambiguation experiments in each language using the multilingual transformer models mBERT and XLM-RoBERTa.
We find that transformers are generally better at classifying vague PETs.
arXiv Detail & Related papers (2023-05-31T22:23:20Z) - Characterizing and Measuring Linguistic Dataset Drift [65.28821163863665]
We propose three dimensions of linguistic dataset drift: vocabulary, structural, and semantic drift.
These dimensions correspond to content word frequency divergences, syntactic divergences, and meaning changes not captured by word frequencies.
We find that our drift metrics are more effective than previous metrics at predicting out-of-domain model accuracies.
arXiv Detail & Related papers (2023-05-26T17:50:51Z) - Retrieval-based Disentangled Representation Learning with Natural
Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning.
Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish the intrinsic dimensions that capture characteristics within the data through their natural language counterparts, thus achieving disentanglement.
arXiv Detail & Related papers (2022-12-15T10:20:42Z) - Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not well-represent natural language semantics.
arXiv Detail & Related papers (2022-10-14T02:35:19Z) - Automatically Identifying Semantic Bias in Crowdsourced Natural Language Inference Datasets [78.6856732729301]
We introduce a model-driven, unsupervised technique to find "bias clusters" in a learned embedding space of hypotheses in NLI datasets.
Interventions and additional rounds of labeling can then be performed to ameliorate the semantic bias of a dataset's hypothesis distribution.
arXiv Detail & Related papers (2021-12-16T22:49:01Z) - Did the Cat Drink the Coffee? Challenging Transformers with Generalized Event Knowledge [59.22170796793179]
Transformer Language Models (TLMs) were tested on a benchmark for the dynamic estimation of thematic fit.
Our results show that TLMs can reach performances that are comparable to those achieved by SDM.
However, additional analysis consistently suggests that TLMs do not capture important aspects of event knowledge.
arXiv Detail & Related papers (2021-07-22T20:52:26Z) - BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models [51.53936551681613]
We show that fine-tuning only the bias terms (or a subset of the bias terms) of pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model.
These results support the hypothesis that fine-tuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge. A minimal sketch of this bias-only recipe appears after this list.
arXiv Detail & Related papers (2021-06-18T16:09:21Z)
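As referenced in the BitFit entry above, the recipe is simply to freeze every pre-trained weight except the bias terms and train only those (plus the freshly initialized task head). The snippet below is a minimal sketch of that idea under assumed names; the checkpoint "bert-base-cased" and the "classifier." head prefix are stand-ins, not details from the paper.
```python
# Hedged sketch of bias-only (BitFit-style) fine-tuning for a BERT classifier.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased",
                                                           num_labels=2)

trainable = 0
for name, param in model.named_parameters():
    # Keep bias terms and the new classification head trainable; freeze the rest.
    if name.endswith(".bias") or name.startswith("classifier."):
        param.requires_grad = True
        trainable += param.numel()
    else:
        param.requires_grad = False

total = sum(p.numel() for p in model.parameters())
print(f"training {trainable:,} of {total:,} parameters "
      f"({100 * trainable / total:.2f}%)")
# The model can now be handed to a standard optimizer or Trainer loop; only the
# unfrozen bias and head parameters receive gradient updates.
```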