OYXOY: A Modern NLP Test Suite for Modern Greek
- URL: http://arxiv.org/abs/2309.07009v2
- Date: Fri, 26 Jan 2024 16:45:23 GMT
- Title: OYXOY: A Modern NLP Test Suite for Modern Greek
- Authors: Konstantinos Kogkalidis, Stergios Chatzikyriakidis, Eirini
Chrysovalantou Giannikouri, Vassiliki Katsouli, Christina Klironomou,
Christina Koula, Dimitris Papadakis, Thekla Pasparaki, Erofili Psaltaki,
Efthymia Sakellariou, Hara Soupiona
- Abstract summary: This paper serves as a foundational step towards the development of a linguistically motivated evaluation suite for Greek NLP.
We introduce four expert-verified evaluation tasks, specifically targeted at natural language inference, word sense disambiguation (through example comparison or sense selection) and metaphor detection.
More than language-adapted replicas of existing tasks, we contribute two innovations which will resonate with the broader resource and evaluation community.
- Score: 2.059776592203642
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper serves as a foundational step towards the development of a
linguistically motivated and technically relevant evaluation suite for Greek
NLP. We initiate this endeavor by introducing four expert-verified evaluation
tasks, specifically targeted at natural language inference, word sense
disambiguation (through example comparison or sense selection) and metaphor
detection. More than language-adapted replicas of existing tasks, we contribute
two innovations which will resonate with the broader resource and evaluation
community. Firstly, our inference dataset is the first of its kind, marking not
just one, but rather all possible inference labels,
accounting for possible shifts due to e.g. ambiguity or polysemy. Secondly, we
demonstrate a cost-efficient method to obtain datasets for under-resourced
languages. Using ChatGPT as a language-neutral parser, we transform the
Dictionary of Standard Modern Greek into a structured format, from which we
derive the other three tasks through simple projections. Alongside each task,
we conduct experiments using currently available state-of-the-art machinery.
Our experimental baselines affirm the challenging nature of our tasks and
highlight the need for expedited progress in order for the Greek NLP ecosystem
to keep pace with contemporary mainstream research.
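The first innovation described above, marking not one but all admissible inference labels, amounts to a multi-label annotation scheme. A minimal sketch of what such a record might look like follows (the field names and the admits helper are illustrative assumptions, not the paper's actual schema):

```python
from dataclasses import dataclass, field

# Hypothetical multi-label NLI record: rather than a single gold label,
# every defensible label is kept, since ambiguity or polysemy in the
# premise/hypothesis pair can license more than one reading.
@dataclass
class NLIExample:
    premise: str
    hypothesis: str
    # Subset of {"entailment", "contradiction", "neutral"}.
    labels: set[str] = field(default_factory=set)

    def admits(self, label: str) -> bool:
        """True if at least one reading of the pair supports this label."""
        return label in self.labels

# An ambiguous pair can legitimately carry two labels at once:
# if "drew" means "pulled out", the hypothesis follows; if it means
# "sketched", it does not.
example = NLIExample(
    premise="She drew a gun.",
    hypothesis="She produced a weapon.",
    labels={"entailment", "neutral"},  # one label per available reading
)
assert example.admits("entailment") and example.admits("neutral")
```

Evaluation against such records can then score a model on the full label set rather than on a single forced choice.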
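The second innovation, parsing the Dictionary of Standard Modern Greek into a structured format and deriving tasks "through simple projections", can be pictured as plain transformations over dictionary entries. The sketch below is only an assumption about the general shape of such projections (the entry layout and function names are hypothetical, not the paper's actual pipeline):

```python
from itertools import combinations

# Hypothetical structured dictionary entry, of the kind an LLM-based
# parse might recover from a raw dictionary article: one lemma with
# several senses, each carrying example sentences and a figurative flag.
entry = {
    "lemma": "...",
    "senses": [
        {"gloss": "...", "examples": ["...", "..."], "figurative": False},
        {"gloss": "...", "examples": ["..."], "figurative": True},
    ],
}

def sense_selection(entry):
    """WSD as sense selection: pair each example with the full gloss
    inventory and the index of the correct sense."""
    glosses = [s["gloss"] for s in entry["senses"]]
    return [(ex, glosses, i)
            for i, s in enumerate(entry["senses"])
            for ex in s["examples"]]

def example_comparison(entry):
    """WSD as example comparison: do two usages of the lemma share a sense?"""
    usages = [(ex, i) for i, s in enumerate(entry["senses"]) for ex in s["examples"]]
    return [(a, b, i == j) for (a, i), (b, j) in combinations(usages, 2)]

def metaphor_detection(entry):
    """Metaphor detection: each example inherits its sense's figurative flag."""
    return [(ex, s["figurative"]) for s in entry["senses"] for ex in s["examples"]]
```

Each projection reads a labelled dataset directly off the parsed entries, which is what makes the approach cheap for under-resourced languages.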
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
Until now, there has been no publicly available NLI corpus for Romanian.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval [50.882816288076725]
Cross-lingual information retrieval is the task of searching documents in one language with queries in another.
We provide a conceptual framework for organizing different approaches to cross-lingual retrieval using multi-stage architectures for mono-lingual retrieval as a scaffold.
We implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for test collections from the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese.
arXiv Detail & Related papers (2023-04-03T14:17:00Z)
- An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP.
We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z)
- Multi-granular Legal Topic Classification on Greek Legislation [4.09134848993518]
We study the task of classifying legal texts written in the Greek language.
This is the first time the task of Greek legal text classification is considered in an open research project.
arXiv Detail & Related papers (2021-09-30T17:43:00Z)
- The Rediscovery Hypothesis: Language Models Need to Meet Linguistics [8.293055016429863]
We study whether linguistic knowledge is a necessary condition for good performance of modern language models.
We show that language models that are significantly compressed but perform well on their pretraining objectives retain good scores when probed for linguistic structures.
This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates language modeling objective with linguistic information.
arXiv Detail & Related papers (2021-03-02T15:57:39Z)
- Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- GREEK-BERT: The Greeks visiting Sesame Street [25.406207104603027]
Transformer-based language models, such as BERT, have achieved state-of-the-art performance in several downstream natural language processing tasks.
We present GREEK-BERT, a monolingual BERT-based language model for modern Greek.
arXiv Detail & Related papers (2020-08-27T09:36:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.