Related papers: ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships

URL: http://arxiv.org/abs/2503.08803v1
Date: Tue, 11 Mar 2025 18:32:16 GMT
Title: ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships
Authors: Johan R. Portela, Nicolás Perez, Rubén Manrique
Abstract summary: Natural Language Inference (NLI) serves as a crucial area within the domain of Natural Language Processing (NLP). This paper focuses on generating a multi-genre Spanish dataset for NLI, ESNLIR, particularly accounting for causal relationships. The findings signify that the enrichment of genres essentially contributes to the enrichment of the model's capability to generalize.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), serves as a crucial area within the domain of Natural Language Processing (NLP). This area fundamentally empowers machines to discern semantic relationships between assorted sections of text. Even though considerable work has been done for the English language, efforts for the Spanish language are relatively sparse. With this in view, this paper focuses on generating a multi-genre Spanish dataset for NLI, ESNLIR, particularly accounting for causal relationships. A preliminary baseline has been designed and evaluated, leveraging models drawn from the BERT family. The findings signify that the enrichment of genres essentially contributes to the enrichment of the model's capability to generalize. The code, notebooks, and complete datasets for these experiments are available at: https://zenodo.org/records/15002575. If you are interested only in the dataset, you can find it here: https://zenodo.org/records/15002371.

Related papers

The taggedPBC: Annotating a massive parallel corpus for crosslinguistic investigations [0.0]
The taggedPBC contains more than 1,800 sentences of pos-tagged parallel text data from over 1,500 languages. The accuracy of tags in this dataset is shown to correlate well with both existing SOTA taggers for high-resource languages. A novel measure derived from this dataset, the N1 ratio, correlates with expert determinations of word order in three typological databases.
arXiv Detail & Related papers (2025-05-18T22:13:32Z)
BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages [93.92804151830744]
We present BRIGHTER -- a collection of multi-labeled datasets in 28 different languages. We describe the data collection and annotation processes and the challenges of building these datasets. We show that BRIGHTER datasets are a step towards bridging the gap in text-based emotion recognition.
arXiv Detail & Related papers (2025-02-17T15:39:50Z)
MASIVE: Open-Ended Affective State Identification in English and Spanish [10.41502827362741]
In this work, we broaden our scope to a practically unbounded set of affective states, which includes any terms that humans use to describe their experiences of feeling. We collect and publish MASIVE, a dataset of Reddit posts in English and Spanish containing over 1,000 unique affective states each. On this task, we find that smaller finetuned multilingual models outperform much larger LLMs, even on region-specific Spanish affective states.
arXiv Detail & Related papers (2024-07-16T21:43:47Z)
A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding. There is no publicly available NLI corpus for the Romanian language. We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages [44.017657230247934]
We present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences.
arXiv Detail & Related papers (2024-02-13T18:04:53Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Improving Natural Language Inference in Arabic using Transformer Models and Linguistically Informed Pre-Training [0.34998703934432673]
This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP). To overcome this limitation, we create a dedicated data set from publicly available resources. We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches.
arXiv Detail & Related papers (2023-07-27T07:40:11Z)
Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data. We design a simple but effective ensemble-based framework that combines various transfer learning techniques. We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings. Our model operates on parallel data in $N$ languages. We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers [0.0]
This study covers NLP models ranging from classical (Bag-of-Words) to state-of-the-art (Transformer-based). It aims to provide a comprehensive experimental study of embedding approaches targeting a binary sentiment classification of user reviews in Brazilian Portuguese.
arXiv Detail & Related papers (2022-12-01T15:24:19Z)
Dataset Geography: Mapping Language Data to Language Users [17.30955185832338]
We study the geographical representativeness of NLP datasets, aiming to quantify whether and by how much NLP datasets match the expected needs of the language speakers. In doing so, we use entity recognition and linking systems, also making important observations about their cross-lingual consistency. Last, we explore some geographical and economic factors that may explain the observed dataset distributions.
arXiv Detail & Related papers (2021-12-07T05:13:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.