Monolingual and Cross-Lingual Acceptability Judgments with the Italian
CoLA corpus
- URL: http://arxiv.org/abs/2109.12053v1
- Date: Fri, 24 Sep 2021 16:18:53 GMT
- Title: Monolingual and Cross-Lingual Acceptability Judgments with the Italian
CoLA corpus
- Authors: Daniela Trotta, Raffaele Guarasci, Elisa Leonardelli, Sara Tonelli
- Abstract summary: We describe the ItaCoLA corpus, containing almost 10,000 sentences with acceptability judgments.
We also present the first cross-lingual experiments, aimed at assessing whether multilingual transformer-based approaches can benefit from using sentences in two languages during fine-tuning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of automated approaches to linguistic acceptability has been
greatly fostered by the availability of the English CoLA corpus, which has also
been included in the widely used GLUE benchmark. However, this kind of research
for languages other than English, as well as the analysis of cross-lingual
approaches, has been hindered by the lack of resources with a comparable size
in other languages. We have therefore developed the ItaCoLA corpus, containing
almost 10,000 sentences with acceptability judgments, which has been created
following the same approach and the same steps as the English one. In this
paper we describe the corpus creation, we detail its content, and we present
the first experiments on this new resource. We compare in-domain and
out-of-domain classification, and perform a specific evaluation of nine
linguistic phenomena. We also present the first cross-lingual experiments,
aimed at assessing whether multilingual transformer-based approaches can benefit
from using sentences in two languages during fine-tuning.
Related papers
- Predictability and Causality in Spanish and English Natural Language Generation [6.817247544942709]
This paper compares causal and non-causal language modeling for English and Spanish.
According to this experiment, Spanish is more predictable than English given a non-causal context.
These insights support further research in NLG in Spanish using bidirectional transformer language models.
arXiv Detail & Related papers (2024-08-26T14:09:28Z) - A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z) - Sentiment Classification of Code-Switched Text using Pre-trained
Multilingual Embeddings and Segmentation [1.290382979353427]
We propose a multi-step natural language processing algorithm for code-switched sentiment analysis.
The proposed algorithm can be expanded for sentiment analysis of multiple languages with limited human expertise.
arXiv Detail & Related papers (2022-10-29T01:52:25Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Evaluating Language Tools for Fifteen EU-official Under-resourced
Languages [0.0]
This article presents the results of the evaluation campaign of language tools available for fifteen EU-official under-resourced languages.
The evaluation was conducted within the MSC ITN CLEOPATRA action that aims at building the cross-lingual event-centric knowledge processing.
arXiv Detail & Related papers (2020-10-23T14:21:03Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z) - Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)