Czech Dataset for Cross-lingual Subjectivity Classification
- URL: http://arxiv.org/abs/2204.13915v1
- Date: Fri, 29 Apr 2022 07:31:46 GMT
- Title: Czech Dataset for Cross-lingual Subjectivity Classification
- Authors: Pavel Přibáň, Josef Steinberger
- Abstract summary: We introduce a new Czech subjectivity dataset of 10k manually annotated subjective and objective sentences from movie reviews and descriptions.
Two annotators annotated the dataset, reaching a Cohen's kappa inter-annotator agreement of 0.83.
We fine-tune five pre-trained BERT-like models to set a monolingual baseline for the new dataset, achieving 93.56% accuracy.
- Score: 13.70633147306388
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we introduce a new Czech subjectivity dataset of 10k manually
annotated subjective and objective sentences from movie reviews and
descriptions. Our prime motivation is to provide a reliable dataset that can be
used with the existing English dataset as a benchmark to test the ability of
pre-trained multilingual models to transfer knowledge between Czech and English
and vice versa. Two annotators annotated the dataset, reaching a Cohen's
kappa inter-annotator agreement of 0.83. To the best of our knowledge, this
is the first subjectivity dataset for the Czech language. We also created an
additional dataset that consists of 200k automatically labeled sentences. Both
datasets are freely available for research purposes. Furthermore, we fine-tune
five pre-trained BERT-like models to set a monolingual baseline for the new
dataset, achieving 93.56% accuracy. We also fine-tune models on the existing
English dataset, obtaining results on par with the current state of the art.
Finally, we perform zero-shot cross-lingual subjectivity classification between
Czech and English to verify the usability of our dataset as a cross-lingual
benchmark. We compare and discuss the
cross-lingual and monolingual results and the ability of multilingual models to
transfer knowledge between languages.
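For reference, an agreement figure like the 0.83 Cohen's kappa reported above can be computed with scikit-learn. A minimal sketch follows; the two label lists are illustrative stand-ins, not the paper's actual annotations (1 = subjective, 0 = objective).

```python
# A minimal sketch of computing inter-annotator agreement with Cohen's kappa.
# The labels below are made-up stand-ins for the real annotations.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # 1 = subjective, 0 = objective
annotator_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```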
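The monolingual baseline and the zero-shot cross-lingual evaluation can likewise be sketched with the Hugging Face transformers and datasets libraries. Everything concrete below (the mBERT checkpoint, the column names, and the toy Czech/English sentences) is an assumption for illustration, not the paper's exact configuration; the paper fine-tunes five BERT-like models, and any multilingual checkpoint can stand in.

```python
# A minimal sketch, not the paper's exact setup: fine-tune a multilingual
# BERT-like model on source-language subjectivity labels, then evaluate
# zero-shot on the other language. Model and data here are placeholders.
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "bert-base-multilingual-cased"  # assumed stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128,
                     padding="max_length")

# Toy examples; substitute the Czech and English subjectivity datasets here.
train = Dataset.from_dict(
    {"sentence": ["Skvělý film, doporučuji!", "Film měl premiéru v roce 1999."],
     "label": [1, 0]}).map(tokenize, batched=True)
test = Dataset.from_dict(
    {"sentence": ["A wonderful movie!", "The film premiered in 1999."],
     "label": [1, 0]}).map(tokenize, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="subjectivity-baseline",
                           num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=train,
    compute_metrics=accuracy,
)
trainer.train()                # monolingual fine-tuning on the source language
print(trainer.evaluate(test)) # zero-shot evaluation on the target language
```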
Related papers
- CroissantLLM: A Truly Bilingual French-English Language Model [42.03897426049679]
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens.
We pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio.
To assess performance outside of English, we craft a novel benchmark, FrenchBench.
arXiv Detail & Related papers (2024-02-01T17:17:55Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost transfer learning (TL) method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- A Commonsense-Infused Language-Agnostic Learning Framework for Enhancing Prediction of Political Polarity in Multilingual News Headlines [0.0]
We use the method of translation and retrieval to acquire the inferential knowledge in the target language.
We then employ an attention mechanism to emphasise important inferences.
We present a dataset of over 62.6K multilingual news headlines in five European languages annotated with their respective political polarities.
arXiv Detail & Related papers (2022-12-01T06:07:01Z)
- OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves state-of-the-art results.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z)
- CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs [27.574815708395203]
CrossSum is a large-scale cross-lingual summarization dataset comprising 1.68 million article-summary samples in 1,500+ language pairs.
We create CrossSum by aligning parallel articles written in different languages via cross-lingual retrieval from a multilingual abstractive summarization dataset.
We propose a multistage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language.
arXiv Detail & Related papers (2021-12-16T11:40:36Z)
- Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.