Czech News Dataset for Semantic Textual Similarity
- URL: http://arxiv.org/abs/2108.08708v2
- Date: Mon, 23 Aug 2021 07:12:06 GMT
- Title: Czech News Dataset for Semantic Textual Similarity
- Authors: Jakub Sido, Michal Sej\'ak, Ond\v{r}ej Pra\v{z}\'ak, Miloslav
Konop\'ik, V\'aclav Moravec
- Abstract summary: This paper describes a novel dataset consisting of sentences with semantic similarity annotations.
The data originate from the journalistic domain in the Czech language.
The dataset contains 138,556 human annotations divided into train and test sets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper describes a novel dataset consisting of sentences with semantic
similarity annotations. The data originate from the journalistic domain in the
Czech language. We describe the process of collecting and annotating the data
in detail. The dataset contains 138,556 human annotations divided into train
and test sets. In total, 485 journalism students participated in the creation
process. To increase the reliability of the test set, we compute the annotation
as an average of 9 individual annotations. We evaluate the quality of the
dataset by measuring inter and intra annotation annotators' agreements. Beside
agreement numbers, we provide detailed statistics of the collected dataset. We
conclude our paper with a baseline experiment of building a system for
predicting the semantic similarity of sentences. Due to the massive number of
training annotations (116 956), the model can perform significantly better than
an average annotator (0,92 versus 0,86 of Person's correlation coefficients).
Related papers
- Creation of the Estonian Subjectivity Dataset: Assessing the Degree of Subjectivity on a Scale [1.6058099298620423]
The dataset comprises of 1,000 documents-300 journalistic articles and 700 randomly selected web texts.<n>The dataset includes scores generated by GPT-5 as an experiment on annotation automation.
arXiv Detail & Related papers (2025-12-10T13:22:16Z) - Czech Dataset for Complex Aspect-Based Sentiment Analysis Tasks [0.7874708385247352]
This paper introduces a novel dataset for aspect-based sentiment analysis (ABSA)<n>It consists of 3.1K manually annotated reviews from the restaurant domain.<n>We provide 24M reviews without annotations suitable for unsupervised learning.
arXiv Detail & Related papers (2025-08-11T16:03:28Z) - Annotator in the Loop: A Case Study of In-Depth Rater Engagement to Create a Bridging Benchmark Dataset [1.825224193230824]
We describe a novel, collaborative, and iterative annotator-in-the-loop methodology for annotation.
Our findings indicate that collaborative engagement with annotators can enhance annotation methods.
arXiv Detail & Related papers (2024-08-01T19:11:08Z) - Paloma: A Benchmark for Evaluating Language Model Fit [114.63031978259467]
Language Model Assessment (Paloma) measures fit to 585 text domains.
We populate our benchmark with results from baselines pretrained on popular corpora.
arXiv Detail & Related papers (2023-12-16T19:12:45Z) - LyricSIM: A novel Dataset and Benchmark for Similarity Detection in
Spanish Song LyricS [52.77024349608834]
We present a new dataset and benchmark tailored to the task of semantic similarity in song lyrics.
Our dataset, originally consisting of 2775 pairs of Spanish songs, was annotated in a collective annotation experiment by 63 native annotators.
arXiv Detail & Related papers (2023-06-02T07:48:20Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - Relational Sentence Embedding for Flexible Semantic Matching [86.21393054423355]
We present Sentence Embedding (RSE), a new paradigm to discover further the potential of sentence embeddings.
RSE is effective and flexible in modeling sentence relations and outperforms a series of state-of-the-art embedding methods.
arXiv Detail & Related papers (2022-12-17T05:25:17Z) - Rooms with Text: A Dataset for Overlaying Text Detection [0.18275108630751835]
We introduce a new dataset of room interior pictures with overlaying and scene text, totalling to 4836 annotated images in 25 product categories.
We propose a baseline method for overlaying text detection, that leverages the character region-aware text detection framework to guide the classification model.
arXiv Detail & Related papers (2022-11-21T11:04:41Z) - PART: Pre-trained Authorship Representation Transformer [52.623051272843426]
Authors writing documents imprint identifying information within their texts.<n>Previous works use hand-crafted features or classification tasks to train their authorship models.<n>We propose a contrastively trained model fit to learn textbfauthorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - Czech Dataset for Cross-lingual Subjectivity Classification [13.70633147306388]
We introduce a new Czech subjectivity dataset of 10k manually annotated subjective and objective sentences from movie reviews and descriptions.
Two annotators annotated the dataset reaching 0.83 of the Cohen's kappa inter-annotator agreement.
We fine-tune five pre-trained BERT-like models to set a monolingual baseline for the new dataset and we achieve 93.56% of accuracy.
arXiv Detail & Related papers (2022-04-29T07:31:46Z) - What Makes Sentences Semantically Related: A Textual Relatedness Dataset
and Empirical Study [31.062129406113588]
We introduce a dataset for Semantic Textual Relatedness, STR-2022, that has 5,500 English sentence pairs manually annotated.
We show that human intuition regarding relatedness of sentence pairs is highly reliable, with a repeat annotation correlation of 0.84.
We also show the utility of STR-2022 for evaluating automatic methods of sentence representation and for various downstream NLP tasks.
arXiv Detail & Related papers (2021-10-10T16:23:54Z) - Few-shot learning through contextual data augmentation [74.20290390065475]
Machine translation models need to adapt to new data to maintain their performance over time.
We show that adaptation on the scale of one to five examples is possible.
Our model reports better accuracy scores than a reference system trained with on average 313 parallel examples.
arXiv Detail & Related papers (2021-03-31T09:05:43Z) - Comparative analysis of word embeddings in assessing semantic similarity
of complex sentences [8.873705500708196]
We study the sentences in existing benchmark datasets and analyze the sensitivity of various word embeddings with respect to the complexity of the sentences.
The results show the increase in complexity of the sentences has a significant impact on the performance of the embedding models.
arXiv Detail & Related papers (2020-10-23T19:55:11Z) - Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1)
arXiv Detail & Related papers (2020-04-19T08:27:57Z) - Learning to Summarize Passages: Mining Passage-Summary Pairs from
Wikipedia Revision Histories [110.54963847339775]
We propose a method for automatically constructing a passage-to-summary dataset by mining the Wikipedia page revision histories.
In particular, the method mines the main body passages and the introduction sentences which are added to the pages simultaneously.
The constructed dataset contains more than one hundred thousand passage-summary pairs.
arXiv Detail & Related papers (2020-04-06T12:11:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.