Annotation Guidelines for the Turku Paraphrase Corpus
- URL: http://arxiv.org/abs/2108.07499v2
- Date: Thu, 19 Aug 2021 06:23:31 GMT
- Title: Annotation Guidelines for the Turku Paraphrase Corpus
- Authors: Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri
Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Aurora Piirto, Jenna
Saarni, Maija Sevón, Otto Tarkka
- Abstract summary: This document describes the annotation guidelines used to construct the Turku Paraphrase Corpus.
Our paraphrase annotation scheme uses the base scale 1-4, where labels 1 and 2 are used for negative candidates (not paraphrases).
In addition to base labeling, the scheme is enriched with additional subcategories (flags) for categorizing different types of paraphrases inside the two positive labels.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This document describes the annotation guidelines used to construct the Turku
Paraphrase Corpus. These guidelines were developed together with the corpus
annotation, revising and extending the guidelines regularly during the
annotation work. Our paraphrase annotation scheme uses the base scale 1-4,
where labels 1 and 2 are used for negative candidates (not paraphrases), while
labels 3 and 4 are paraphrases at least in the given context if not everywhere.
In addition to base labeling, the scheme is enriched with additional
subcategories (flags) for categorizing different types of paraphrases inside
the two positive labels, making the annotation scheme suitable for more
fine-grained paraphrase categorization. The annotation scheme is used to
annotate over 100,000 Finnish paraphrase pairs.
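The labeling scheme described in the abstract can be illustrated with a small data structure. The following is a minimal sketch, not the corpus authors' tooling: the class name `ParaphraseAnnotation`, the field names, and the placeholder flag `"flag_a"` are all hypothetical, since the abstract does not name the actual flags. It assumes only what the abstract states: base labels run from 1 to 4, labels 1 and 2 are negative, labels 3 and 4 are positive, and flags subcategorize the positive labels.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Label groups as stated in the abstract.
NEGATIVE_LABELS = frozenset({1, 2})  # not paraphrases
POSITIVE_LABELS = frozenset({3, 4})  # paraphrases, at least in the given context


@dataclass(frozen=True)
class ParaphraseAnnotation:
    """Hypothetical record for one annotated candidate pair."""

    base_label: int
    # Flags are kept as opaque strings; the real flag inventory is not
    # given in the abstract, so none is assumed here.
    flags: FrozenSet[str] = frozenset()

    def __post_init__(self) -> None:
        if self.base_label not in NEGATIVE_LABELS | POSITIVE_LABELS:
            raise ValueError(f"base label must be 1-4, got {self.base_label}")
        # Assumption drawn from the abstract: flags only refine positive labels.
        if self.flags and self.base_label not in POSITIVE_LABELS:
            raise ValueError("flags are only allowed on positive labels (3 or 4)")

    @property
    def is_paraphrase(self) -> bool:
        """True for the positive labels 3 and 4."""
        return self.base_label in POSITIVE_LABELS


# Usage: a positive pair with a placeholder flag, and a negative pair.
positive = ParaphraseAnnotation(4, frozenset({"flag_a"}))
negative = ParaphraseAnnotation(2)
```

Rejecting flags on labels 1 and 2 at construction time is a design choice of this sketch; it encodes the statement that flags categorize paraphrases inside the positive labels, and the actual guidelines may be more specific.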