WikiNER-fr-gold: A Gold-Standard NER Corpus
- URL: http://arxiv.org/abs/2411.00030v1
- Date: Tue, 29 Oct 2024 08:00:16 GMT
- Title: WikiNER-fr-gold: A Gold-Standard NER Corpus
- Authors: Danrun Cao, Nicolas Béchet, Pierre-François Marteau
- Abstract summary: We address the quality of the WikiNER corpus, a multilingual Named Entity Recognition corpus, and provide a consolidated version of it.
We propose WikiNER-fr-gold, a revised version of the French portion of WikiNER.
We present an analysis of the errors and inconsistencies observed in the WikiNER-fr corpus, and we discuss potential directions for future work.
- Score: 1.7205106391379026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this article we address the quality of the WikiNER corpus, a multilingual Named Entity Recognition corpus, and provide a consolidated version of it. The annotation of WikiNER was produced in a semi-supervised manner, i.e., no manual verification was carried out a posteriori. Such a corpus is called silver-standard. In this paper we propose WikiNER-fr-gold, a revised version of the French portion of WikiNER. Our corpus consists of a random 20% sample of the original French sub-corpus (26,818 sentences with 700k tokens). We start by summarizing the entity types included in each category in order to define an annotation guideline, and then we proceed to revise the corpus. Finally, we present an analysis of the errors and inconsistencies observed in the WikiNER-fr corpus and discuss potential directions for future work.
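The 20% random sample described in the abstract could be drawn along the following lines. This is a minimal sketch, not the authors' procedure: the function name `sample_sentences`, the fixed seed, and the toy corpus are all assumptions made for illustration, since the abstract does not say how the sampling was implemented.

```python
import random


def sample_sentences(sentences, fraction=0.2, seed=0):
    """Return a reproducible random subset holding about `fraction` of the sentences.

    Hypothetical helper: the name, the seed, and the rounding rule are assumptions.
    """
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    k = round(len(sentences) * fraction)
    # Draw k distinct positions, then sort them to preserve the original corpus order.
    indices = sorted(rng.sample(range(len(sentences)), k))
    return [sentences[i] for i in indices]


# Toy stand-in for a sentence-segmented corpus (not real WikiNER data).
corpus = [f"sentence {i}" for i in range(1000)]
subset = sample_sentences(corpus)
print(len(subset))
```

Fixing the seed makes the subset reproducible, which matters when a revised sample has to be aligned with the original silver-standard annotations later on.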
Related papers
- Dialectal and Low Resource Machine Translation for Aromanian [44.99833362998488]
We present a neural machine translation system that can translate between Romanian, English, and Aromanian.
BLEU scores range from 17 to 32 depending on the direction and genre of the text.
We release the biggest known Aromanian-Romanian bilingual corpus, consisting of 79k cleaned sentence pairs.
arXiv Detail & Related papers (2024-10-23T10:00:23Z)
- People and Places of Historical Europe: Bootstrapping Annotation Pipeline and a New Corpus of Named Entities in Late Medieval Texts [0.0]
We develop a new NER corpus of 3.6M sentences from late medieval charters written mainly in Czech, Latin, and German.
We show that we can start with a list of known historical figures and locations and an unannotated corpus of historical texts, and use information retrieval techniques to automatically bootstrap a NER-annotated corpus.
arXiv Detail & Related papers (2023-05-26T08:05:01Z)
- Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information [0.629199190108771]
Carolina is a large open corpus of Brazilian Portuguese texts under construction using web-as-corpus methodology.
Carolina's first public version has 653,322,577 tokens, distributed over 7 broad types.
arXiv Detail & Related papers (2023-03-28T16:09:40Z)
- FreCDo: A Large Corpus for French Cross-Domain Dialect Identification [22.132457694021184]
We present a novel corpus for French dialect identification comprising 413,522 French text samples.
The training, validation and test splits are collected from different news websites.
This leads to a French cross-domain (FreCDo) dialect identification task.
arXiv Detail & Related papers (2022-12-15T10:32:29Z)
- A Bilingual Parallel Corpus with Discourse Annotations [82.07304301996562]
This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
arXiv Detail & Related papers (2022-10-26T12:33:53Z)
- Longtonotes: OntoNotes with Longer Coreference Chains [111.73115731999793]
We build a corpus of coreference-annotated documents of significantly longer length than what is currently available.
The resulting corpus, which we call LongtoNotes, contains documents in multiple genres of the English language with varying lengths.
We evaluate state-of-the-art neural coreference systems on this new corpus.
arXiv Detail & Related papers (2022-10-07T15:58:41Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- RuCoCo: a new Russian corpus with coreference annotation [69.3939291118954]
We present a new corpus with coreference annotation, the Russian Coreference Corpus (RuCoCo).
RuCoCo contains news texts in Russian, part of which were annotated from scratch, and for the rest the machine-generated annotations were refined by human annotators.
The size of our corpus is one million words and around 150,000 mentions.
arXiv Detail & Related papers (2022-06-10T07:50:09Z)
- The Open corpus of the Veps and Karelian languages: overview and applications [52.77024349608834]
The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps Corpus created in 2009.
The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced system of search.
Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
arXiv Detail & Related papers (2022-06-08T13:05:50Z)
- Wojood: Nested Arabic Named Entity Corpus and Recognition using BERT [1.2891210250935146]
Wojood consists of 550K Modern Standard Arabic (MSA) and dialect tokens that are manually annotated with 21 entity types.
The data contains about 75K entities, 22.5% of which are nested.
Our corpus, the annotation guidelines, the source code and the pre-trained model are publicly available.
arXiv Detail & Related papers (2022-05-19T16:06:49Z)
- Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus [0.04915744683251149]
The Amharic corpus is partly a web corpus.
Texts are collected from 25,199 documents from different domains.
About 24 million orthographic words are tokenized.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets [94.37521840642141]
We present the full-size Russian complexly NER-labeled corpus of Internet user reviews.
A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
arXiv Detail & Related papers (2021-04-30T19:46:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.