Establishing a New State-of-the-Art for French Named Entity Recognition
- URL: http://arxiv.org/abs/2005.13236v1
- Date: Wed, 27 May 2020 08:44:09 GMT
- Title: Establishing a New State-of-the-Art for French Named Entity Recognition
- Authors: Pedro Javier Ortiz Suárez (ALMAnaCH, SU), Yoann Dupont (ALMAnaCH,
SU), Benjamin Muller (ALMAnaCH, SU), Laurent Romary (ALMAnaCH), Benoît
Sagot (ALMAnaCH)
- Abstract summary: The French TreeBank is the main source of morphosyntactic and syntactic annotations for French.
It does not include explicit information related to named entities, which are among the most useful pieces of information for several natural language processing tasks and applications.
We have manually annotated the French TreeBank with such information, after an automatic pre-annotation step.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The French TreeBank developed at the University Paris 7 is the main source of
morphosyntactic and syntactic annotations for French. However, it does not
include explicit information related to named entities, which are among the
most useful pieces of information for several natural language processing tasks and
applications. Moreover, no large-scale French corpus with named entity
annotations contains referential information, which complements the type and the
span of each mention with an indication of the entity it refers to. We have
manually annotated the French TreeBank with such information, after an
automatic pre-annotation step. We sketch the underlying annotation guidelines
and we provide a few figures about the resulting annotations.
Related papers
- FASSILA: A Corpus for Algerian Dialect Fake News Detection and Sentiment Analysis [0.0]
The Algerian dialect (AD) faces challenges due to the absence of annotated corpora.
This study outlines the development process of a specialized corpus for Fake News (FN) detection and sentiment analysis (SA) in AD called FASSILA.
arXiv Detail & Related papers (2024-11-07T10:39:10Z)
- Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations [23.059491714512077]
This paper studies disagreements in expert-annotated named entity datasets for three languages: English, Danish, and Bavarian.
We show that text ambiguity and artificial guideline changes are dominant factors for diverse annotations among high-quality revisions.
arXiv Detail & Related papers (2024-02-02T14:08:34Z)
- FRACAS: A FRench Annotated Corpus of Attribution relations in newS [0.0]
We present a manually annotated corpus of 1676 newswire texts in French for quotation extraction and source attribution.
We first describe the composition of our corpus and the choices that were made in selecting the data.
We then detail our inter-annotator agreement between the 8 annotators who worked on manual labelling.
arXiv Detail & Related papers (2023-09-19T13:19:54Z)
- Building and Evaluating Universal Named-Entity Recognition English corpus [0.0]
This article presents the application of the Universal Named Entity framework to generate automatically annotated corpora.
By using a workflow that extracts Wikipedia data and meta-data and DBpedia information, we generated an English dataset which is described and evaluated.
arXiv Detail & Related papers (2022-12-14T11:32:24Z)
- Multilingual Word Sense Disambiguation with Unified Sense Representation [55.3061179361177]
We propose building knowledge-based and supervised Multilingual Word Sense Disambiguation (MWSD) systems.
We build unified sense representations for multiple languages and address the annotation scarcity problem for MWSD by transferring annotations from rich-sourced languages to poorer ones.
Evaluations on the SemEval-13 and SemEval-15 datasets demonstrate the effectiveness of our methodology.
arXiv Detail & Related papers (2022-10-14T01:24:03Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore utilising higher-order neighbourhood information to encode rich structures in constituent trees.
Experiments on the English and Chinese portions of the OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z)
- Annotation Curricula to Implicitly Train Non-Expert Annotators [56.67768938052715]
Voluntary studies often require annotators to familiarize themselves with the task, its annotation scheme, and the data domain.
This can be overwhelming in the beginning, mentally taxing, and can introduce errors into the resulting annotations.
We propose annotation curricula, a novel approach to implicitly train annotators.
arXiv Detail & Related papers (2021-06-04T09:48:28Z)
- AMALGUM -- A Free, Balanced, Multilayer English Web Corpus [14.073494095236027]
We present a genre-balanced English web corpus totaling 4M tokens.
By tapping open online data sources, the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets.
arXiv Detail & Related papers (2020-06-18T17:05:45Z)
- Soft Gazetteers for Low-Resource Named Entity Recognition [78.00856159473393]
We propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases into neural named entity recognition models.
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
arXiv Detail & Related papers (2020-05-04T21:58:02Z)
- A Corpus Study and Annotation Schema for Named Entity Recognition and Relation Extraction of Business Products [68.26059718611914]
We present a corpus study, an annotation schema and associated guidelines, for the annotation of product entity and company-product relation mentions.
We find that although product mentions are often realized as noun phrases, defining their exact extent is difficult due to high boundary ambiguity.
We present a preliminary corpus of English web and social media documents annotated according to the proposed guidelines.
arXiv Detail & Related papers (2020-04-07T11:45:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.