FRASIMED: a Clinical French Annotated Resource Produced through
Crosslingual BERT-Based Annotation Projection
- URL: http://arxiv.org/abs/2309.10770v1
- Date: Tue, 19 Sep 2023 17:17:28 GMT
- Title: FRASIMED: a Clinical French Annotated Resource Produced through
Crosslingual BERT-Based Annotation Projection
- Authors: Jamil Zaghir, Mina Bjelogrlic, Jean-Philippe Goldman, Souka\"ina
Aananou, Christophe Gaudet-Blavignac and Christian Lovis
- Abstract summary: This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection.
We present the creation of French Annotated Resource with Semantic Information for Medical Detection (FRASIMED), an annotated corpus comprising 2'051 synthetic clinical cases in French.
- Score: 0.6116681488656472
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Natural language processing (NLP) applications such as named entity
recognition (NER) for low-resource corpora do not benefit from recent advances
in the development of large language models (LLMs) where there is still a need
for larger annotated datasets. This research article introduces a methodology
for generating translated versions of annotated datasets through crosslingual
annotation projection. Leveraging a language agnostic BERT-based approach, it
is an efficient solution to increase low-resource corpora with few human
efforts and by only using already available open data resources. Quantitative
and qualitative evaluations are often lacking when it comes to evaluating the
quality and effectiveness of semi-automatic data generation strategies. The
evaluation of our crosslingual annotation projection approach showed both
effectiveness and high accuracy in the resulting dataset. As a practical
application of this methodology, we present the creation of French Annotated
Resource with Semantic Information for Medical Entities Detection (FRASIMED),
an annotated corpus comprising 2'051 synthetic clinical cases in French. The
corpus is now available for researchers and practitioners to develop and refine
French natural language processing (NLP) applications in the clinical field
(https://zenodo.org/record/8355629), making it the largest open annotated
corpus with linked medical concepts in French.
Related papers
- GPTs Are Multilingual Annotators for Sequence Generation Tasks [11.59128394819439]
This study proposes an autonomous annotation method by utilizing large language models.
We demonstrate that the proposed method is not just cost-efficient but also applicable for low-resource language annotation.
arXiv Detail & Related papers (2024-02-08T09:44:02Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - Advancing Italian Biomedical Information Extraction with
Transformers-based Models: Methodological Insights and Multicenter Practical
Application [0.27027468002793437]
Information Extraction can help clinical practitioners overcome the limitation by using automated text-mining pipelines.
We created the first Italian neuropsychiatric Named Entity Recognition dataset, PsyNIT, and used it to develop a Transformers-based model.
The lessons learned are: (i) the crucial role of a consistent annotation process and (ii) a fine-tuning strategy that combines classical methods with a "low-resource" approach.
arXiv Detail & Related papers (2023-06-08T16:15:46Z) - Cross-lingual Argument Mining in the Medical Domain [6.0158981171030685]
We show how to perform Argument Mining (AM) in medical texts for which no annotated data is available.
Our work shows that automatically translating and projecting annotations (data-transfer) from English to a given target language is an effective way to generate annotated data.
We also show how the automatically generated data in Spanish can also be used to improve results in the original English monolingual setting.
arXiv Detail & Related papers (2023-01-25T11:21:12Z) - CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual
Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z) - Efficient Entity Candidate Generation for Low-Resource Languages [13.789451365205665]
Candidate generation is a crucial module in entity linking.
It plays a key role in multiple NLP tasks that have been proven to beneficially leverage knowledge bases.
This paper constitutes an in-depth analysis of the candidate generation problem in the context of cross-lingual entity linking.
arXiv Detail & Related papers (2022-06-30T09:49:53Z) - Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of
Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named recognition (NER) and apply it to solve a low-resource and real-world challenge of code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke.
arXiv Detail & Related papers (2022-04-10T21:46:52Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.