Low-resource Information Extraction with the European Clinical Case Corpus
- URL: http://arxiv.org/abs/2503.20568v1
- Date: Wed, 26 Mar 2025 14:07:40 GMT
- Title: Low-resource Information Extraction with the European Clinical Case Corpus
- Authors: Soumitra Ghosh, Begoña Altuna, Saeed Farzi, Pietro Ferrazzi, Alberto Lavelli, Giulia Mezzanotte, Manuela Speranza, Bernardo Magnini
- Abstract summary: We present E3C-3.0, a multilingual dataset in the medical domain. The dataset includes both native texts in five languages and texts translated and projected from the English source into five target languages. A semi-automatic approach has been implemented, including automatic annotation projection.
- Score: 4.747950273856823
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present E3C-3.0, a multilingual dataset in the medical domain, comprising clinical cases annotated with diseases and test-result relations. The dataset includes both native texts in five languages (English, French, Italian, Spanish and Basque) and texts translated and projected from the English source into five target languages (Greek, Italian, Polish, Slovak, and Slovenian). A semi-automatic approach has been implemented, combining automatic annotation projection based on Large Language Models (LLMs) with human revision. We present several experiments showing that current state-of-the-art LLMs can benefit from being fine-tuned on the E3C-3.0 dataset. We also show that transfer learning across languages is highly effective, mitigating data scarcity. Finally, we compare performance both on native data and on projected data. We release the data at https://huggingface.co/collections/NLP-FBK/e3c-projected-676a7d6221608d60e4e9fd89.
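As a quick orientation to the released data, the following minimal sketch (Python, using the Hugging Face `datasets` library) shows how one might load and inspect one of the per-language datasets from the collection linked above. The repository id, split name, and column names used here are illustrative assumptions; the actual dataset ids and schema are listed on the collection page.

```python
# Minimal sketch: load one projected E3C-3.0 dataset from the Hugging Face Hub
# and inspect a few annotated clinical cases.
# NOTE: the repository id, split name, and column names below are assumptions
# made for illustration; check the collection page for the real identifiers.
from datasets import load_dataset

# Hypothetical repository id for one per-language portion of the collection.
dataset = load_dataset("NLP-FBK/e3c-projected-it")

# Assume a standard "train" split; adjust to whatever splits the dataset exposes.
train = dataset["train"]
print(train)        # number of examples and column names
print(train[0])     # first example as a plain Python dict

# Iterate over a few examples, printing the text and any annotation fields
# (column names such as "text" and "annotations" are hypothetical).
for example in train.select(range(3)):
    print(example.get("text", "")[:200])
    for ann in example.get("annotations", []):
        print("  ", ann)
```

Once loaded this way, the per-language data can be fed to a standard fine-tuning pipeline for entity and relation extraction, which is the setting referred to by the fine-tuning and cross-lingual transfer experiments described in the abstract.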
Related papers
- Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language [34.54405113575568]
Machine-translated text from a single high-quality source language can contribute significantly to the pretraining of multilingual models.
We show that CuatroLLM matches or outperforms state-of-the-art multilingual models trained using closed data.
We release our corpus, models, and training pipeline under open licenses at hf.co/britllm/CuatroLLM.
arXiv Detail & Related papers (2024-10-31T14:09:50Z) - Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data? [112.0422370149713]
We tackle a task we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers. We show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources.
arXiv Detail & Related papers (2024-07-23T16:13:22Z) - Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, i.e., be crosslingual?
This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z) - UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z) - Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z) - Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z) - LLM-powered Data Augmentation for Enhanced Cross-lingual Performance [24.20730298894794]
This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in commonsense reasoning datasets.
To achieve this, we utilise several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets: XCOPA, XWinograd, and XStoryCloze.
We evaluate the effectiveness of fine-tuning smaller multilingual models, mBERT and XLMR, using the synthesised data.
arXiv Detail & Related papers (2023-05-23T17:33:27Z) - How do languages influence each other? Studying cross-lingual data sharing during LM fine-tuning [14.02101305717738]
Multilingual large language models (MLLMs) are jointly trained on data from many different languages.
It remains unclear to what extent, and under which conditions, languages rely on each other's data.
We find that MLLMs rely on data from multiple languages from the early stages of fine-tuning and that this reliance gradually increases as fine-tuning progresses.
arXiv Detail & Related papers (2023-05-22T17:47:41Z) - Soft Prompt Decoding for Multilingual Dense Retrieval [30.766917713997355]
We show that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance.
This is due to the heterogeneous and imbalanced nature of multilingual collections.
We present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space.
arXiv Detail & Related papers (2023-05-15T21:17:17Z) - Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z) - Cross-lingual Emotion Detection [6.767035411834297]
We consider English as the source language with Arabic and Spanish as target languages.
Our BERT-based monolingual models trained on target-language data surpass the state of the art (SOTA) by 4% and 5% absolute Jaccard score for Arabic and Spanish, respectively.
Next, we show that cross-lingual approaches using English data alone achieve more than 90% and 80% of the relative effectiveness of the Arabic and Spanish BERT models, respectively.
arXiv Detail & Related papers (2021-06-10T19:52:06Z) - Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation [54.52971020087777]
Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models.
Self-supervision improves zero-shot translation quality in multilingual models.
We get up to 33 BLEU on ro-en translation without any parallel data or back-translation.
arXiv Detail & Related papers (2020-05-11T00:20:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.