Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish
Biomedical Language Models
- URL: http://arxiv.org/abs/2109.07765v1
- Date: Thu, 16 Sep 2021 07:22:28 GMT
- Title: Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish
Biomedical Language Models
- Authors: Casimiro Pio Carrino, Jordi Armengol-Estapé, Ona de Gibert Bonet,
Asier Gutiérrez-Fandiño, Aitor Gonzalez-Agirre, Martin Krallinger, Marta
Villegas
- Abstract summary: CoWeSe is the result of a massive crawl of 3,000 Spanish domains executed in 2020.
The corpus is openly available and already preprocessed.
CoWeSe is an important resource for biomedical and health NLP in Spanish.
- Score: 0.05277024349608833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce CoWeSe (the Corpus Web Salud Español), the largest Spanish
biomedical corpus to date, consisting of 4.5GB (about 750M tokens) of clean
plain text. CoWeSe is the result of a massive crawl of 3,000 Spanish domains
executed in 2020. The corpus is openly available and already preprocessed.
CoWeSe is an important resource for biomedical and health NLP in Spanish and
has already been employed to train domain-specific language models and to
produce word embeddings. We released the CoWeSe corpus under a Creative Commons
Attribution 4.0 International license in Zenodo
(https://zenodo.org/record/4561971#.YTI5SnVKiEA).
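Since CoWeSe is distributed as preprocessed plain text and has already been used to produce word embeddings, a minimal usage sketch follows. It assumes the Zenodo archive has been downloaded and extracted to a local file named corpus_cowese.txt (a hypothetical filename, with roughly one sentence or document per line) and uses gensim's Word2Vec. This is an illustrative sketch under those assumptions, not the authors' actual training pipeline.

```python
# Minimal sketch: train word embeddings on the CoWeSe plain-text corpus.
# "corpus_cowese.txt" is a hypothetical local filename for the extracted
# Zenodo archive; adjust the path to wherever the corpus was downloaded.
from gensim.models import Word2Vec


class PlainTextLines:
    """Stream a large plain-text corpus line by line to keep memory flat (~4.5GB file)."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                tokens = line.lower().split()  # naive whitespace tokenization
                if tokens:
                    yield tokens


sentences = PlainTextLines("corpus_cowese.txt")
model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)
model.save("cowese_word2vec.model")

# Example query: nearest neighbours of a Spanish biomedical term.
print(model.wv.most_similar("diabetes", topn=5))
```

Streaming the file line by line avoids loading the whole corpus into RAM; any other embedding or language-model toolkit that accepts an iterator over tokenized lines would work the same way.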
Related papers
- Towards Robust Speech Representation Learning for Thousands of Languages [77.2890285555615]
Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data.
We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages.
arXiv Detail & Related papers (2024-06-30T21:40:26Z)
- Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain [19.58987478434808]
We present Medical mT5, the first open-source text-to-text multilingual model for the medical domain.
A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models for the Spanish, French, and Italian benchmarks.
arXiv Detail & Related papers (2024-04-11T10:01:32Z)
- MegaWika: Millions of reports and their sources across 50 diverse languages [74.3909725023673]
MegaWika consists of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials.
We process this dataset for a myriad of applications, including translating non-English articles for cross-lingual applications.
MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual.
arXiv Detail & Related papers (2023-07-13T20:04:02Z)
- esCorpius: A Massive Spanish Crawling Corpus [2.262838186547612]
esCorpius is a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data.
It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content.
arXiv Detail & Related papers (2022-06-30T09:29:18Z)
- Multilingual Open Text 1.0: Public Domain News in 44 Languages [2.642698101441705]
The first release of the corpus contains over 2.7 million news articles and 1 million shorter passages published between 2001 and 2021.
The source material is in the public domain, our collection is licensed under a Creative Commons license (CC BY 4.0), and all software used to create the corpus is released under the MIT License.
arXiv Detail & Related papers (2022-01-14T18:58:17Z)
- An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets [94.37521840642141]
We present a full-size Russian corpus of Internet user drug reviews with complex NER labels.
A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
arXiv Detail & Related papers (2021-04-30T19:46:24Z)
- A Multilingual Neural Machine Translation Model for Biomedical Data [84.17747489525794]
We release a multilingual neural machine translation model, which can be used to translate text in the biomedical domain.
The model can translate from 5 languages (French, German, Italian, Korean and Spanish) into English.
It is trained with large amounts of generic and biomedical data, using domain tags.
arXiv Detail & Related papers (2020-08-06T21:26:43Z)
- GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines [4.370297546680015]
GGPONC is a freely distributable German language corpus based on clinical practice guidelines for oncology.
GGPONC is the first corpus for the German language covering diverse conditions in a large medical subfield.
By applying and evaluating existing medical information extraction pipelines for German text, we are able to draw comparisons regarding the use of medical language.
arXiv Detail & Related papers (2020-07-13T14:25:49Z)
- A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
It provides aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.