HmBlogs: A big general Persian corpus
- URL: http://arxiv.org/abs/2111.02362v1
- Date: Wed, 3 Nov 2021 17:26:52 GMT
- Title: HmBlogs: A big general Persian corpus
- Authors: Hamzeh Motahari Khansari, Mehrnoush Shamsfard
- Abstract summary: This paper introduces the hmBlogs corpus for Persian, a low-resource language.
The corpus was built from a collection of nearly 20 million blog posts, gathered over a period of about 15 years from the Persian blogosphere.
The authors claim it is currently the largest corpus that has been prepared independently for the Persian language.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces the hmBlogs corpus for Persian, a low-resource
language. The corpus was built from a collection of nearly 20 million blog posts,
gathered over a period of about 15 years from the Persian blogosphere, and contains
more than 6.8 billion tokens. It can be claimed that this is currently the largest
corpus prepared independently for the Persian language. The corpus is provided in both
raw and preprocessed forms, and several word embedding models are trained on the
preprocessed corpus. Using these models, hmBlogs is compared with some of the most
important corpora available for Persian, and the results show the superiority of the
hmBlogs corpus over the others. These evaluations also demonstrate the importance and
effects of the corpora, evaluation datasets, model production methods, hyperparameters,
and even the evaluation methods themselves. In addition to evaluating the corpus and
the language models produced from it, this research also presents a semantic analogy
dataset.
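The abstract does not give implementation details, but the workflow it outlines (training word embedding models on the preprocessed corpus and scoring them against a semantic analogy dataset) can be sketched briefly. The sketch below uses gensim's Word2Vec as one possible realization; all file names and hyperparameter values are illustrative assumptions, not the authors' actual settings.

```python
# Minimal sketch, assuming a preprocessed, tokenized corpus with one sentence per line
# and an analogy file in the standard word2vec "questions" format. Paths and
# hyperparameters are placeholders, not the settings used in the paper.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream sentences from disk so a multi-billion-token corpus never has to fit in memory.
sentences = LineSentence("hmblogs_preprocessed.txt")  # hypothetical path

model = Word2Vec(
    sentences=sentences,
    vector_size=300,   # embedding dimensionality (one of the hyperparameters the paper varies)
    window=5,
    min_count=10,
    sg=1,              # skip-gram; CBOW (sg=0) is another variant such a study could compare
    workers=8,
    epochs=5,
)

# Score the embeddings on a semantic analogy dataset (e.g. lines like
# "Tehran Iran Paris France"); the paper contributes such a dataset for Persian.
score, sections = model.wv.evaluate_word_analogies("persian_analogies.txt")
print(f"analogy accuracy: {score:.3f}")

model.wv.save("hmblogs_w2v.kv")
```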
Related papers
- DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models [78.51470038301436]
We introduce DecorateLM, a data engineering method designed to refine the pretraining corpus through data rating, tagging and editing.
We then apply DecorateLM to enhance 100 billion tokens of the training corpus, selecting 45 billion tokens that exemplify high quality and diversity for further training of another 1.2 billion parameter LLM.
Our results demonstrate that employing such high-quality data can significantly boost model performance, showcasing a powerful approach to enhance the quality of the pretraining corpus.
arXiv Detail & Related papers (2024-10-08T02:42:56Z)
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- Lahjoita puhetta -- a large-scale corpus of spoken Finnish with some benchmarks [9.160401226886947]
The Donate Speech campaign has so far succeeded in gathering approximately 3600 hours of ordinary, colloquial Finnish speech.
The primary goals of the collection were to create a representative, large-scale resource to study spontaneous spoken Finnish and to accelerate the development of language technology and speech-based services.
We present the collection process and the collected corpus, and showcase its versatility through multiple use cases.
arXiv Detail & Related papers (2022-03-24T07:50:25Z)
- What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus [77.34726150561087]
We analyze the Common Crawl, a colossal web corpus extensively used for training language models.
We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after filtering procedures.
arXiv Detail & Related papers (2021-05-06T14:49:43Z)
- An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets [94.37521840642141]
We present the full-size Russian complexly NER-labeled corpus of Internet user reviews.
A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
arXiv Detail & Related papers (2021-04-30T19:46:24Z)
- The birth of Romanian BERT [1.377045689881944]
This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus.
We discuss corpus composition and cleaning, the model training process, as well as an extensive evaluation of the model on various Romanian datasets.
arXiv Detail & Related papers (2020-09-18T09:30:48Z)
- A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
It comprises aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
- Mapping Languages: The Corpus of Global Language Use [0.0]
This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping.
In total, the corpus contains 423 billion words representing 148 languages and 158 countries.
arXiv Detail & Related papers (2020-04-02T03:42:14Z)
- CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model [15.469228003507919]
We introduce the Chinese corpus from CLUE organization, CLUECorpus 2020.
It contains 100 GB of raw text with 35 billion Chinese characters, retrieved from Common Crawl.
We release a new Chinese vocabulary with a size of 8K, which is only one-third of the vocabulary size used in the Chinese BERT released by Google.
arXiv Detail & Related papers (2020-03-03T06:39:27Z)