Novi jezički modeli za srpski jezik (New Language Models for the Serbian Language)
- URL: http://arxiv.org/abs/2402.14379v2
- Date: Fri, 23 Feb 2024 09:00:49 GMT
- Title: Novi jezički modeli za srpski jezik
- Authors: Mihailo Škorić
- Abstract summary: The paper will briefly present the development history of transformer-based language models for the Serbian language.
Ten selected vectorization models for Serbian, including two new ones, will be compared on four natural language processing tasks.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The paper will briefly present the development history of transformer-based
language models for the Serbian language. Several new models for text
generation and vectorization, trained on the resources of the Society for
Language Resources and Technologies, will also be presented. Ten selected
vectorization models for Serbian, including two new ones, will be compared on
four natural language processing tasks. The paper will analyze which models are
best for each selected task, how their size and the size of their training sets
affect performance on those tasks, and what the optimal setting is for training
the best language models for the Serbian language.
Related papers
- A Dataset and Strong Baselines for Classification of Czech News Texts [0.0]
We present the CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets.
We define four classification tasks: news source, news category, inferred author's gender, and day of the week.
We show that language-specific pre-trained encoders outperform selected commercially available large-scale generative language models.
arXiv Detail & Related papers (2023-07-20T07:47:08Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been made to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model to multiple languages.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect [0.0]
We investigate the feasibility of training monolingual Transformer-based language models for under-represented languages.
We show that using noisy web-crawled data instead of structured data is more convenient for such a non-standardized language.
Our best performing TunBERT model reaches or improves the state-of-the-art in all three downstream tasks.
arXiv Detail & Related papers (2021-11-25T15:49:50Z)
- Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
The proposed approach replaces the shared vocabulary with a small language-specific vocabulary and fine-tunes only the new embeddings on the new language's parallel data; a minimal sketch of this idea appears after the related-papers list below.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade.
arXiv Detail & Related papers (2021-10-20T10:38:57Z)
- Scribosermo: Fast Speech-to-Text models for German and other Languages [69.7571480246023]
This paper presents Speech-to-Text models for German, as well as for Spanish and French with special features.
They are small and run in real time on small devices such as a Raspberry Pi.
Using a pretrained English model, they can be trained on consumer-grade hardware with a relatively small dataset.
arXiv Detail & Related papers (2021-10-15T10:10:34Z)
- Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan [0.05277024349608833]
This work focuses on Catalan with the aim of exploring to what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models.
We build a clean, high-quality textual Catalan corpus (CaText), train a Transformer-based language model for Catalan (BERTa), and devise a thorough evaluation in a diversity of settings.
The result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB), which we publish as an open resource.
arXiv Detail & Related papers (2021-07-16T13:52:01Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model with character language models trained on varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Testing pre-trained Transformer models for Lithuanian news clustering [0.0]
Non-English languages could not leverage these new opportunities offered by models pre-trained on English text.
We compare pre-trained multilingual BERT, XLM-R, and older learned text representation methods as encodings for the task of Lithuanian news clustering.
Our results indicate that publicly available pre-trained multilingual Transformer models can be fine-tuned to surpass word vectors but still score much lower than specially trained doc2vec embeddings.
arXiv Detail & Related papers (2020-04-03T14:41:54Z)
- Give your Text Representation Models some Love: the Case for Basque [24.76979832867631]
Word embeddings and pre-trained language models make it possible to build rich representations of text.
Many small companies and research groups tend to use models that have been pre-trained and made available by third parties.
This is suboptimal as, for many languages, the models have been trained on smaller (or lower quality) corpora.
We show that a number of monolingual models trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks.
arXiv Detail & Related papers (2020-03-31T18:01:56Z)
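The language-specific-embeddings idea referenced in the continual-learning entry above can be illustrated with a short, generic sketch: freeze every parameter of a pretrained model and train only a newly initialised embedding table for the new language's vocabulary. The model name, vocabulary size, and placeholder loss are illustrative assumptions, and a plain encoder stands in for the authors' NMT setup.

# Minimal sketch of the language-specific-embeddings idea: the pretrained
# backbone is frozen, and only a fresh embedding table for the new language's
# vocabulary receives gradient updates. Generic illustration, not the authors' code.
import torch
import torch.nn as nn
from transformers import AutoModel

class NewLanguageAdapter(nn.Module):
    def __init__(self, backbone_name: str = "xlm-roberta-base",
                 new_vocab_size: int = 8000):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        for param in self.backbone.parameters():
            param.requires_grad = False  # original languages stay untouched
        hidden = self.backbone.config.hidden_size
        # The only trainable parameters: embeddings for the new vocabulary.
        self.new_embeddings = nn.Embedding(new_vocab_size, hidden)

    def forward(self, new_lang_ids: torch.Tensor) -> torch.Tensor:
        embeds = self.new_embeddings(new_lang_ids)             # (B, T, H)
        return self.backbone(inputs_embeds=embeds).last_hidden_state

model = NewLanguageAdapter()
optimizer = torch.optim.Adam(model.new_embeddings.parameters(), lr=1e-4)

dummy_ids = torch.randint(0, 8000, (2, 12))  # stand-in for new-language tokens
out = model(dummy_ids)                        # (2, 12, hidden_size)
loss = out.pow(2).mean()                      # placeholder loss for the sketch
loss.backward()
optimizer.step()                              # only the new embedding table changes

Because gradients flow through the frozen backbone only to reach the new embedding table, the original model's weights, and hence its behaviour on the initial languages, remain exactly as they were.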