FreCDo: A Large Corpus for French Cross-Domain Dialect Identification
- URL: http://arxiv.org/abs/2212.07707v1
- Date: Thu, 15 Dec 2022 10:32:29 GMT
- Title: FreCDo: A Large Corpus for French Cross-Domain Dialect Identification
- Authors: Mihaela Gaman, Adrian-Gabriel Chifu, William Domingues, Radu Tudor
Ionescu
- Abstract summary: We present a novel corpus for French dialect identification comprising 413,522 French text samples.
The training, validation and test splits are collected from different news websites.
This leads to a French cross-domain (FreCDo) dialect identification task.
- Score: 22.132457694021184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel corpus for French dialect identification comprising
413,522 French text samples collected from public news websites in Belgium,
Canada, France and Switzerland. To ensure an accurate estimation of the dialect
identification performance of models, we designed the corpus to eliminate
potential biases related to topic, writing style, and publication source. More
precisely, the training, validation and test splits are collected from
different news websites, while searching for different keywords (topics). This
leads to a French cross-domain (FreCDo) dialect identification task. We conduct
experiments with four competitive baselines: a fine-tuned CamemBERT model, an
XGBoost classifier based on fine-tuned CamemBERT features, a Support Vector
Machines (SVM) classifier based on fine-tuned CamemBERT features, and an SVM
based on word n-grams. Aside from presenting quantitative results, we also
analyze the most discriminative features learned by CamemBERT. Our corpus is
available at https://github.com/MihaelaGaman/FreCDo.
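The word n-gram baseline can be illustrated with a minimal sketch. This is a pure-Python toy, not the paper's implementation: the SVM is replaced here by a simple nearest-centroid classifier for brevity, and the training sentences are hypothetical examples rather than FreCDo samples.

```python
from collections import Counter

def word_ngrams(text, n_values=(1, 2)):
    """Extract word uni- and bigram counts from a text sample."""
    tokens = text.lower().split()
    feats = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

class NearestCentroid:
    """Toy stand-in for the SVM baseline: classify a sample by cosine
    similarity between its n-gram vector and per-dialect centroids."""
    def fit(self, texts, labels):
        self.centroids = {}
        for text, label in zip(texts, labels):
            self.centroids.setdefault(label, Counter()).update(word_ngrams(text))
        return self

    def predict(self, text):
        feats = word_ngrams(text)
        def cosine(a, b):
            num = sum(a[k] * b[k] for k in a)
            den = (sum(v * v for v in a.values()) ** 0.5
                   * sum(v * v for v in b.values()) ** 0.5)
            return num / den if den else 0.0
        return max(self.centroids, key=lambda lab: cosine(feats, self.centroids[lab]))

# Hypothetical toy samples (not from FreCDo): "septante" (70) is a
# Belgian/Swiss form, "soixante-dix" the France-French form.
train_texts = ["septante minutes de retard", "soixante-dix minutes de retard"]
train_labels = ["BE", "FR"]
clf = NearestCentroid().fit(train_texts, train_labels)
print(clf.predict("le train a septante minutes"))  # "BE" (shares "septante")
```

In the actual corpus, such lexical regionalisms are exactly the kind of discriminative features the paper's analysis of CamemBERT surfaces; the cross-domain splits are designed so that topic or source cues cannot substitute for them.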
Related papers
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z) - Entity-Assisted Language Models for Identifying Check-worthy Sentences [23.792877053142636]
We propose a new uniform framework for text classification and ranking.
Our framework combines the semantic analysis of the sentences, with additional entity embeddings obtained through the identified entities within the sentences.
We extensively evaluate the effectiveness of our framework using two publicly available datasets from the CLEF's 2019 & 2020 CheckThat! Labs.
arXiv Detail & Related papers (2022-11-19T12:03:30Z) - WEKA-Based: Key Features and Classifier for French of Five Countries [4.704992432252233]
This paper describes a French dialect recognition system that will appropriately distinguish between different regional French dialects.
A corpus covering five regions (Monaco, French-speaking Belgium, French-speaking Switzerland, French-speaking Canada, and France) is built with the Sketch Engine.
The content of the corpus is related to the four themes of eating, drinking, sleeping and living, which are closely linked to popular life.
arXiv Detail & Related papers (2022-11-10T10:35:34Z) - FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z) - Pre-training Data Quality and Quantity for a Low-Resource Language: New
Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu)
arXiv Detail & Related papers (2022-05-21T06:44:59Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early
Modern French [57.886210204774834]
We present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries).
We present the FreEM_max corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEM_max.
arXiv Detail & Related papers (2022-02-18T22:17:22Z) - A Warm Start and a Clean Crawled Corpus -- A Recipe for Good Language
Models [0.0]
We train several language models for Icelandic, including IceBERT, that achieve state-of-the-art performance in a variety of downstream tasks.
We introduce a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of high quality texts found online by targeting the Icelandic top-level-domain (TLD)
We show that a properly cleaned crawled corpus is sufficient to achieve state-of-the-art results in NLP applications for low to medium resource languages.
arXiv Detail & Related papers (2022-01-14T18:45:31Z) - ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality
Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models as well as provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z) - XL-WiC: A Multilingual Benchmark for Evaluating Semantic
Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.