The Claire French Dialogue Dataset
- URL: http://arxiv.org/abs/2311.16840v1
- Date: Tue, 28 Nov 2023 14:55:22 GMT
- Title: The Claire French Dialogue Dataset
- Authors: Julie Hunter, Jérôme Louradour, Virgile Rennard, Ismaïl Harrando, Guokan Shang, Jean-Pierre Lorré
- Abstract summary: This paper describes the 24 individual corpora of which CFDD is composed and provides links and citations to their original sources.
It also provides our proposed breakdown of the full CFDD dataset into eight categories of subcorpora and describes the process we followed to standardize the format of the final dataset.
- Score: 9.45456707528025
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present the Claire French Dialogue Dataset (CFDD), a resource created by
members of LINAGORA Labs in the context of the OpenLLM France initiative. CFDD
is a corpus containing roughly 160 million words from transcripts and stage
plays in French that we have assembled and publicly released in an effort to
further the development of multilingual, open source language models. This
paper describes the 24 individual corpora of which CFDD is composed and
provides links and citations to their original sources. It also provides our
proposed breakdown of the full CFDD dataset into eight categories of subcorpora
and describes the process we followed to standardize the format of the final
dataset. We conclude with a discussion of similar work and future directions.
Related papers
- A French Version of the OLDI Seed Corpus [20.630120942837564]
We present the first French partition of the OLDI Seed Corpus, our submission to the WMT 2025 Open Language Data Initiative (OLDI) shared task.
We detail its creation process, which involved using multiple machine translation systems and a custom-built interface for post-editing by qualified native speakers.
This French corpus is intended as a crucial pivot resource to facilitate the collection of parallel corpora for the under-resourced regional languages of France.
arXiv Detail & Related papers (2025-08-04T10:57:54Z)
- Building a Functional Machine Translation Corpus for Kpelle [0.0]
This paper introduces the first publicly available English-Kpelle dataset for machine translation.
By fine-tuning Meta's No Language Left Behind (NLLB) model on two versions of the dataset, we achieved BLEU scores of up to 30 in the Kpelle-to-English direction.
arXiv Detail & Related papers (2025-05-24T23:39:34Z)
- FFSTC: Fongbe to French Speech Translation Corpus [0.0]
We introduce the Fongbe to French Speech Translation Corpus (FFSTC) for the first time.
This corpus encompasses approximately 31 hours of collected Fongbe language content, featuring both French transcriptions and corresponding Fongbe voice recordings.
arXiv Detail & Related papers (2024-03-08T17:53:58Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- CDA: a Cost Efficient Content-based Multilingual Web Document Aligner [97.98885151955467]
We introduce a Content-based Document Alignment approach to align multilingual web documents based on content.
We leverage lexical translation models to build vector representations using TF-IDF.
Experiments show that CDA is robust, cost-effective, and is significantly superior in (i) processing large and noisy web data and (ii) scaling to new and low-resourced languages.
arXiv Detail & Related papers (2021-02-20T03:37:23Z)
- FFR v1.1: Fon-French Neural Machine Translation [0.012691047660244334]
The FFR project is a major step towards creating a robust translation model from Fon, a very low-resource and tonal language, to French.
In this paper, we introduce the FFR dataset, a corpus of Fon-to-French translations, describe the diacritical encoding process, and introduce our FFR v1.1 model.
arXiv Detail & Related papers (2020-06-14T04:27:12Z)
- FQuAD: French Question Answering Dataset [0.4759823735082845]
We introduce the French Question Answering Dataset (FQuAD).
FQuAD is a French Native Reading dataset of questions and answers on a set of Wikipedia articles.
We train a baseline model which achieves an F1 score of 92.2 and an exact match ratio of 82.1 on the test set.
arXiv Detail & Related papers (2020-02-14T15:23:38Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.