WEKA-Based: Key Features and Classifier for French of Five Countries
- URL: http://arxiv.org/abs/2212.08132v1
- Date: Thu, 10 Nov 2022 10:35:34 GMT
- Title: WEKA-Based: Key Features and Classifier for French of Five Countries
- Authors: Zeqian Li, Keyu Qiu, Chenxu Jiao, Wen Zhu, Haoran Tang
- Abstract summary: This paper describes a French dialect recognition system that will appropriately distinguish between different regional French dialects.
A corpus of five regions (Monaco, French-speaking Belgium, French-speaking Switzerland, French-speaking Canada and France) is targeted for construction with the Sketch Engine.
The content of the corpus is related to the four themes of eating, drinking, sleeping and living, which are closely linked to everyday life.
- Score: 4.704992432252233
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: This paper describes a French dialect recognition system that will
appropriately distinguish between different regional French dialects. A corpus
of five regions (Monaco, French-speaking Belgium, French-speaking
Switzerland, French-speaking Canada and France) is targeted for
construction with the Sketch Engine. The content of the corpus is related to
the four themes of eating, drinking, sleeping and living, which are closely
linked to everyday life. The experimental results were obtained with a
Python-coded pre-processor and the Waikato Environment for
Knowledge Analysis (WEKA) data analysis tool, which contains many filters and
classifiers for machine learning.
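The abstract does not show the pre-processor itself. As a rough illustration only, the sketch below shows one way a Python pre-processor could hand labelled text to WEKA: it normalises each sample and writes it in WEKA's ARFF format with a string attribute and a nominal class. The region labels, the `clean`, `quote` and `to_arff` helpers, and the relation name are all hypothetical and not taken from the paper.

```python
import re

# Hypothetical class labels, based on the five regions named in the abstract.
REGIONS = ["Monaco", "Belgium", "Switzerland", "Canada", "France"]


def clean(text):
    """Lowercase the text and collapse runs of whitespace to one space."""
    return re.sub(r"\s+", " ", text.strip().lower())


def quote(text):
    """Wrap a string in single quotes for ARFF, escaping \\ and '."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_arff(samples, relation="french_dialects"):
    """Build ARFF content from (text, region) pairs.

    The relation name and attribute names are assumptions for illustration.
    """
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute region {" + ",".join(REGIONS) + "}",
        "",
        "@data",
    ]
    for text, region in samples:
        if region not in REGIONS:
            raise ValueError(f"unknown region: {region}")
        lines.append(f"{quote(clean(text))},{region}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [("  J'ai  faim ", "France"), ("On va souper", "Canada")]
    print(to_arff(demo))
```

A file produced this way could then be loaded in WEKA, where a filter such as StringToWordVector turns the string attribute into word features before a classifier is applied; whether the authors used that filter is not stated here.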
Related papers
- ViSpeR: Multilingual Audio-Visual Speech Recognition [9.40993779729177]
This work presents an extensive and detailed study on Audio-Visual Speech Recognition for five widely spoken languages.
We have collected large-scale datasets for each language except for English, and have engaged in the training of supervised learning models.
Our model, ViSpeR, is trained in a multi-lingual setting, resulting in competitive performance on newly established benchmarks for each language.
arXiv Detail & Related papers (2024-05-27T14:48:51Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for the adaptability of knowledge editing methods across five languages.
MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice.
We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z)
- BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer [81.5984433881309]
We introduce BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format.
BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer.
Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T08:06:33Z)
- FreCDo: A Large Corpus for French Cross-Domain Dialect Identification [22.132457694021184]
We present a novel corpus for French dialect identification comprising 413,522 French text samples.
The training, validation and test splits are collected from different news websites.
This leads to a French cross-domain (FreCDo) dialect identification task.
arXiv Detail & Related papers (2022-12-15T10:32:29Z)
- Benchmarking Transformers-based models on French Spoken Language Understanding tasks [4.923118300276026]
We benchmark 13 Transformer-based models on two spoken language understanding tasks for French: MEDIA and ATIS-FR.
We show that compact models can reach comparable results to bigger ones while their ecological impact is considerably lower.
arXiv Detail & Related papers (2022-07-19T09:47:08Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages [0.0]
We train monolingual contextualized word embeddings (ELMo) for five mid-resource languages.
We compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks.
We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia.
arXiv Detail & Related papers (2020-06-11T05:25:18Z)
- Automatic Discourse Segmentation: an evaluation in French [65.00134288222509]
We describe some discursive segmentation methods as well as a preliminary evaluation of the segmentation quality.
We have developed three models based solely on resources that are available in several languages at once: marker lists and statistical POS labeling.
arXiv Detail & Related papers (2020-02-10T21:35:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.