CCAE: A Corpus of Chinese-based Asian Englishes
- URL: http://arxiv.org/abs/2310.05381v1
- Date: Mon, 9 Oct 2023 03:34:15 GMT
- Title: CCAE: A Corpus of Chinese-based Asian Englishes
- Authors: Yang Liu, Melissa Xiaohui Qin, Long Wang, and Chao Huang
- Abstract summary: This paper represents one of the first efforts to utilize NLP technology in the paradigm of World Englishes.
We present an overview of the CCAE -- Corpus of Chinese-based Asian English, a suite of corpora comprising six Chinese-based Asian English varieties.
- Score: 8.563253881619124
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Language models have become foundational across a wide range of NLP
applications, yet they have not been well applied in language variety studies,
even for a language as widely used as English. This paper represents one of the
first efforts to utilize NLP technology in the paradigm of World Englishes,
specifically in creating a multi-variety corpus for studying Asian Englishes. We
present an overview of CCAE -- the Corpus of Chinese-based Asian English, a
suite of corpora comprising six Chinese-based Asian English varieties. It is
based on 340 million tokens in 448 thousand web documents from six regions. The
ontology of the data makes the corpus a helpful resource with enormous research
potential for Asian Englishes (especially Chinese Englishes, for which no
publicly accessible corpus has been available so far) and an ideal source for
variety-specific language modeling and downstream tasks, thus setting the stage
for NLP-based World Englishes studies. Preliminary experiments on this corpus
demonstrate the practical value of CCAE. Finally, we make CCAE available at
https://huggingface.co/datasets/CCAE/CCAE-Corpus.
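Since CCAE is distributed through the Hugging Face Hub, it can presumably be loaded with the `datasets` library. The sketch below is a hedged example, not taken from the paper: the split and field layout are assumptions, and the dataset card at the URL above should be consulted for the actual configuration (e.g., whether each of the six varieties is a separate subset).

```python
# Minimal sketch: load CCAE from the Hugging Face Hub with the `datasets` library.
# The split/column layout below is an assumption; check the dataset card at
# https://huggingface.co/datasets/CCAE/CCAE-Corpus for the actual structure
# (e.g., per-variety configurations).
from datasets import load_dataset

ccae = load_dataset("CCAE/CCAE-Corpus")  # a config name may be required per variety

print(ccae)  # inspect the available splits and columns

# Peek at a few web documents from the first available split.
first_split = next(iter(ccae.values()))
for doc in first_split.select(range(3)):
    print(doc)
```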
Related papers
- Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models [52.00446751692225]
We present a novel, simple yet effective method called Dictionary Insertion Prompting (DIP).
When given a non-English prompt, DIP looks up a word dictionary and inserts the words' English counterparts into the prompt for LLMs.
This enables better translation into English and better English-mediated reasoning steps, which leads to noticeably better results (a minimal illustrative sketch of this idea appears after the related-papers list below).
arXiv Detail & Related papers (2024-11-02T05:10:50Z)
- Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance [6.907734681124986]
This paper strategically identifies the need for linguistic equity by examining several knowledge editing techniques in multilingual contexts.
We evaluate the performance of models such as Mistral, TowerInstruct, OpenHathi, Tamil-Llama, and Kan-Llama across languages including English, German, French, Italian, Spanish, Hindi, Tamil, and Kannada.
arXiv Detail & Related papers (2024-06-17T01:54:27Z)
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- YACLC: A Chinese Learner Corpus with Multidimensional Annotation [45.304130762057945]
We construct a large-scale Chinese learner corpus with multidimensional annotation.
By analyzing the original sentences and annotations in the corpus, we find that YACLC has a considerable size and very high annotation quality.
arXiv Detail & Related papers (2021-12-30T13:07:08Z)
- Cross-Lingual Training with Dense Retrieval for Document Retrieval [56.319511218754414]
We explore different transfer techniques for document ranking from English annotations to multiple non-English languages.
Experiments are conducted on test collections in six languages (Chinese, Arabic, French, Hindi, Bengali, Spanish) from diverse language families.
We find that weakly-supervised target-language transfer yields competitive performance compared with generation-based target-language transfer.
arXiv Detail & Related papers (2021-09-03T17:15:38Z)
- Igbo-English Machine Translation: An Evaluation Benchmark [3.0151383439513753]
We discuss our effort toward building a standard machine translation benchmark dataset for Igbo.
Igbo is spoken by more than 50 million people globally, with over 50% of the speakers in southeastern Nigeria.
arXiv Detail & Related papers (2020-04-01T18:06:21Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
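As referenced in the Dictionary Insertion Prompting (DIP) entry above, the core idea is to annotate non-English words in a prompt with English counterparts drawn from a bilingual dictionary before sending the prompt to an LLM. The following is a minimal sketch of that idea, not the authors' implementation; the toy German-English dictionary, the whitespace tokenizer, and the function name are all illustrative assumptions.

```python
# Minimal sketch of the dictionary-insertion idea (not the DIP authors' code).
# The toy German-English dictionary and example prompt are purely illustrative.

toy_dictionary = {
    "welche": "which",
    "ist": "is",
    "die": "the",
    "hauptstadt": "capital",
    "frankreich": "France",
}

def dictionary_insertion_prompt(prompt: str, dictionary: dict[str, str]) -> str:
    """Annotate each word found in the dictionary with its English counterpart."""
    annotated = []
    for word in prompt.split():
        key = word.lower().strip("?.,!")
        if key in dictionary:
            annotated.append(f"{word} ({dictionary[key]})")
        else:
            annotated.append(word)
    return " ".join(annotated)

if __name__ == "__main__":
    prompt = "Welche ist die Hauptstadt von Frankreich?"
    # The augmented prompt would then be sent to an LLM, which can translate
    # and reason in English before answering.
    print(dictionary_insertion_prompt(prompt, toy_dictionary))
```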