READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input
Noises
- URL: http://arxiv.org/abs/2302.07324v2
- Date: Thu, 25 May 2023 01:04:08 GMT
- Title: READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input
Noises
- Authors: Chenglei Si, Zhengyan Zhang, Yingfa Chen, Xiaozhi Wang, Zhiyuan Liu,
Maosong Sun
- Abstract summary: We construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises.
READIN contains four diverse tasks and requests annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input.
We experiment with a series of strong pretrained language models as well as robust training methods and find that these models often suffer significant performance drops on READIN.
- Score: 87.70001456418504
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For many real-world applications, the user-generated inputs usually contain
various noises due to speech recognition errors caused by linguistic
variations or typographical errors (typos). Thus, it is crucial to test model
performance on data with realistic input noises to ensure robustness and
fairness. However, little study has been done to construct such benchmarks for
Chinese, where various language-specific input noises happen in the real world.
In order to fill this important gap, we construct READIN: a Chinese multi-task
benchmark with REalistic And Diverse Input Noises. READIN contains four diverse
tasks and requests annotators to re-enter the original test data with two
commonly used Chinese input methods: Pinyin input and speech input. We designed
our annotation pipeline to maximize diversity, for example by instructing the
annotators to use diverse input method editors (IMEs) for keyboard noises and
recruiting speakers from diverse dialectal groups for speech noises. We
experiment with a series of strong pretrained language models as well as robust
training methods and find that these models often suffer significant
performance drops on READIN even with robustness methods like data
augmentation. As the first large-scale attempt in creating a benchmark with
noises geared towards user-generated inputs, we believe that READIN serves as
an important complement to existing Chinese NLP benchmarks. The source code and
dataset can be obtained from https://github.com/thunlp/READIN.
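
The performance drop described in the abstract can be illustrated with a minimal sketch that scores the same model on clean test inputs and on their re-entered noisy versions. Everything here is an assumption made for illustration: the helper names (`accuracy`, `robustness_report`), the keyword-based `toy_predict` classifier, and the two-sentence toy data are hypothetical and are not taken from the READIN code or dataset.

```python
from typing import Callable, Dict, Sequence


def accuracy(predict: Callable[[str], str],
             inputs: Sequence[str],
             labels: Sequence[str]) -> float:
    """Fraction of inputs whose predicted label matches the gold label."""
    correct = sum(1 for x, y in zip(inputs, labels) if predict(x) == y)
    return correct / len(labels)


def robustness_report(predict: Callable[[str], str],
                      clean_inputs: Sequence[str],
                      noisy_inputs: Sequence[str],
                      labels: Sequence[str]) -> Dict[str, float]:
    """Compare accuracy on clean inputs and on their noisy re-entries.

    clean_inputs[i] and noisy_inputs[i] are assumed to be the same test
    example, so both share the gold label labels[i].
    """
    clean_acc = accuracy(predict, clean_inputs, labels)
    noisy_acc = accuracy(predict, noisy_inputs, labels)
    return {"clean": clean_acc, "noisy": noisy_acc, "drop": clean_acc - noisy_acc}


def toy_predict(text: str) -> str:
    """Hypothetical keyword classifier standing in for a real model."""
    return "positive" if "好" in text else "negative"


# Invented toy data: the first noisy sentence replaces 好 with 号; both
# characters are typed as "hao" in Pinyin, so this mimics a wrong
# candidate being selected in a Pinyin IME.
clean = ["这个电影很好", "服务太差了"]
noisy = ["这个电影很号", "服务太差了"]
labels = ["positive", "negative"]

report = robustness_report(toy_predict, clean, noisy, labels)
print(report)
```

Pairing each noisy input with its clean original keeps the gold labels fixed, so any change in accuracy is attributable to the input noise alone.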
Related papers
- Take the Hint: Improving Arabic Diacritization with
Partially-Diacritized Text [4.863310073296471]
We propose 2SDiac, a multi-source model that can effectively support optional diacritics in input to inform all predictions.
We also introduce Guided Learning, a training scheme to leverage given diacritics in input with different levels of random masking.
arXiv Detail & Related papers (2023-06-06T10:18:17Z)
- Robustification of Multilingual Language Models to Real-world Noise with
Robust Contrastive Pretraining [14.087882550564169]
Prior work that assesses the robustness of neural models on noisy data and suggests improvements is limited to the English language.
To benchmark the performance of pretrained multilingual models, we construct noisy datasets covering five languages and four NLP tasks.
We propose Robust Contrastive Pretraining (RCP) to boost the zero-shot cross-lingual robustness of multilingual pretrained models.
arXiv Detail & Related papers (2022-10-10T15:40:43Z)
- Intent Classification Using Pre-Trained Embeddings For Low Resource
Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language specific Automatic Speech Recognition is an important yet less explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low resource scenarios.
We perform experiments across three different languages: English, Sinhala, and Tamil each with different data sizes to simulate high, medium, and low resource scenarios.
arXiv Detail & Related papers (2021-10-18T13:06:59Z)
- Understanding Model Robustness to User-generated Noisy Texts [2.958690090551675]
In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors.
We propose to model the errors statistically from grammatical-error-correction corpora.
arXiv Detail & Related papers (2021-10-14T14:54:52Z)
- Learning from Multiple Noisy Augmented Data Sets for Better
Cross-Lingual Spoken Language Understanding [69.40915115518523]
Lack of training data presents a grand challenge to scaling out spoken language understanding (SLU) to low-resource languages.
Various data augmentation approaches have been proposed to synthesize training data in low-resource target languages.
In this paper we focus on mitigating noise in augmented data.
arXiv Detail & Related papers (2021-09-03T15:44:15Z)
- FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark [8.158067688043554]
This work first introduces Chinese Few-shot Learning Evaluation Benchmark (FewCLUE), the first comprehensive small sample evaluation benchmark in Chinese.
An unlabeled training set with up to 20,000 additional samples per task is provided, allowing researchers to explore better ways of using unlabeled samples.
Next, we implement a set of state-of-the-art few-shot learning methods, and compare their performance with fine-tuning and zero-shot learning schemes on the newly constructed FewCLUE benchmark.
arXiv Detail & Related papers (2021-07-15T17:51:25Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language
Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose three kinds of tokenizers, including 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained
Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
- One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech [3.42658286826597]
We introduce an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation.
Our model is shown to effectively share information across languages and according to a subjective evaluation test, it produces more natural and accurate code-switching speech than the baselines.
arXiv Detail & Related papers (2020-08-03T10:43:30Z)