FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark
- URL: http://arxiv.org/abs/2107.07498v1
- Date: Thu, 15 Jul 2021 17:51:25 GMT
- Title: FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark
- Authors: Liang Xu, Xiaojing Lu, Chenyang Yuan, Xuanwei Zhang, Hu Yuan, Huilin
Xu, Guoao Wei, Xiang Pan, Hai Hu
- Abstract summary: This work first introduces the Chinese Few-shot Learning Evaluation Benchmark (FewCLUE), the first comprehensive small sample evaluation benchmark in Chinese.
An unlabeled training set with up to 20,000 additional samples per task is provided, allowing researchers to explore better ways of using unlabeled samples.
Next, we implement a set of state-of-the-art few-shot learning methods, and compare their performance with fine-tuning and zero-shot learning schemes on the newly constructed FewCLUE benchmark.
- Score: 8.158067688043554
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretrained Language Models (PLMs) have achieved tremendous success in natural
language understanding tasks. While different learning schemes -- fine-tuning,
zero-shot and few-shot learning -- have been widely explored and compared for
languages such as English, there is comparatively little work in Chinese to
fairly and comprehensively evaluate and compare these methods. This work first
introduces the Chinese Few-shot Learning Evaluation Benchmark (FewCLUE), the first
comprehensive small sample evaluation benchmark in Chinese. It includes nine
tasks, ranging from single-sentence and sentence-pair classification tasks to
machine reading comprehension tasks. Given the high variance of the few-shot
learning performance, we provide multiple training/validation sets to
facilitate a more accurate and stable evaluation of few-shot modeling. An
unlabeled training set with up to 20,000 additional samples per task is
provided, allowing researchers to explore better ways of using unlabeled
samples. Next, we implement a set of state-of-the-art (SOTA) few-shot learning
methods (including PET, ADAPET, LM-BFF, P-tuning and EFL), and compare their
performance with fine-tuning and zero-shot learning schemes on the newly
constructed FewCLUE benchmark. Our results show that: 1) all five few-shot
learning methods exhibit better performance than fine-tuning or zero-shot
learning; 2) among the five methods, PET is the best performing few-shot
method; 3) few-shot learning performance is highly dependent on the specific
task. Our benchmark and code are available at
https://github.com/CLUEbenchmark/FewCLUE
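Because the abstract stresses the high variance of few-shot performance and the use of multiple training/validation sets, a score on such a benchmark is naturally reported as a mean and spread over splits. The following is a minimal sketch of that aggregation, not the benchmark's own evaluation code: the function name `summarize_splits` and the accuracy values are hypothetical and are not results from the paper.

```python
from statistics import mean, stdev


def summarize_splits(accuracies):
    """Return (mean, sample standard deviation) of per-split accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two splits to estimate the spread")
    return mean(accuracies), stdev(accuracies)


# Hypothetical accuracies, one per training/validation split.
# These are illustrative numbers only, not results from the paper.
split_accuracies = [0.60, 0.70, 0.65, 0.55, 0.75]

avg, spread = summarize_splits(split_accuracies)
print(f"mean accuracy = {avg:.3f}, std across splits = {spread:.3f}")
```

Reporting the spread alongside the mean makes it easier to tell whether a gap between two methods exceeds the split-to-split noise.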
Related papers
- Few-shot learning for sentence pair classification and its applications
in software engineering [0.36832029288386137]
This work investigates the performance of alternative few-shot learning approaches with BERT-based models.
Vanilla fine-tuning, PET and SetFit are compared for numerous BERT-based checkpoints over an array of training set sizes.
Our results establish PET as a strong few-shot learning approach, and our analysis shows that with just a few hundred labeled examples it can achieve performance near that of fine-tuning on full-sized data sets.
arXiv Detail & Related papers (2023-06-13T18:23:52Z)
- BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual
Transfer [81.5984433881309]
We introduce BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format.
BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer.
Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T08:06:33Z)
- Bag of Tricks for Effective Language Model Pretraining and Downstream
Adaptation: A Case Study on GLUE [93.98660272309974]
This report briefly describes our submission Vega v1 on the General Language Understanding Evaluation leaderboard.
GLUE is a collection of nine natural language understanding tasks, including question answering, linguistic acceptability, sentiment analysis, text similarity, paraphrase detection, and natural language inference.
With our optimized pretraining and fine-tuning strategies, our 1.3 billion model sets new state-of-the-art on 4/9 tasks, achieving the best average score of 91.3.
arXiv Detail & Related papers (2023-02-18T09:26:35Z)
- READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input
Noises [87.70001456418504]
We construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises.
READIN contains four diverse tasks and requests annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input.
We experiment with a series of strong pretrained language models as well as robust training methods, and find that these models often suffer significant performance drops on READIN.
arXiv Detail & Related papers (2023-02-14T20:14:39Z)
- Multilingual Relation Classification via Efficient and Effective
Prompting [9.119073318043952]
We present the first work on prompt-based multilingual relation classification (RC).
We introduce an efficient and effective method that constructs prompts from relation triples and involves only minimal translation for the class labels.
We evaluate its performance in fully supervised, few-shot and zero-shot scenarios, and analyze its effectiveness across 14 languages.
arXiv Detail & Related papers (2022-10-25T08:40:23Z)
- PERFECT: Prompt-free and Efficient Few-shot Learning with Language
Models [67.3725459417758]
PERFECT is a simple and efficient method for few-shot fine-tuning of PLMs without relying on handcrafted prompts or verbalizers.
We show that manually engineered task prompts can be replaced with task-specific adapters that enable sample-efficient fine-tuning.
Experiments on a wide range of few-shot NLP tasks demonstrate that PERFECT, while being simple and efficient, also outperforms existing state-of-the-art few-shot learning methods.
arXiv Detail & Related papers (2022-04-03T22:31:25Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- LICHEE: Improving Language Model Pre-training with Multi-grained
Tokenization [19.89228774074371]
We propose a simple yet effective pre-training method named LICHEE to efficiently incorporate multi-grained information of input text.
Our method can be applied to various pre-trained language models and improve their representation capability.
arXiv Detail & Related papers (2021-08-02T12:08:19Z)
- Making Pre-trained Language Models Better Few-shot Learners [11.90626040104822]
The recent GPT-3 model achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context.
Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient.
We present LM-BFF--better few-shot fine-tuning of language models--a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples.
arXiv Detail & Related papers (2020-12-31T17:21:26Z)
- FewJoint: A Few-shot Learning Benchmark for Joint Language Understanding [55.38905499274026]
Few-shot learning is one of the key future steps in machine learning.
FewJoint is a novel Few-Shot Learning benchmark for NLP.
arXiv Detail & Related papers (2020-09-17T08:17:12Z)
- CLUE: A Chinese Language Understanding Evaluation Benchmark [41.86950255312653]
We introduce the first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark.
CLUE brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension.
We report scores using an exhaustive set of current state-of-the-art pre-trained Chinese models.
arXiv Detail & Related papers (2020-04-13T15:02:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.