CSL: A Large-scale Chinese Scientific Literature Dataset
- URL: http://arxiv.org/abs/2209.05034v1
- Date: Mon, 12 Sep 2022 06:10:47 GMT
- Title: CSL: A Large-scale Chinese Scientific Literature Dataset
- Authors: Yudong Li, Yuqing Zhang, Zhe Zhao, Linlin Shen, Weijie Liu, Weiquan
Mao, and Hui Zhang
- Abstract summary: We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396k papers.
To our knowledge, CSL is the first scientific document dataset in Chinese. Its semi-structured data also provides natural annotations that can support many supervised NLP tasks.
We present a benchmark to evaluate the performance of models across scientific domain tasks, i.e., summarization, keyword generation and text classification.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Scientific literature serves as a high-quality corpus, supporting a
wide range of Natural Language Processing (NLP) research. However, existing
datasets are centered on the English language, which restricts the development
of Chinese scientific NLP. In this work, we present CSL, a large-scale Chinese
Scientific Literature dataset, which contains the titles, abstracts, keywords
and academic fields of 396k papers. To our knowledge, CSL is the first
scientific document dataset in Chinese. CSL can serve as a Chinese corpus.
Also, its semi-structured data provides natural annotations that can support
many supervised NLP tasks. Based on CSL, we present a benchmark to evaluate the
performance of models across scientific domain tasks, i.e., summarization,
keyword generation and text classification. We analyze the behavior of existing
text-to-text models on the evaluation tasks and reveal the challenges for
Chinese scientific NLP tasks, which provides a valuable reference for future
research. Data and code are available at https://github.com/ydli-ai/CSL
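The abstract notes that CSL's semi-structured records (title, abstract, keywords, academic field) naturally define the three benchmark tasks. A minimal sketch of that mapping is below; the field names (`title`, `abstract`, `keywords`, `field`) are illustrative assumptions, not the dataset's exact schema.

```python
def make_task_examples(record):
    """Map one CSL-style paper record to (input, target) pairs per task."""
    return {
        # Summarization: generate the title from the abstract.
        "summarization": (record["abstract"], record["title"]),
        # Keyword generation: produce the author keywords from the abstract.
        "keyword_generation": (record["abstract"], ", ".join(record["keywords"])),
        # Text classification: predict the academic field from title + abstract.
        "classification": (record["title"] + "\n" + record["abstract"], record["field"]),
    }

# Toy record for illustration only; see the CSL repository for the real data format.
paper = {
    "title": "A Study of ...",
    "abstract": "This paper investigates ...",
    "keywords": ["deep learning", "text classification"],
    "field": "Computer Science",
}
examples = make_task_examples(paper)
```

Because the supervision falls out of the record structure itself, no manual labeling is needed to build these training pairs.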
Related papers
- MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference [65.37685198688538]
This paper presents MSciNLI, a dataset containing 132,320 sentence pairs extracted from five new scientific domains.
We establish strong baselines on MSciNLI by fine-tuning Pre-trained Language Models (PLMs) and prompting Large Language Models (LLMs)
We show that domain shift degrades the performance of scientific NLI models which demonstrates the diverse characteristics of different domains in our dataset.
arXiv Detail & Related papers (2024-04-11T18:12:12Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (CLS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z)
- SciNLI: A Corpus for Natural Language Inference on Scientific Text [47.293189105900524]
We introduce SciNLI, a large dataset for NLI that captures the formality in scientific text.
Our best performing model with XLNet achieves a Macro F1 score of only 78.18% and an accuracy of 78.23%.
arXiv Detail & Related papers (2022-03-13T18:23:37Z)
- Improving Sign Language Translation with Monolingual Data by Sign Back-Translation [105.83166521438463]
We propose a sign back-translation (SignBT) approach, which incorporates massive spoken language texts into sign training.
With a text-to-gloss translation model, we first back-translate the monolingual text to its gloss sequence.
Then, the paired sign sequence is generated by splicing pieces from an estimated gloss-to-sign bank at the feature level.
arXiv Detail & Related papers (2021-05-26T08:49:30Z)
- Automatic coding of students' writing via Contrastive Representation Learning in the Wasserstein space [6.884245063902909]
This work is a step towards building a statistical machine learning (ML) method for supporting qualitative analyses of students' writing.
We show that the ML algorithm approached the inter-rater reliability of human analysis.
arXiv Detail & Related papers (2020-11-26T16:52:48Z)
- OCNLI: Original Chinese Natural Language Inference [21.540733910984006]
We present the Original Chinese Natural Language Inference dataset (OCNLI), the first large-scale NLI dataset for Chinese, consisting of 56,000 annotated sentence pairs.
Unlike recent attempts at extending NLI to other languages, our dataset does not rely on any automatic translation or non-expert annotation.
We establish several baseline results on our dataset using state-of-the-art pre-trained models for Chinese, and find even the best performing models to be far outpaced by human performance.
arXiv Detail & Related papers (2020-10-12T04:25:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.