Related papers: COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning

COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning

URL: http://arxiv.org/abs/2403.18058v1
Date: Tue, 26 Mar 2024 19:24:18 GMT
Title: COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
Authors: Yuelin Bai, Xinrun Du, Yiming Liang, Yonggang Jin, Ziqiang Liu, Junting Zhou, Tianyu Zheng, Xincheng Zhang, Nuo Ma, Zekun Wang, Ruibin Yuan, Haihong Wu, Hongquan Lin, Wenhao Huang, Jiajun Zhang, Wenhu Chen, Chenghua Lin, Jie Fu, Min Yang, Shiwen Ni, Ge Zhang,
Abstract summary: We introduce COIG-CQIA, a high-quality Chinese instruction tuning dataset. Our aim is to build a diverse, wide-ranging instruction-tuning dataset to better align model behavior with human interactions. We train models of various scales on different subsets of CQIA, following in-depth evaluation and analyses.
Score: 57.600941792026006
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, there have been significant advancements in large language models (LLMs), particularly focused on the English language. These advancements have enabled these LLMs to understand and execute complex instructions with unprecedented accuracy and fluency. However, despite these advancements, there remains a noticeable gap in the development of Chinese instruction tuning. The unique linguistic features and cultural depth of the Chinese language pose challenges for instruction tuning tasks. Existing datasets are either derived from English-centric LLMs or are ill-suited for aligning with the interaction patterns of real-world Chinese users. To bridge this gap, we introduce COIG-CQIA, a high-quality Chinese instruction tuning dataset. Our aim is to build a diverse, wide-ranging instruction-tuning dataset to better align model behavior with human interactions. To this end, we collect a high-quality human-written corpus from various sources on the Chinese Internet, including Q&A communities, Wikis, examinations, and existing NLP datasets. This corpus was rigorously filtered and carefully processed to form the COIG-CQIA dataset. Furthermore, we train models of various scales on different subsets of CQIA, following in-depth evaluation and analyses. The findings from our experiments offer valuable insights for selecting and developing Chinese instruction-tuning datasets. We also find that models trained on CQIA-Subset achieve competitive results in human assessment as well as knowledge and security benchmarks. Data are available at https://huggingface.co/datasets/m-a-p/COIG-CQIA

Related papers

Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging [4.473623071673054]
We investigate the impact of training data composition on Cross-Lingual Information Retrieval ( CLIR) and Mono-Lingual Information Retrieval (IR) performance.<n>Our experiments reveal that the language composition of training data significantly influences IR performance, exhibiting important inter-lingual correlations.<n>Our work demonstrates that Model Merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-Lingual IR capabilities.
arXiv Detail & Related papers (2025-07-11T10:44:09Z)
Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance [38.362162910767466]
We conduct the first comprehensive analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk.<n>We derive statistics that reveal structural and qualitative similarities and differences between the two datasets.<n>Our findings offer actionable insights for constructing more effective post-training datasets.
arXiv Detail & Related papers (2025-06-06T20:34:06Z)
CLIMB: Class-imbalanced Learning Benchmark on Tabular Data [68.07599497425267]
Class-imbalanced learning (CIL) is important in many real-world applications where the minority class holds the critical but rare outcomes.<n>In this paper, we present CLIMB, a comprehensive benchmark for class-imbalanced learning.<n> CLIMB includes 73 real-world datasets across diverse domains and imbalance levels, along with unified implementations of 29 representative CIL algorithms.
arXiv Detail & Related papers (2025-05-23T04:21:03Z)
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values [43.09443095372083]
We introduce COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset. It comprises 1,009k Chinese preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel, and Role. Building upon COIG-P, to reduce the overhead of using LLMs for scoring, we trained a 8B-sized Chinese Reward Model.
arXiv Detail & Related papers (2025-04-07T22:15:51Z)
CNsum:Automatic Summarization for Chinese News Text [7.181538768266782]
This paper proposes a Chinese news text summarization model (CNsum) based on Transformer structure. The results of the conducted experiments show that CNsum achieves better ROUGE score than the baseline models.
arXiv Detail & Related papers (2025-02-27T03:25:34Z)
OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training [5.372706159579268]
The OpenCSG Chinese Corpus is a series of high-quality datasets specifically designed for Chinese language training. This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese. The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes.
arXiv Detail & Related papers (2025-01-14T15:22:47Z)
A Chinese Continuous Sign Language Dataset Based on Complex Environments [17.195286118443256]
We have constructed a large-scale dataset for Chinese continuous sign language (CSL) based on complex environments. This dataset encompasses 5,988 continuous CSL video clips collected from daily life scenes. We propose a time-frequency network (TFNet) model for continuous sign language recognition.
arXiv Detail & Related papers (2024-09-18T13:11:15Z)
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Long language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z)
Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning [1.6570772838074355]
multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA) Recent efforts primarily focus on scaling up training datasets through data collection and synthesis. We propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development.
arXiv Detail & Related papers (2024-07-29T17:04:34Z)
CollectiveSFT: Scaling Large Language Models for Chinese Medical Benchmark with Collective Instructions in Healthcare [12.218718086529462]
This study focuses on the Comprehensive Medical Benchmark in Chinese (CMB) We successfully trained a smaller base model to achieve scores comparable to larger models. By integrating a wide range of instructional content, our approach addresses potential issues such as data quality inconsistencies.
arXiv Detail & Related papers (2024-07-29T05:00:48Z)
Research on Information Extraction of LCSTS Dataset Based on an Improved BERTSum-LSTM Model [3.942479021508835]
This paper studies the information extraction method of the LCSTS dataset based on an improved BERTSum-LSTM model. We improve the BERTSum-LSTM model to make it perform better in generating Chinese news summaries.
arXiv Detail & Related papers (2024-06-26T14:04:15Z)
Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks. Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments. Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z)
Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation [30.053409671898933]
Kun is a novel approach for creating high-quality instruction-tuning datasets for large language models (LLMs) without relying on manual annotations. We leverage unlabelled data from diverse sources such as Wudao, Wanjuan, and SkyPile to generate a substantial dataset of over a million Chinese instructional data points.
arXiv Detail & Related papers (2024-01-12T09:56:57Z)
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
CLEVA: Chinese Language Models EVAluation Platform [92.42981537317817]
We present CLEVA, a user-friendly platform crafted to holistically evaluate Chinese LLMs. Our platform employs a standardized workflow to assess LLMs' performance across various dimensions, regularly updating a competitive leaderboard. To alleviate contamination, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each leaderboard round. Empowered by an easy-to-use interface that requires just a few mouse clicks and a model API, users can conduct a thorough evaluation with minimal coding.
arXiv Detail & Related papers (2023-08-09T09:11:31Z)
An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models. We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.