COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
- URL: http://arxiv.org/abs/2403.18058v2
- Date: Sat, 02 Nov 2024 11:08:49 GMT
- Title: COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
- Authors: Yuelin Bai, Xinrun Du, Yiming Liang, Yonggang Jin, Junting Zhou, Ziqiang Liu, Feiteng Fang, Mingshan Chang, Tianyu Zheng, Xincheng Zhang, Nuo Ma, Zekun Wang, Ruibin Yuan, Haihong Wu, Hongquan Lin, Wenhao Huang, Jiajun Zhang, Chenghua Lin, Jie Fu, Min Yang, Shiwen Ni, Ge Zhang,
- Abstract summary: We introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world resources and undergoing rigorous human verification.
We conduct extensive experiments on COIG-CQIA, and compare them with strong baseline models and datasets.
The experimental results show that models trained on COIG-CQIA achieve highly competitive performance in diverse benchmarks.
- Score: 37.843051974342124
- License:
- Abstract: Remarkable progress on English instruction tuning has facilitated the efficacy and reliability of large language models (LLMs). However, there remains a noticeable gap in instruction tuning for Chinese, where the complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well-aligned with Chinese users' interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world resources and undergoing rigorous human verification. We conduct extensive experiments on COIG-CQIA, and compare them with strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance in diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data-mixing strategies. Our dataset are available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.
Related papers
- OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training [5.372706159579268]
The OpenCSG Chinese Corpus is a series of high-quality datasets specifically designed for Chinese language training.
This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese.
The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes.
arXiv Detail & Related papers (2025-01-14T15:22:47Z) - A Chinese Continuous Sign Language Dataset Based on Complex Environments [17.195286118443256]
We have constructed a large-scale dataset for Chinese continuous sign language (CSL) based on complex environments.
This dataset encompasses 5,988 continuous CSL video clips collected from daily life scenes.
We propose a time-frequency network (TFNet) model for continuous sign language recognition.
arXiv Detail & Related papers (2024-09-18T13:11:15Z) - What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Long language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios.
Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement.
We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent.
Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z) - Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning [1.6570772838074355]
multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA)
Recent efforts primarily focus on scaling up training datasets through data collection and synthesis.
We propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development.
arXiv Detail & Related papers (2024-07-29T17:04:34Z) - CollectiveSFT: Scaling Large Language Models for Chinese Medical Benchmark with Collective Instructions in Healthcare [12.218718086529462]
This study focuses on the Comprehensive Medical Benchmark in Chinese (CMB)
We successfully trained a smaller base model to achieve scores comparable to larger models.
By integrating a wide range of instructional content, our approach addresses potential issues such as data quality inconsistencies.
arXiv Detail & Related papers (2024-07-29T05:00:48Z) - Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language [41.40908753726324]
Diffusion models can generate realistic and diverse images, potentially facilitating data availability for data-intensive perception tasks.
We present textbfAuto textbfCherry-textbfPicker (ACP), a novel framework that generates high-quality cross-modality training samples.
arXiv Detail & Related papers (2024-06-28T17:53:18Z) - Contextualization Distillation from Large Language Model for Knowledge
Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z) - From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z) - CLEVA: Chinese Language Models EVAluation Platform [92.42981537317817]
We present CLEVA, a user-friendly platform crafted to holistically evaluate Chinese LLMs.
Our platform employs a standardized workflow to assess LLMs' performance across various dimensions, regularly updating a competitive leaderboard.
To alleviate contamination, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each leaderboard round.
Empowered by an easy-to-use interface that requires just a few mouse clicks and a model API, users can conduct a thorough evaluation with minimal coding.
arXiv Detail & Related papers (2023-08-09T09:11:31Z) - An Empirical Investigation of Commonsense Self-Supervision with
Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.