Related papers: COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values

URL: http://arxiv.org/abs/2504.05535v1
Date: Mon, 07 Apr 2025 22:15:51 GMT
Title: COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
Authors: M-A-P Team, Siwei Wu, Jincheng Ren, Xinrun Du, Shuyue Guo, Xingwei Qu, Yiming Liang, Jie Liu, Yunwen Li, Tianyu Zheng, Boyu Feng, Huaqing Yuan, Zenith Wang, Jiaheng Liu, Wenhao Huang, Chenglin Cai, Haoran Que, Jian Yang, Yuelin Bai, Zekun Moore Wang, Zhouliang Yu, Qunshu Lin, Ding Pan, Yuchen Jiang, Tiannan Wang, Wangchunshu Zhou, Shenzhi Wang, Xingyuan Bu, Minghao Liu, Guoyin Wang, Ge Zhang, Chenghua Lin
Abstract summary: We introduce COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset. It comprises 1,009k Chinese preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel, and Role. Building upon COIG-P, to reduce the overhead of using LLMs for scoring, we trained an 8B-sized Chinese Reward Model.
Score: 43.09443095372083
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Aligning large language models (LLMs) with human preferences has achieved remarkable success. However, existing Chinese preference datasets are limited by small scale, narrow domain coverage, and lack of rigorous data validation. Additionally, the reliance on human annotators for instruction and response labeling significantly constrains the scalability of human preference datasets. To address these challenges, we design an LLM-based Chinese preference dataset annotation pipeline with no human intervention. Specifically, we crawled and carefully filtered 92k high-quality Chinese queries and employed 15 mainstream LLMs to generate and score chosen-rejected response pairs. Based on it, we introduce COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset that comprises 1,009k Chinese preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel, and Role. Building upon COIG-P, to reduce the overhead of using LLMs for scoring, we trained an 8B-sized Chinese Reward Model (CRM) and meticulously constructed a Chinese Reward Benchmark (CRBench). Evaluation results based on AlignBench (Liu et al., 2024) show that COIG-P significantly outperforms other Chinese preference datasets, and it brings significant performance improvements ranging from 2% to 12% for the Qwen2/2.5 and Infinity-Instruct-3M-0625 model series, respectively. The results on CRBench demonstrate that our CRM has a strong and robust scoring ability. We apply it to filter chosen-rejected response pairs in a test split of COIG-P, and our experiments show that it is comparable to GPT-4o in identifying low-quality samples while maintaining efficiency and cost-effectiveness. Our code and data are released at https://github.com/multimodal-art-projection/COIG-P.

Related papers

DataMan: Data Manager for Pre-training Large Language Models [39.677609311769146]
Existing methods rely on limited intuition, lacking comprehensive and clear guidelines. We derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model.
arXiv Detail & Related papers (2025-02-26T18:01:19Z)
PERL: Pinyin Enhanced Rephrasing Language Model for Chinese ASR N-best Error Correction [0.0]
We propose a Pinyin Enhanced Rephrasing Language Model (PERL), specifically designed for N-best correction scenarios. We conduct experiments on the Aishell-1 dataset and our newly proposed DoAD dataset.
arXiv Detail & Related papers (2024-12-04T11:28:52Z)
KPC-cF: Aspect-Based Sentiment Analysis via Implicit-Feature Alignment with Corpus Filtering [0.0]
Our research proposes an intuitive and effective framework for ABSA in low-resource languages such as Korean. It optimizes prediction labels by integrating a translated benchmark and unlabeled Korean data. Compared to English ABSA, our framework showed an approximately 3% difference in F1 scores and accuracy.
arXiv Detail & Related papers (2024-06-29T07:01:51Z)
COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning [37.843051974342124]
We introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world resources and undergoing rigorous human verification. We conduct extensive experiments on COIG-CQIA, and compare them with strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance in diverse benchmarks.
arXiv Detail & Related papers (2024-03-26T19:24:18Z)
CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances. To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
AlignBench: Benchmarking Chinese Alignment of Large Language Models [99.24597941555277]
We introduce AlignBench, a comprehensive benchmark for evaluating Chinese Large Language Models' alignment. We design a human-in-the-loop data curation pipeline, containing eight main categories, 683 real-scenario rooted queries and corresponding human-verified references. For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge (Zheng et al., 2023) approach with Chain-of-Thought to generate explanations and final ratings.
arXiv Detail & Related papers (2023-11-30T17:41:30Z)
CLEVA: Chinese Language Models EVAluation Platform [92.42981537317817]
We present CLEVA, a user-friendly platform crafted to holistically evaluate Chinese LLMs. Our platform employs a standardized workflow to assess LLMs' performance across various dimensions, regularly updating a competitive leaderboard. To alleviate contamination, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each leaderboard round. Empowered by an easy-to-use interface that requires just a few mouse clicks and a model API, users can conduct a thorough evaluation with minimal coding.
arXiv Detail & Related papers (2023-08-09T09:11:31Z)
CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models. The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control. Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
arXiv Detail & Related papers (2023-06-28T14:14:44Z)
Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [55.95225353842118]
We construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets. We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters. Our experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN.
arXiv Detail & Related papers (2022-11-02T17:47:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
