CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs
- URL: http://arxiv.org/abs/2510.06039v1
- Date: Tue, 07 Oct 2025 15:33:52 GMT
- Title: CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs
- Authors: Chengwei Wu, Jiapu Wang, Mingyang Gao, Xingrui Zhuo, Jipeng Guo, Runlin Lei, Haoran Luo, Tianyu Chen, Haoyi Zhou, Shirui Pan, Zechao Li,
- Abstract summary: We present a comprehensive benchmark for evaluating Chinese Large Language Models (CB-ECLLM)<n>CB-ECLLM is based on the newly constructed Chinese Data-Text Pair (CDTP) dataset.<n>CDTP comprises over 7 million aligned text pairs, each consisting of unstructured text coupled with one or more corresponding triples, alongside a total of 15 million triples spanning four critical domains.
- Score: 71.01843542502438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks. However, Chinese LLMs face unique challenges, primarily due to the dominance of unstructured free text and the lack of structured representations in Chinese corpora. While existing benchmarks for LLMs partially assess Chinese LLMs, they are still predominantly English-centric and fail to address the unique linguistic characteristics of Chinese, lacking structured datasets essential for robust evaluation. To address these challenges, we present a Comprehensive Benchmark for Evaluating Chinese Large Language Models (CB-ECLLM) based on the newly constructed Chinese Data-Text Pair (CDTP) dataset. Specifically, CDTP comprises over 7 million aligned text pairs, each consisting of unstructured text coupled with one or more corresponding triples, alongside a total of 15 million triples spanning four critical domains. The core contributions of CDTP are threefold: (i) enriching Chinese corpora with high-quality structured information; (ii) enabling fine-grained evaluation tailored to knowledge-driven tasks; and (iii) supporting multi-task fine-tuning to assess generalization and robustness across scenarios, including Knowledge Graph Completion, Triple-to-Text generation, and Question Answering. Furthermore, we conduct rigorous evaluations through extensive experiments and ablation studies to assess the effectiveness, Supervised Fine-Tuning (SFT), and robustness of the benchmark. To support reproducible research, we offer an open-source codebase and outline potential directions for future investigations based on our insights.
Related papers
- COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes [83.84578306665976]
Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts.<n>We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes.
arXiv Detail & Related papers (2025-10-16T15:01:19Z) - Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective [53.594353527056775]
We propose Chinese Commonsense Multi-hop Reasoning ( CCMOR) to evaluate Large Language Models (LLMs)<n> CCMOR is designed to evaluate LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning.<n>We implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions.
arXiv Detail & Related papers (2025-10-09T20:29:00Z) - M3TQA: Massively Multilingual Multitask Table Question Answering [39.99483693397598]
m3TQA-Instruct is a large-scale benchmark spanning 97 languages across diverse language families.<n>We construct m3TQA by curating 50 real-world tables in Chinese and English, then applying a robust six-step translation pipeline powered by DeepSeek and GPT-4o.<n>The benchmark includes 2,916 professionally annotated question-answering pairs across four tasks designed to evaluate nuanced table reasoning capabilities.
arXiv Detail & Related papers (2025-08-22T09:57:40Z) - No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem [8.435879948625105]
We examine how cultural norms, research environments, and institutional practices shape dataset availability and quality.<n>Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis on Japanese collections.<n>We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.
arXiv Detail & Related papers (2025-07-06T10:32:32Z) - MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application [118.63802040274999]
MultiFinBen is the first expert-annotated multilingual (five languages) and multimodal benchmark for evaluating LLMs in realistic financial contexts.<n>Financial reasoning tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents.<n> evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings.
arXiv Detail & Related papers (2025-06-16T22:01:49Z) - Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation [45.551223552275424]
Vision-Language Translation is a challenging task that requires accurately recognizing multilingual text embedded in images.<n>We present a comprehensive study of VLT from three key perspectives: data quality, model architecture, and evaluation metrics.
arXiv Detail & Related papers (2025-06-13T14:23:38Z) - Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation [20.87296508045343]
We introduce Fuxi, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks.<n>We reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks.<n>Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development.
arXiv Detail & Related papers (2025-03-20T04:26:40Z) - CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models [36.82189550072201]
Existing text-to-table datasets are typically oriented English.
Large language models (LLMs) have shown great success as general task solvers in multi-lingual settings.
We propose a Chinese text-to-table dataset, CT-Eval, to benchmark LLMs on this task.
arXiv Detail & Related papers (2024-05-20T16:58:02Z) - COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning [37.843051974342124]
We introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world resources and undergoing rigorous human verification.
We conduct extensive experiments on COIG-CQIA, and compare them with strong baseline models and datasets.
The experimental results show that models trained on COIG-CQIA achieve highly competitive performance in diverse benchmarks.
arXiv Detail & Related papers (2024-03-26T19:24:18Z) - CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z) - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.