Code-Based English Models Surprising Performance on Chinese QA Pair Extraction Task
- URL: http://arxiv.org/abs/2401.10286v3
- Date: Mon, 11 Mar 2024 01:23:47 GMT
- Title: Code-Based English Models Surprising Performance on Chinese QA Pair Extraction Task
- Authors: Linghan Zheng, Hui Liu, Xiaojun Lin, Jiayuan Dong, Yue Sheng, Gang Shi, Zhiwei Liu, Hongwei Chen
- Abstract summary: Code-based models consistently perform better than text-based models in reasoning-intensive scenarios.
Code-based models containing a certain amount of Chinese data achieve even better performance.
The capabilities of code-based English models on specific Chinese tasks offer a distinct perspective for discussing the philosophical "Chinese Room" thought experiment.
- Score: 17.117337927315315
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In previous studies, code-based models have consistently outperformed text-based models in reasoning-intensive scenarios. When generating our knowledge base for Retrieval-Augmented Generation (RAG), we observed that code-based models also perform exceptionally well on the Chinese QA pair extraction task. Further, our experiments, together with the metrics we designed, revealed that code-based models containing a certain amount of Chinese data achieve even better performance. Additionally, the capabilities of code-based English models on specific Chinese tasks offer a distinct perspective for the discussion of the philosophical "Chinese Room" thought experiment.
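As a concrete illustration of the task described in the abstract, the following minimal sketch shows how a code-oriented model might be prompted to emit Chinese QA pairs as structured JSON for a RAG knowledge base. The prompt wording, the `generate` callable, and the output schema are assumptions for illustration, not the authors' actual pipeline.

```python
import json
from typing import Callable

# Hypothetical prompt: ask a code-oriented model to emit QA pairs as JSON,
# playing to its strength in producing syntactically valid structured output.
PROMPT_TEMPLATE = (
    "Read the following Chinese document and extract question-answer pairs.\n"
    'Return ONLY a JSON array of objects with keys "question" and "answer".\n'
    "Document:\n{document}\n"
)

def extract_qa_pairs(document: str, generate: Callable[[str], str]) -> list[dict]:
    """Extract QA pairs from one document using a caller-supplied LLM call.

    `generate` wraps whatever code-based model is in use; it takes a prompt
    string and returns the model's raw completion.
    """
    raw = generate(PROMPT_TEMPLATE.format(document=document))
    try:
        pairs = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed completions are dropped from the knowledge base
    # Keep only well-formed pairs with non-empty fields.
    return [
        p for p in pairs
        if isinstance(p, dict) and p.get("question") and p.get("answer")
    ]
```

In a RAG setting, the pairs that survive this filter would then be embedded and indexed; the JSON-only output constraint is one plausible reason code-tuned models do well here, since they are trained to emit well-formed structured text.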
Related papers
- WenyanGPT: A Large Language Model for Classical Chinese Tasks [36.380841559581945]
Existing natural language processing models primarily optimize for Modern Chinese, resulting in inadequate performance on Classical Chinese.
Through continued pre-training and instruction fine-tuning of the LLaMA3-8B-Chinese model, we construct WenyanGPT, a large language model specifically designed for Classical Chinese tasks.
arXiv Detail & Related papers (2025-04-29T10:19:05Z)
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from the massive pool of existing ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
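The quality-control step summarized above can be approximated with a simple execution-based filter; the sketch below is a stand-in for SIaM's learned critic model, with all names and the numeric-tolerance check assumed for illustration.

```python
# Execution-based filter in the spirit of SIaM's quality-control step:
# run a candidate solution and keep the (question, code) pair only if the
# code's result matches the reference answer. A learned critic model would
# replace this rule-based check; everything here is an illustrative sketch.

def solution_passes(code: str, reference_answer: float, tol: float = 1e-6) -> bool:
    """Run candidate code that must define `answer`; compare to the reference."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # NOTE: sandbox this in any real pipeline
        return abs(float(namespace["answer"]) - reference_answer) <= tol
    except Exception:
        return False  # crashes, missing `answer`, non-numeric results all fail

candidate = "answer = sum(range(1, 101))"  # solves "sum of 1..100"
print(solution_passes(candidate, 5050.0))  # True -> keep this training pair
```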
- Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language [41.40908753726324]
Diffusion-based models have shown great potential in generating high-quality images with various layouts.
We present Auto Cherry-Picker, a novel framework that generates high-quality multi-modal training examples.
In particular, we present a new metric, Composite Layout and Image Score (CLIS), to evaluate the generated images fairly.
arXiv Detail & Related papers (2024-06-28T17:53:18Z)
- A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks [30.54635848057259]
This paper conducts a comprehensive evaluation of well-known and high-performing large language models (LLMs).
We select English and Chinese datasets encompassing Dialogue Generation and Text Summarization.
Our study reports automatic results, accompanied by a detailed analysis.
arXiv Detail & Related papers (2024-05-16T16:56:54Z)
- CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model [58.127534002232096]
This paper introduces CodeFuse-13B, an open-source pre-trained code LLM.
It is specifically designed for code-related tasks with both English and Chinese prompts.
CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset.
arXiv Detail & Related papers (2023-10-10T02:38:44Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
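The summary above does not specify how L2CEval measures confidence calibration; expected calibration error (ECE) is one standard choice, sketched below under that assumption.

```python
# Expected calibration error (ECE): bin predictions by confidence and compare
# each bin's average confidence to its accuracy. Whether L2CEval uses exactly
# this metric is an assumption; the sketch only illustrates the idea.
def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [i for i, c in enumerate(confidences)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += len(in_bin) / n * abs(avg_conf - accuracy)
    return ece

# Toy example: an overconfident model scores a high (bad) ECE.
print(expected_calibration_error([0.9, 0.8, 0.95, 0.7], [True, False, True, False]))
```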
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
- Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation [12.86275938443485]
We examine the influence of training data factors, including quantity, quality, and linguistic distribution, on model performance.
We assess various models using an evaluation set of 1,000 samples, encompassing nine real-world scenarios.
We extend the vocabulary of LLaMA, the open-source model whose performance is closest to that of proprietary language models such as GPT-3.
arXiv Detail & Related papers (2023-04-16T18:37:39Z)
- LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation [49.57366550980932]
Long text modeling requires many capabilities such as modeling long-range commonsense and discourse relations.
We propose LOT, a benchmark including two understanding and two generation tasks for Chinese long text modeling evaluation.
We release an encoder-decoder Chinese long text pretraining model named LongLM with up to 1 billion parameters.
arXiv Detail & Related papers (2021-08-30T02:38:32Z)
- Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors [19.653423881863834]
Machine translation models are employed to "translate" code snippets into relevant natural language descriptions.
Most evaluations of such models are conducted using automatic reference-based metrics.
We compare three recently proposed source code summarization models using the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics.
Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an empirically derived error taxonomy.
arXiv Detail & Related papers (2021-06-15T20:13:14Z)
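The smoothed BLEU-4 metric named above can be computed with NLTK; the sketch below assumes smoothing method 4, since the summary does not state which variant the study used.

```python
# Sentence-level smoothed BLEU-4 for a generated code comment, using NLTK.
# The choice of SmoothingFunction().method4 is an assumption; it is a common
# pick for short hypotheses like one-line code summaries.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "returns the maximum value in the list".split()
hypothesis = "return the max value of the list".split()

score = sentence_bleu(
    [reference],                        # list of reference token sequences
    hypothesis,
    weights=(0.25, 0.25, 0.25, 0.25),   # BLEU-4: uniform 1- to 4-gram weights
    smoothing_function=SmoothingFunction().method4,
)
print(f"smoothed BLEU-4: {score:.3f}")
```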
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences arising from its use.