An Empirical Study on Large Language Models in Accuracy and Robustness
under Chinese Industrial Scenarios
- URL: http://arxiv.org/abs/2402.01723v1
- Date: Sat, 27 Jan 2024 03:37:55 GMT
- Title: An Empirical Study on Large Language Models in Accuracy and Robustness
under Chinese Industrial Scenarios
- Authors: Zongjie Li, Wenying Qiu, Pingchuan Ma, Yichen Li, You Li, Sijia He,
Baozheng Jiang, Shuai Wang, Weixi Gu
- Abstract summary: One of the key future applications of large language models (LLMs) will be practical deployment in industrial production.
We present a comprehensive empirical study on the accuracy and robustness of LLMs in the context of Chinese industrial production.
- Score: 14.335979063157522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed the rapid development of large language models
(LLMs) in various domains. To better serve the large number of Chinese users,
many commercial vendors in China have adopted localization strategies, training
and providing local LLMs specifically customized for Chinese users.
Furthermore, looking ahead, one of the key future applications of LLMs will be
practical deployment in industrial production by enterprises and users in those
sectors. However, the accuracy and robustness of LLMs in industrial scenarios
have not been well studied. In this paper, we present a comprehensive empirical
study on the accuracy and robustness of LLMs in the context of Chinese
industrial production. We manually collected 1,200 domain-specific
problems from 8 different industrial sectors to evaluate LLM accuracy.
Furthermore, we designed a metamorphic testing framework containing four
industrial-specific stability categories with eight abilities, totaling 13,631
questions with variants to evaluate LLM robustness. In total, we evaluated 9
different LLMs developed by Chinese vendors, as well as 4 different LLMs
developed by global vendors. Our major findings include: (1) Current LLMs
exhibit low accuracy in Chinese industrial contexts, with all LLMs scoring less
than 0.6. (2) The robustness scores vary across industrial sectors, and local
LLMs overall perform worse than global ones. (3) LLM robustness differs
significantly across abilities. Global LLMs are more robust under
logical-related variants, while advanced local LLMs perform better on problems
related to understanding Chinese industrial terminology. Our study results
provide valuable guidance for understanding and promoting the industrial domain
capabilities of LLMs from both development and industrial enterprise
perspectives. The results further motivate possible research directions and
tooling support.
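As a rough illustration of the metamorphic-testing idea described above, the sketch below perturbs a question with answer-preserving transformations and scores how often a model's answer survives them. The two transformations and the query_llm callable are hypothetical placeholders for illustration only; they are not the four stability categories or eight abilities defined in the paper.

```python
# Minimal sketch of metamorphic robustness testing for LLM answers.
# The transformations and the query_llm() callable are hypothetical
# placeholders, not the paper's actual framework.
import random

def add_irrelevant_context(question: str) -> str:
    """Prepend an irrelevant sentence; the correct answer should not change."""
    return "Note: the plant operates in two shifts per day. " + question

def shuffle_options(options: list[str]) -> list[str]:
    """Reorder multiple-choice options; a robust model tracks content, not position."""
    shuffled = options[:]
    random.shuffle(shuffled)
    return shuffled

def robustness_score(question, options, gold_answer, query_llm, n_variants=5):
    """Fraction of perturbed variants on which the model still answers correctly."""
    hits = 0
    for _ in range(n_variants):
        q_var = add_irrelevant_context(question)
        opts_var = shuffle_options(options)
        prediction = query_llm(q_var, opts_var)  # assumed to return the chosen option text
        hits += int(prediction == gold_answer)
    return hits / n_variants
```

The design choice follows metamorphic testing in general: since the ground-truth answer is invariant under these perturbations, any change in the model's answer can be attributed to a lack of robustness rather than to missing knowledge.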
Related papers
- MMAD: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection [66.05200339481115]
We present MMAD, the first-ever full-spectrum MLLM benchmark for industrial anomaly detection.
We defined seven key subtasks of MLLMs in industrial inspection and designed a novel pipeline to generate the MMAD dataset.
With MMAD, we have conducted a comprehensive, quantitative evaluation of various state-of-the-art MLLMs.
arXiv Detail & Related papers (2024-10-12T09:16:09Z)
- OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety [37.07970624135514]
OpenEval is an evaluation testbed that benchmarks Chinese LLMs across capability, alignment and safety.
For capability assessment, we include 12 benchmark datasets to evaluate Chinese LLMs from 4 sub-dimensions: NLP tasks, disciplinary knowledge, commonsense reasoning and mathematical reasoning.
For alignment assessment, OpenEval contains 7 datasets that examine bias, offensiveness and illegality in the outputs of Chinese LLMs.
arXiv Detail & Related papers (2024-03-18T23:21:37Z)
- CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation abilities by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- Benchmarking LLMs via Uncertainty Quantification [91.72588235407379]
The proliferation of open-source Large Language Models (LLMs) has highlighted the urgent need for comprehensive evaluation methods.
We introduce a new benchmarking approach for LLMs that integrates uncertainty quantification.
Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs.
arXiv Detail & Related papers (2024-01-23T14:29:17Z)
- Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs).
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z)
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) show excellent performance and have wide practical uses.
Existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios.
We summarize 4 core competencies of LLMs: reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect the corresponding ability, while new tasks can also be easily added into the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z)
- Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts, which includes 100k augmented prompts and responses generated by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)