NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism
- URL: http://arxiv.org/abs/2403.00862v4
- Date: Tue, 4 Jun 2024 14:50:58 GMT
- Title: NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism
- Authors: Miao Li, Ming-Bin Chen, Bo Tang, Shengbin Hou, Pengyu Wang, Haiying Deng, Zhiyu Li, Feiyu Xiong, Keming Mao, Peng Cheng, Yi Luo
- Abstract summary: We present NewsBench, a novel evaluation framework to systematically assess the editorial capabilities of Large Language Models (LLMs) in Chinese journalism.
Our constructed benchmark dataset is focused on four facets of writing proficiency and six facets of safety adherence.
We propose different GPT-4 based automatic evaluation protocols to assess LLM generations for short answer questions in terms of writing proficiency and safety adherence.
- Score: 28.443004656952343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present NewsBench, a novel evaluation framework to systematically assess the editorial capabilities of Large Language Models (LLMs) in Chinese journalism. Our benchmark dataset focuses on four facets of writing proficiency and six facets of safety adherence, and it comprises 1,267 carefully and manually designed test samples, in the form of multiple-choice and short-answer questions, covering five editorial tasks across 24 news domains. To measure performance, we propose GPT-4 based automatic evaluation protocols that assess LLM generations for short-answer questions in terms of writing proficiency and safety adherence; both protocols are validated by high correlations with human evaluations. Based on this systematic evaluation framework, we conduct a comprehensive analysis of ten popular LLMs that can handle Chinese. The experimental results highlight GPT-4 and ERNIE Bot as top performers, yet reveal a relative deficiency in journalistic safety adherence on creative writing tasks. Our findings also underscore the need for enhanced ethical guidance in machine-generated journalistic content, marking a step forward in aligning LLMs with journalistic standards and safety considerations.
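The abstract does not spell out the judging protocol itself, so the following is only a minimal sketch of what a GPT-4-based evaluation of a short-answer generation might look like. The facet names, the 0-2 rating scale, and the prompt wording are illustrative assumptions rather than the paper's actual rubric, and the sketch assumes the official openai Python client.

```python
# Illustrative LLM-as-judge sketch; facets, scale, and prompts are assumed,
# not taken from the NewsBench paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical facet names standing in for the paper's four writing facets.
WRITING_FACETS = ["fluency", "coherence", "style", "instruction fulfillment"]

def judge_facet(task: str, answer: str, facet: str) -> int:
    """Ask GPT-4 to rate one facet of a generated answer on a 0-2 scale."""
    prompt = (
        "You are grading a Chinese journalistic writing task.\n"
        f"Task: {task}\nAnswer: {answer}\n"
        f"Rate the answer's {facet} as 0, 1, or 2. Reply with the number only."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

def writing_score(task: str, answer: str) -> float:
    """Average the per-facet ratings into one writing-proficiency score."""
    return sum(judge_facet(task, answer, f) for f in WRITING_FACETS) / len(WRITING_FACETS)
```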
Related papers
- Measuring Large Language Models Capacity to Annotate Journalistic Sourcing [11.22185665245128]
This paper lays out a scenario to evaluate Large Language Models on identifying and annotating sourcing in news stories.
Our accuracy findings indicate that LLM-based approaches have more catching up to do in identifying all the sourced statements in a story and, equally, in matching the types of sources.
arXiv Detail & Related papers (2024-12-30T22:15:57Z) - MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z) - Evaluating AI-Generated Essays with GRE Analytical Writing Assessment [15.993966092824335]
This study examines essays generated by ten leading LLMs for the analytical writing assessment of the Graduate Record Exam (GRE).
We assessed these essays using both human raters and the e-rater automated scoring engine as used in the GRE scoring pipeline.
The top-performing Gemini and GPT-4o received average scores of 4.78 and 4.67, respectively.
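Since the study scores essays with both human raters and an automated engine, a standard way to quantify their agreement is quadratic weighted kappa; a toy sketch follows. The score lists are invented placeholders rather than the study's data, and scikit-learn is an assumed dependency.

```python
# Toy illustration of human/automated-rater agreement; the ratings below
# are invented placeholders, not data from the study.
from sklearn.metrics import cohen_kappa_score

human_scores = [5, 4, 4, 5, 3, 4]    # hypothetical human ratings (0-6 GRE scale)
e_rater_scores = [5, 4, 5, 5, 3, 4]  # hypothetical automated ratings, same essays

kappa = cohen_kappa_score(human_scores, e_rater_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {kappa:.2f}")
```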
arXiv Detail & Related papers (2024-10-22T21:30:58Z) - INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness [110.6921470281479]
We introduce INDICT: a new framework that empowers large language models with Internal Dialogues of Critiques for both safety and helpfulness guidance.
The internal dialogue is a dual cooperative system between a safety-driven critic and a helpfulness-driven critic.
We observed that our approach can provide advanced critiques covering both safety and helpfulness, significantly improving the quality of the generated code.
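This summary does not give INDICT's exact prompts or orchestration; below is a minimal sketch of one plausible dual-critic revision loop, assuming the openai Python client, GPT-4 standing in for all three roles, and an arbitrary fixed round count.

```python
# Minimal sketch of a dual-critic revision loop in the spirit of INDICT.
# Prompts, model choice, and the fixed round count are assumptions, not
# the paper's actual method.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(system_prompt: str, content: str) -> str:
    """One chat turn with a role-specific system prompt."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": content},
        ],
    )
    return resp.choices[0].message.content

def generate_with_critics(task: str, rounds: int = 2) -> str:
    """Draft code, then alternate safety/helpfulness critiques and revisions."""
    code = ask("You are a careful programmer.", task)
    for _ in range(rounds):
        safety = ask("You are a security reviewer. Point out vulnerabilities.", code)
        helpful = ask("You are a code reviewer. Point out correctness and usability issues.", code)
        code = ask(
            "You are a careful programmer. Revise the code to address both critiques.",
            f"Task: {task}\n\nCode:\n{code}\n\nSafety critique:\n{safety}\n\nHelpfulness critique:\n{helpful}",
        )
    return code
```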
arXiv Detail & Related papers (2024-06-23T15:55:07Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, the linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models [39.97454990633856]
We present MLLMGuard, a multidimensional safety evaluation suite for MLLMs.
It includes a bilingual image-text evaluation dataset, inference utilities, and a lightweight evaluator.
Our evaluation results across 13 advanced models indicate that MLLMs still have a substantial journey ahead before they can be considered safe and responsible.
arXiv Detail & Related papers (2024-06-11T13:41:33Z) - Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers [25.268709339109893]
We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories.
We work directly with authors to ensure that the stories have not been shared online (and are therefore unseen by the models).
We compare GPT-4, Claude-2.1, and Llama-2-70B and find that all three models make faithfulness mistakes in over 50% of summaries.
arXiv Detail & Related papers (2024-03-02T01:52:14Z) - CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility [62.74405775089802]
We present CValues, the first Chinese human values evaluation benchmark to measure the alignment ability of LLMs.
To this end, we manually collected adversarial safety prompts across 10 scenarios and induced responsibility prompts from 8 domains.
Our findings suggest that while most Chinese LLMs perform well in terms of safety, there is considerable room for improvement in terms of responsibility.
arXiv Detail & Related papers (2023-07-19T01:22:40Z) - Benchmarking Foundation Models with Language-Model-as-an-Examiner [47.345760054595246]
We propose a novel benchmarking framework, Language-Model-as-an-Examiner.
The LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner.
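As a rough illustration of the examiner idea (one model authoring a question and then grading answers without a gold reference), here is a hedged sketch; the prompts, the 1-5 scale, and the use of GPT-4 via the openai client are assumptions, not the paper's setup.

```python
# Hedged sketch of a reference-free examiner loop; prompts and the 1-5
# scale are assumptions, not the paper's setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def examiner(prompt: str) -> str:
    """One deterministic turn with the examiner model."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# 1) The examiner authors a question from its own knowledge.
question = examiner("Pose one factual question that you can grade reliably.")

# 2) The model under test answers (stubbed here as a placeholder).
candidate_answer = "..."  # response from the model being benchmarked

# 3) The examiner grades the answer without any gold reference.
verdict = examiner(
    f"Question: {question}\nAnswer: {candidate_answer}\n"
    "Using only your own knowledge, grade the answer from 1 (wrong) to 5 "
    "(fully correct). Reply with the number only."
)
print(verdict)
```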
arXiv Detail & Related papers (2023-06-07T06:29:58Z) - Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts, which includes 100k augmented prompts and responses generated by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)