CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care
- URL: http://arxiv.org/abs/2307.01458v4
- Date: Thu, 26 Oct 2023 07:19:58 GMT
- Title: CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care
- Authors: Tong Xiang, Liangzhi Li, Wangyue Li, Mingbai Bai, Lu Wei, Bowen Wang,
Noa Garcia
- Abstract summary: We present a benchmark, CARE-MI, for evaluating misinformation in large language models (LLMs).
Our proposed benchmark fills the gap between the extensive usage of LLMs and the lack of datasets for assessing the misinformation generated by these models.
Using our benchmark, we conduct extensive experiments and find that current Chinese LLMs are far from perfect on the topic of maternity and infant care.
- Score: 14.326936563564171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent advances in natural language processing (NLP) have led to a new
trend of applying large language models (LLMs) to real-world scenarios. While
the latest LLMs are astonishingly fluent when interacting with humans, they
suffer from the misinformation problem by unintentionally generating factually
false statements. This can lead to harmful consequences, especially when
produced within sensitive contexts, such as healthcare. Yet few previous works
have focused on evaluating misinformation in the long-form (LF) generation of
LLMs, especially for knowledge-intensive topics. Moreover, although LLMs have
been shown to perform well in different languages, misinformation evaluation
has been mostly conducted in English. To this end, we present a benchmark,
CARE-MI, for evaluating LLM misinformation in: 1) a sensitive topic,
specifically the maternity and infant care domain; and 2) a language other than
English, namely Chinese. Most importantly, we provide an innovative paradigm
for building LF generation evaluation benchmarks that can be transferred to
other knowledge-intensive domains and low-resourced languages. Our proposed
benchmark fills the gap between the extensive usage of LLMs and the lack of
datasets for assessing the misinformation generated by these models. It
contains 1,612 expert-checked questions, accompanied by human-selected
references. Using our benchmark, we conduct extensive experiments and find
that current Chinese LLMs are far from perfect on the topic of maternity and
infant care. In an effort to minimize the reliance on human resources for
performance evaluation, we offer off-the-shelf judgment models for
automatically assessing the LF output of LLMs given benchmark questions.
Moreover, we compare potential solutions for LF generation evaluation and
provide insights for building better automated metrics.
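To make the evaluation pipeline the abstract describes more concrete, the sketch below illustrates that kind of loop: benchmark questions are posed to an LLM, the long-form answers are collected, and an automated judgment model scores them against the human-selected references. This is only an illustration, not the authors' released code: the file name care_mi.json, the JSON schema with "question" and "reference" fields, and the two model functions ask_llm and judge are all assumptions standing in for the real LLM under test and the paper's off-the-shelf judgment models.

    # Minimal sketch of a CARE-MI-style evaluation loop (hypothetical, not the authors' code).
    import json

    def ask_llm(question: str) -> str:
        """Placeholder for the LLM under evaluation (e.g. a Chinese chat model)."""
        return "模型生成的长文本回答……"  # stands in for the model's long-form answer

    def judge(question: str, answer: str, reference: str) -> float:
        """Placeholder for an automated judgment model returning a factuality score in [0, 1]."""
        # Naive stand-in: a learned judge would compare the answer against the reference.
        return 1.0 if reference in answer else 0.0

    def evaluate(benchmark_path: str) -> float:
        """Average judgment score over all benchmark questions."""
        with open(benchmark_path, encoding="utf-8") as f:
            items = json.load(f)  # assumed schema: [{"question": ..., "reference": ...}, ...]
        scores = [
            judge(item["question"], ask_llm(item["question"]), item["reference"])
            for item in items
        ]
        return sum(scores) / len(scores)

    if __name__ == "__main__":
        print(f"mean factuality score: {evaluate('care_mi.json'):.3f}")

Under this setup, swapping in a different LLM or judgment model only requires replacing the two placeholder functions; the aggregation over the 1,612 questions stays the same.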
Related papers
- Evaluating Large Language Model Capability in Vietnamese Fact-Checking Data Generation [1.0173628293062005]
Large Language Models (LLMs) are being applied to a range of complex language tasks.
This paper explores the use of LLMs for automatic data generation for the Vietnamese fact-checking task.
We develop an automatic data construction process using simple prompt techniques and explore several methods to improve the quality of the generated data.
arXiv Detail & Related papers (2024-11-08T15:35:43Z)
- ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding [15.93642619347214]
We introduce ProverbEval, an evaluation benchmark for low-resource languages based on proverbs.
We benchmark various LLMs and explore factors that create variability in the benchmarking process.
We argue special attention must be given to the order of choices, choice of prompt language, task variability, and generation tasks.
arXiv Detail & Related papers (2024-11-07T06:34:48Z)
- What do Large Language Models Need for Machine Translation Evaluation? [12.42394213466485]
Large language models (LLMs) can achieve results comparable to fine-tuned multilingual pre-trained language models.
This paper explores what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate machine translation quality.
arXiv Detail & Related papers (2024-10-04T09:50:45Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes to out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs [8.526956860672698]
Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities.
This study investigates the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models.
arXiv Detail & Related papers (2023-11-01T17:42:45Z)
- FELM: Benchmarking Factuality Evaluation of Large Language Models [40.78878196872095]
We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as felm.
We collect responses generated from large language models and annotate factuality labels in a fine-grained manner.
Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory at faithfully detecting factual errors.
arXiv Detail & Related papers (2023-10-01T17:37:31Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal how well language models understand the questions they are asked.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the results of LLM evaluation are consistent with those obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)