Human-in-the-loop Evaluation for Early Misinformation Detection: A Case
Study of COVID-19 Treatments
- URL: http://arxiv.org/abs/2212.09683v4
- Date: Mon, 3 Jul 2023 07:04:45 GMT
- Title: Human-in-the-loop Evaluation for Early Misinformation Detection: A Case
Study of COVID-19 Treatments
- Authors: Ethan Mendes, Yang Chen, Wei Xu, Alan Ritter
- Abstract summary: We present a human-in-the-loop evaluation framework for fact-checking novel misinformation claims and identifying social media messages that support them.
Our approach extracts check-worthy claims, which are aggregated and ranked for review.
Stance classifiers are then used to identify tweets supporting novel misinformation claims, which are further reviewed to determine whether they violate relevant policies.
- Score: 19.954539961446496
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a human-in-the-loop evaluation framework for fact-checking novel
misinformation claims and identifying social media messages that support them.
Our approach extracts check-worthy claims, which are aggregated and ranked for
review. Stance classifiers are then used to identify tweets supporting novel
misinformation claims, which are further reviewed to determine whether they
violate relevant policies. To demonstrate the feasibility of our approach, we
develop a baseline system based on modern NLP methods for human-in-the-loop
fact-checking in the domain of COVID-19 treatments. We make our data and
detailed annotation guidelines available to support the evaluation of
human-in-the-loop systems that identify novel misinformation directly from raw
user-generated content.
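For concreteness, the following is a minimal sketch of the pipeline the abstract describes (claim extraction, aggregation and ranking for human review, then stance classification over tweets). It is not the authors' implementation: the `extract_claim` and `classify_stance` callables stand in for the trained NLP models, and all function and type names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional

# Hypothetical stand-ins for the trained NLP components: a claim extractor that
# maps a tweet to a normalized check-worthy claim (or None if the tweet makes
# no such claim), and a stance classifier that labels a (tweet, claim) pair as
# "support", "refute", or "comment". All names here are illustrative.
ClaimExtractor = Callable[[str], Optional[str]]
StanceClassifier = Callable[[str, str], str]


@dataclass
class FlaggedClaim:
    claim: str
    supporting_tweets: list[str] = field(default_factory=list)


def rank_claims_for_review(tweets: Iterable[str],
                           extract_claim: ClaimExtractor) -> list[tuple[str, int]]:
    """Steps 1-2: extract check-worthy claims, aggregate duplicates, and rank
    them by frequency so human fact-checkers review the most prevalent first."""
    counts: Counter = Counter()
    for tweet in tweets:
        claim = extract_claim(tweet)
        if claim is not None:
            counts[claim] += 1
    return counts.most_common()


def collect_supporting_tweets(tweets: Iterable[str],
                              misinformation_claims: Iterable[str],
                              classify_stance: StanceClassifier) -> list[FlaggedClaim]:
    """Step 3: for claims that human reviewers judged to be misinformation,
    use the stance classifier to surface supporting tweets; these are then
    reviewed again for potential policy violations."""
    tweet_list = list(tweets)
    flagged = []
    for claim in misinformation_claims:
        supporters = [t for t in tweet_list if classify_stance(t, claim) == "support"]
        flagged.append(FlaggedClaim(claim=claim, supporting_tweets=supporters))
    return flagged
```

Human judgments enter between the two functions: fact-checkers inspect the ranked claims, and only claims they label as misinformation are passed to `collect_supporting_tweets`.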
Related papers
- Defining and Detecting Vulnerability in Human Evaluation Guidelines: A Preliminary Study Towards Reliable NLG Evaluation [43.21663407946184]
Only 29.84% of recent papers involving human evaluation at top conferences release their evaluation guidelines.
We collect annotations of guidelines extracted from existing papers as well as generated via Large Language Models.
We introduce a taxonomy of eight vulnerabilities and formulate a principle for composing evaluation guidelines.
arXiv Detail & Related papers (2024-06-12T06:59:31Z)
- Towards Reliable and Factual Response Generation: Detecting Unanswerable Questions in Information-Seeking Conversations [16.99952884041096]
Generative AI models face the challenge of hallucinations that can undermine users' trust in such systems.
We approach the problem of conversational information seeking as a two-step process, where relevant passages in a corpus are identified first and then summarized into a final system response.
Specifically, our proposed method employs a sentence-level classifier to detect if the answer is present, then aggregates these predictions on the passage level, and eventually across the top-ranked passages to arrive at a final answerability estimate.
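The summary above describes a hierarchical aggregation: sentence-level answerability scores are pooled to the passage level and then across the top-ranked passages. The snippet below is a minimal, hypothetical sketch of that scheme; the pooling choices (max over sentences, mean over passages), the cutoff `top_k`, and the threshold are illustrative assumptions, not the paper's exact configuration.

```python
from statistics import mean


def passage_answerability(sentence_scores: list[float]) -> float:
    # Pool sentence-level classifier scores to a passage-level score
    # (max-pooling is an illustrative choice, not necessarily the paper's).
    return max(sentence_scores) if sentence_scores else 0.0


def final_answerability(ranked_passage_scores: list[list[float]],
                        top_k: int = 3,
                        threshold: float = 0.5) -> bool:
    # Aggregate across the top-ranked passages to a final answerability estimate.
    passage_scores = [passage_answerability(s) for s in ranked_passage_scores[:top_k]]
    return bool(passage_scores) and mean(passage_scores) >= threshold
```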
arXiv Detail & Related papers (2024-01-21T10:15:36Z)
- A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)
- Towards a Comprehensive Human-Centred Evaluation Framework for Explainable AI [1.7222662622390634]
We propose to adapt the User-Centric Evaluation Framework used in recommender systems.
We integrate explanation aspects, summarise explanation properties, indicate relations between them, and categorise metrics that measure these properties.
arXiv Detail & Related papers (2023-07-31T09:20:16Z)
- C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation [68.59356746305255]
We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
arXiv Detail & Related papers (2023-06-27T06:58:03Z)
- Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation [68.9440575276396]
This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation.
First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization.
Second, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models.
Third, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention.
arXiv Detail & Related papers (2023-05-01T17:36:06Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Counterfactually Evaluating Explanations in Recommender Systems [14.938252589829673]
We propose an offline evaluation method that can be computed without human involvement.
We show that, compared to conventional methods, our method can produce evaluation scores more correlated with the real human judgments.
arXiv Detail & Related papers (2022-03-02T18:55:29Z)
- A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z)
- Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education.
Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding.
We develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates.
arXiv Detail & Related papers (2020-02-10T00:26:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.