Related papers: ChatGPT Rates Natural Language Explanation Quality Like Humans: But on Which Scales?

URL: http://arxiv.org/abs/2403.17368v1
Date: Tue, 26 Mar 2024 04:07:08 GMT
Title: ChatGPT Rates Natural Language Explanation Quality Like Humans: But on Which Scales?
Authors: Fan Huang, Haewoon Kwak, Kunwoo Park, Jisun An
Abstract summary: This study explores the alignment between ChatGPT and human assessments across multiple scales. We sample 300 data instances from three NLE datasets and collect 900 human annotations for both informativeness and clarity scores. Our results show that ChatGPT aligns better with humans in more coarse-grained scales.
Score: 7.307538454513983
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As AI becomes more integral in our lives, the need for transparency and responsibility grows. While natural language explanations (NLEs) are vital for clarifying the reasoning behind AI decisions, evaluating them through human judgments is complex and resource-intensive due to subjectivity and the need for fine-grained ratings. This study explores the alignment between ChatGPT and human assessments across multiple scales (i.e., binary, ternary, and 7-point Likert scales). We sample 300 data instances from three NLE datasets and collect 900 human annotations for both informativeness and clarity scores as the text quality measurement. We further conduct paired comparison experiments under different ranges of subjectivity scores, where the baseline comes from 8,346 human annotations. Our results show that ChatGPT aligns better with humans in more coarse-grained scales. Also, paired comparisons and dynamic prompting (i.e., providing semantically similar examples in the prompt) improve the alignment. This research advances our understanding of large language models' capabilities to assess text explanation quality in different configurations for responsible AI development.

Related papers

What Makes a Good Natural Language Prompt? [72.3282960118995]
We conduct a meta-analysis surveying more than 150 prompting-related papers from leading NLP and AI conferences from 2022 to 2025. We propose a property- and human-centric framework for evaluating prompt quality, encompassing 21 properties categorized into six dimensions. We then empirically explore multi-property prompt enhancements in reasoning tasks, observing that single-property enhancements often have the greatest impact.
arXiv Detail & Related papers (2025-06-07T23:19:27Z)
Analyzing Feedback Mechanisms in AI-Generated MCQs: Insights into Readability, Lexical Properties, and Levels of Challenge [0.0]
This study delves into the linguistic and structural attributes of feedback generated by Google's Gemini 1.5-flash text model for computer science multiple-choice questions (MCQs). Key linguistic metrics, such as length, readability scores (Flesch-Kincaid Grade Level), vocabulary richness, and lexical density, were computed and examined. The findings reveal significant interaction effects between feedback tone and question difficulty, demonstrating the dynamic adaptation of AI-generated feedback within diverse educational contexts.
arXiv Detail & Related papers (2025-04-19T09:20:52Z)
Turing Representational Similarity Analysis (RSA): A Flexible Method for Measuring Alignment Between Human and Artificial Intelligence [0.62914438169038]
We developed Turing Representational Similarity Analysis (RSA), a method that uses pairwise similarity ratings to quantify alignment between AIs and humans. We tested this approach on semantic alignment across text and image modalities, measuring how different Large Language and Vision Language Model (LLM and VLM) similarity judgments aligned with human responses at both group and individual levels.
arXiv Detail & Related papers (2024-11-30T20:24:52Z)
Trying to be human: Linguistic traces of stochastic empathy in language models [0.2638512174804417]
Large language models (LLMs) are crucial drivers behind the increased quality of computer-generated content. Our work tests how two important factors contribute to the human vs AI race: empathy and an incentive to appear human.
arXiv Detail & Related papers (2024-10-02T15:46:40Z)
Human Bias in the Face of AI: The Role of Human Judgement in AI Generated Text Evaluation [48.70176791365903]
This study explores how bias shapes the perception of AI versus human generated content. We investigated how human raters respond to labeled and unlabeled content.
arXiv Detail & Related papers (2024-09-29T04:31:45Z)
Strong and weak alignment of large language models with human values [1.6590638305972631]
Minimizing negative impacts of Artificial Intelligence (AI) systems requires them to be able to align with human values. We argue that this is required for AI systems like large language models (LLMs) to be able to recognize situations presenting a risk that human values may be flouted. We propose a new thought experiment that we call "the Chinese room with a word transition dictionary", in extension of John Searle's famous proposal.
arXiv Detail & Related papers (2024-08-05T11:27:51Z)
From Data to Commonsense Reasoning: The Use of Large Language Models for Explainable AI [0.0]
We study the effectiveness of large language models (LLMs) on different question answering tasks. We demonstrate the ability of LLMs to reason with commonsense as the models outperform humans on different datasets. Our questionnaire revealed that 66% of participants rated GPT-3.5's explanations as either "good" or "excellent".
arXiv Detail & Related papers (2024-07-04T09:38:49Z)
UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset. Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z)
Retrieval-based Disentangled Representation Learning with Natural Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning. Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish intrinsic dimensions that capture characteristics within data through its natural language counterpart, thereby achieving disentanglement.
arXiv Detail & Related papers (2022-12-15T10:20:42Z)
An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models. We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
Training Language Models with Natural Language Feedback [51.36137482891037]
We learn from language feedback on model outputs using a three-step learning algorithm. In synthetic experiments, we first evaluate whether language models accurately incorporate feedback to produce refinements. Using only 100 samples of human-written feedback, our learning algorithm finetunes a GPT-3 model to roughly human-level summarization.
arXiv Detail & Related papers (2022-04-29T15:06:58Z)
AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models. Our results indicate that autoscoring models, despite being trained as "end-to-end" models, behave like bag-of-words models. We propose detection-based protection models that can detect samples causing oversensitivity and overstability with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.