Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application
- URL: http://arxiv.org/abs/2009.10277v1
- Date: Tue, 22 Sep 2020 02:15:05 GMT
- Title: Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application
- Authors: Chris J. Kennedy, Geoff Bacon, Alexander Sahn, Claudia von Vacano
- Abstract summary: We propose a method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning with the Constructing Measures approach to faceted Rasch item response theory (IRT).
We demonstrate this new method on a dataset of 50,000 social media comments sourced from YouTube, Twitter, and Reddit and labeled by 11,000 U.S.-based Amazon Mechanical Turk workers.
- Score: 63.10266319378212
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a general method for measuring complex variables on a continuous,
interval spectrum by combining supervised deep learning with the Constructing
Measures approach to faceted Rasch item response theory (IRT). We decompose the
target construct, hate speech in our case, into multiple constituent components
that are labeled as ordinal survey items. Those survey responses are
transformed via IRT into a debiased, continuous outcome measure. Our method
estimates the survey interpretation bias of the human labelers and eliminates
that influence on the generated continuous measure. We further estimate the
response quality of each labeler using faceted IRT, allowing responses from
low-quality labelers to be removed.
Our faceted Rasch scaling procedure integrates naturally with a multitask
deep learning architecture for automated prediction on new data. The ratings on
the theorized components of the target outcome are used as supervised, ordinal
variables for the neural networks' internal concept learning. We test the use
of an activation function (ordinal softmax) and loss function (ordinal
cross-entropy) designed to exploit the structure of ordinal outcome variables.
Our multitask architecture leads to a new form of model interpretation because
each continuous prediction can be directly explained by the constituent
components in the penultimate layer.
We demonstrate this new method on a dataset of 50,000 social media comments
sourced from YouTube, Twitter, and Reddit and labeled by 11,000 U.S.-based
Amazon Mechanical Turk workers to measure a continuous spectrum from hate
speech to counterspeech. We evaluate Universal Sentence Encoders, BERT, and
RoBERTa as language representation models for the comment text, and compare our
predictive accuracy to Google Jigsaw's Perspective API models, showing
significant improvement over this standard benchmark.
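To make the measurement step concrete, the scaling the abstract describes can be read as a many-facet Rasch model in which each labeler's severity enters as its own facet. The display below is an illustrative sketch in standard many-facet rating-scale notation, not the paper's own formulation; the symbols are our labels for the quantities the abstract names.

```latex
% Illustrative many-facet (rating scale) Rasch model; notation is ours, not the paper's.
% \theta_c : latent hate-speech level of comment c (the continuous, interval-scale measure)
% \delta_i : difficulty of ordinal survey item i (a theorized component of hate speech)
% \alpha_j : severity of labeler j (the "survey interpretation bias" that is removed)
% \tau_k   : threshold between adjacent ordinal categories k-1 and k
\log \frac{P_{cijk}}{P_{cij(k-1)}} = \theta_c - \delta_i - \alpha_j - \tau_k
```

In a model of this form, the estimated \(\theta_c\) values serve as the debiased, interval-scale outcome, while labeler-level severity and fit estimates are what would flag low-quality responses for removal, in line with the abstract's description.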
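The multitask prediction step can likewise be sketched in code. The snippet below is a hypothetical PyTorch illustration of the idea the abstract outlines: one ordinal output head per theorized component on top of a shared text embedding, with the component predictions forming an interpretable penultimate layer that a final linear map turns into the continuous score. The class names, the cumulative-link ("ordinal softmax"-style) parameterization, and the loss are our assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the multitask ordinal architecture described in the abstract.
# All names, dimensions, and the cumulative-link parameterization are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class OrdinalHead(nn.Module):
    """One survey item: maps a shared comment embedding to probabilities over
    K ordered categories via a cumulative-link parameterization."""

    def __init__(self, in_dim: int, n_categories: int):
        super().__init__()
        self.score = nn.Linear(in_dim, 1)  # single latent score per comment
        # Cutpoints start in increasing order; keeping them ordered during
        # training is assumed here for simplicity.
        self.cutpoints = nn.Parameter(torch.arange(n_categories - 1, dtype=torch.float))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        s = self.score(h)                                # (batch, 1)
        cdf = torch.sigmoid(self.cutpoints - s)          # P(y <= k), (batch, K-1)
        cdf = torch.cat([torch.zeros_like(s), cdf, torch.ones_like(s)], dim=1)
        return cdf[:, 1:] - cdf[:, :-1]                  # category probabilities, sum to 1


class MultitaskOrdinalModel(nn.Module):
    """Shared encoder output -> one ordinal head per theorized component.
    The per-item expected ratings form an interpretable penultimate layer;
    a final linear layer maps them to the continuous hate speech score."""

    def __init__(self, embed_dim: int, items: dict):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: OrdinalHead(embed_dim, k) for name, k in items.items()}
        )
        self.to_score = nn.Linear(len(items), 1)

    def forward(self, embeddings: torch.Tensor):
        item_probs = {name: head(embeddings) for name, head in self.heads.items()}
        components = torch.stack(
            [
                (p * torch.arange(p.shape[1], device=p.device, dtype=p.dtype)).sum(dim=1)
                for p in item_probs.values()
            ],
            dim=1,
        )  # (batch, n_items): expected ordinal rating per component
        return item_probs, self.to_score(components).squeeze(1)


def ordinal_nll(probs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood on the cumulative-link probabilities; one simple
    stand-in for an ordinal cross-entropy loss."""
    return F.nll_loss(torch.log(probs.clamp_min(1e-8)), targets)


# Example usage with hypothetical component items and random embeddings:
if __name__ == "__main__":
    items = {"sentiment": 5, "dehumanize": 5, "violence": 5}   # illustrative item names
    model = MultitaskOrdinalModel(embed_dim=768, items=items)  # e.g., a BERT-sized pooled output
    item_probs, score = model(torch.randn(4, 768))
    loss = sum(ordinal_nll(p, torch.randint(0, 5, (4,))) for p in item_probs.values())
```

With a sentence encoder such as USE, BERT, or RoBERTa supplying the embeddings, the item heads would be trained against the labelers' ordinal responses and the final layer against the IRT-scaled score; because the score is a linear function of the component predictions, each continuous prediction can be decomposed into its constituent components, as the abstract notes.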
Related papers
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z) - What's under the hood: Investigating Automatic Metrics on Meeting Summarization [7.234196390284036]
Meeting summarization has become a critical task considering the increase in online interactions.
Current default-used metrics struggle to capture observable errors, showing weak to mid-correlations.
Only a subset reacts accurately to specific errors, while most correlations show either unresponsiveness or failure to reflect the error's impact on summary quality.
arXiv Detail & Related papers (2024-04-17T07:15:07Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language
Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs)
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z) - Zero-Shot Automatic Pronunciation Assessment [19.971348810774046]
We propose a novel zero-shot APA method based on the pre-trained acoustic model, HuBERT.
Experimental results on speechocean762 demonstrate that the proposed method achieves comparable performance to supervised regression baselines.
arXiv Detail & Related papers (2023-05-31T05:17:17Z) - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROS, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z) - Not All Errors are Equal: Learning Text Generation Metrics using
Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos:
Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - Evaluating the reliability of acoustic speech embeddings [10.5754802112615]
Speech embeddings are fixed-size acoustic representations of variable-length speech sequences.
Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods.
We find that overall, ABX and MAP correlate with one another and with frequency estimation.
arXiv Detail & Related papers (2020-07-27T13:24:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.