Automated scoring of the Ambiguous Intentions Hostility Questionnaire using fine-tuned large language models
- URL: http://arxiv.org/abs/2508.10007v1
- Date: Tue, 05 Aug 2025 21:58:11 GMT
- Title: Automated scoring of the Ambiguous Intentions Hostility Questionnaire using fine-tuned large language models
- Authors: Y. Lyu, D. Combs, D. Neumann, Y. C. Leong,
- Abstract summary: The Ambiguous Intentions Hostility Questionnaire (AIHQ) is commonly used to measure hostile attribution bias.<n>We assessed whether large language models can automate the scoring of AIHQ open-ended responses.<n>Results showed that model-generated ratings aligned with human ratings for both attributions of hostility and aggression responses.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Hostile attribution bias is the tendency to interpret social interactions as intentionally hostile. The Ambiguous Intentions Hostility Questionnaire (AIHQ) is commonly used to measure hostile attribution bias, and includes open-ended questions where participants describe the perceived intentions behind a negative social situation and how they would respond. While these questions provide insights into the contents of hostile attributions, they require time-intensive scoring by human raters. In this study, we assessed whether large language models can automate the scoring of AIHQ open-ended responses. We used a previously collected dataset in which individuals with traumatic brain injury (TBI) and healthy controls (HC) completed the AIHQ and had their open-ended responses rated by trained human raters. We used half of these responses to fine-tune the two models on human-generated ratings, and tested the fine-tuned models on the remaining half of AIHQ responses. Results showed that model-generated ratings aligned with human ratings for both attributions of hostility and aggression responses, with fine-tuned models showing higher alignment. This alignment was consistent across ambiguous, intentional, and accidental scenario types, and replicated previous findings on group differences in attributions of hostility and aggression responses between TBI and HC groups. The fine-tuned models also generalized well to an independent nonclinical dataset. To support broader adoption, we provide an accessible scoring interface that includes both local and cloud-based options. Together, our findings suggest that large language models can streamline AIHQ scoring in both research and clinical contexts, revealing their potential to facilitate psychological assessments across different populations.
Related papers
- Whom to Query for What: Adaptive Group Elicitation via Multi-Turn LLM Interactions [13.900123583700472]
We study adaptive group elicitation, a multi-round setting where an agent adaptively selects both questions and respondents under explicit query and participation budgets.<n>We propose a theoretically grounded framework that combines (i) an LLM-based expected information gain objective for scoring candidate questions with (ii) heterogeneous graph neural network propagation that aggregates observed responses and participant attributes to impute missing responses and guide per-round respondent selection.<n>Across three real-world opinion datasets, our method consistently improves population-level response prediction under constrained budgets, including a >12% relative gain on CES at a 10% respondent budget.
arXiv Detail & Related papers (2026-02-15T19:05:34Z) - More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment [80.04449725137177]
Direct Preference Optimization (DPO) has emerged as a simple, yet effective alternative to reinforcement learning from human feedback.<n>Our study reveals a striking, safety-specific phenomenon associated with DPO alignment.<n>Using solely self-generated responses for both chosen and rejected pairs significantly outperforms configurations that incorporate responses from stronger models.
arXiv Detail & Related papers (2025-04-03T00:36:40Z) - UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation [36.40760924116748]
Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA)<n>Existing evaluation methods face limitations due to the significant human workload required to design Q&A pairs for visual images.<n>We propose an Unsupervised Peer review MLLM Evaluation framework, which allows models to automatically generate questions and conduct peer review assessments of answers from other models.
arXiv Detail & Related papers (2025-03-19T07:15:41Z) - Users Favor LLM-Generated Content -- Until They Know It's AI [0.0]
We investigate how individuals evaluate human and large langue models generated responses to popular questions when the source of the content is either concealed or disclosed.<n>Our findings indicate that, overall, participants tend to prefer AI-generated responses.<n>When the AI origin is revealed, this preference diminishes significantly, suggesting that evaluative judgments are influenced by the disclosure of the response's provenance.
arXiv Detail & Related papers (2025-02-23T11:14:02Z) - Automated Speaking Assessment of Conversation Tests with Novel Graph-based Modeling on Spoken Response Coherence [11.217656140423207]
ASAC aims to evaluate the overall speaking proficiency of an L2 speaker in a setting where an interlocutor interacts with one or more candidates.<n>We propose a hierarchical graph model that aptly incorporates both broad inter-response interactions and nuanced semantic information.<n>Extensive experimental results on the NICT-JLE benchmark dataset suggest that our proposed modeling approach can yield considerable improvements in prediction accuracy.
arXiv Detail & Related papers (2024-09-11T07:24:07Z) - ChatGPT vs Social Surveys: Probing Objective and Subjective Silicon Population [7.281887764378982]
Large Language Models (LLMs) have the potential to simulate human responses in social surveys and generate reliable predictions.<n>We employ repeated random sampling to create sampling distributions that identify the population parameters of silicon samples generated by GPT.<n>Our findings show that GPT's demographic distribution aligns with the 2020 U.S. population in terms of gender and average age.<n> GPT's point estimates for attitudinal scores are highly inconsistent and show no clear inclination toward any particular ideology.
arXiv Detail & Related papers (2024-09-04T10:33:37Z) - Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z) - HANS, are you clever? Clever Hans Effect Analysis of Neural Systems [1.6267479602370545]
Large Language Models (It-LLMs) have been exhibiting outstanding abilities to reason around cognitive states, intentions, and reactions of all people involved, letting humans guide and comprehend day-to-day social interactions effectively.
Several multiple-choice questions (MCQ) benchmarks have been proposed to construct solid assessments of the models' abilities.
However, earlier works are demonstrating the presence of inherent "order bias" in It-LLMs, posing challenges to the appropriate evaluation.
arXiv Detail & Related papers (2023-09-21T20:52:18Z) - Questioning the Survey Responses of Large Language Models [25.14481433176348]
We critically examine the methodology on the basis of the well-established American Community Survey by the U.S. Census Bureau.<n>We establish two dominant patterns. First, models' responses are governed by ordering and labeling biases, for example, towards survey responses labeled with the letter "A"<n>Second, when adjusting for these systematic biases through randomized answer ordering, models across the board trend towards uniformly random survey responses.
arXiv Detail & Related papers (2023-06-13T17:48:27Z) - Measuring the Effect of Influential Messages on Varying Personas [67.1149173905004]
We present a new task, Response Forecasting on Personas for News Media, to estimate the response a persona might have upon seeing a news message.
The proposed task not only introduces personalization in the modeling but also predicts the sentiment polarity and intensity of each response.
This enables more accurate and comprehensive inference on the mental state of the persona.
arXiv Detail & Related papers (2023-05-25T21:01:00Z) - GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z) - Dialogue Response Ranking Training with Large-Scale Human Feedback Data [52.12342165926226]
We leverage social media feedback data to build a large-scale training dataset for feedback prediction.
We trained DialogRPT, a set of GPT-2 based models on 133M pairs of human feedback data.
Our ranker outperforms the conventional dialog perplexity baseline with a large margin on predicting Reddit feedback.
arXiv Detail & Related papers (2020-09-15T10:50:05Z) - A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.