Supernova Event Dataset: Interpreting Large Language Models' Personality through Critical Event Analysis
- URL: http://arxiv.org/abs/2506.12189v2
- Date: Sun, 22 Jun 2025 23:32:27 GMT
- Title: Supernova Event Dataset: Interpreting Large Language Models' Personality through Critical Event Analysis
- Authors: Pranav Agarwal, Ioana Ciucă
- Abstract summary: Large Language Models (LLMs) are increasingly integrated into everyday applications. In this work, we interpret model personality using our proposed Supernova Event dataset. We evaluate small models like Phi-4, Orca 2, and Qwen 2.5, and large, stronger models such as Claude 3.7, Gemini 2.5, and OpenAI o3.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly integrated into everyday applications. As their influence grows, understanding their decision making and underlying personality becomes essential. In this work, we interpret model personality using our proposed Supernova Event Dataset, a novel dataset with diverse articles spanning biographies, historical events, news, and scientific discoveries. We use this dataset to benchmark LLMs on extracting and ranking key events from text, a subjective and complex challenge that requires reasoning over long-range context and modeling causal chains. We evaluate small models like Phi-4, Orca 2, and Qwen 2.5, and large, stronger models such as Claude 3.7, Gemini 2.5, and OpenAI o3, and propose a framework where another LLM acts as a judge to infer each model's personality based on its selection and classification of events. Our analysis shows distinct personality traits: for instance, Orca 2 demonstrates emotional reasoning focusing on interpersonal dynamics, while Qwen 2.5 displays a more strategic, analytical style. When analyzing scientific discovery events, Claude Sonnet 3.7 emphasizes conceptual framing, Gemini 2.5 Pro prioritizes empirical validation, and o3 favors step-by-step causal reasoning. This analysis improves model interpretability, making them user-friendly for a wide range of diverse applications. Project Page - https://www.supernova-event.ai/
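The paper's framework pairs candidate models with a separate judge LLM: each model extracts and ranks key events from an article, and the judge infers a personality profile from that selection. The authors' implementation is not reproduced here; the following is a minimal sketch of the two-stage loop in which `call_model` and `call_judge` are hypothetical stand-ins for real LLM API calls.

```python
# Sketch of the extract-rank-judge pipeline: each candidate model ranks key
# events from an article, then a judge model maps the ranking to a trait.
# Both model calls are stubbed; a real pipeline would prompt actual LLMs.

def call_model(model_name, article):
    """Placeholder: ask `model_name` to extract and rank key events."""
    # A real implementation would prompt an LLM with the article text.
    return ["event A: discovery announced", "event B: empirical validation"]

def call_judge(model_name, ranked_events):
    """Placeholder: a judge LLM infers a trait from the event ranking."""
    # Toy heuristic standing in for the judge's free-form analysis.
    if any("empirical" in e for e in ranked_events[:1]):
        return "empirically oriented"
    return "conceptually oriented"

def profile_models(models, article):
    """Run the extract-rank-judge loop and collect one trait per model."""
    profiles = {}
    for name in models:
        events = call_model(name, article)
        profiles[name] = call_judge(name, events)
    return profiles

if __name__ == "__main__":
    print(profile_models(["model-x"], "An article about a supernova..."))
```

In the paper the judge reasons over both the selection and the classification of events; the heuristic above merely illustrates where that inference step sits in the loop.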
Related papers
- Fact-or-Fair: A Checklist for Behavioral Testing of AI Models on Fairness-Related Queries [85.909363478929]
In this study, we focus on 19 real-world statistics collected from authoritative sources. We develop a checklist comprising objective and subjective queries to analyze the behavior of large language models. We propose metrics to assess factuality and fairness, and formally prove the inherent trade-off between these two aspects.
arXiv Detail & Related papers (2025-02-09T10:54:11Z) - Orca: Enhancing Role-Playing Abilities of Large Language Models by Integrating Personality Traits [4.092862870428798]
We propose Orca, a framework for data processing and training LLMs of custom characters by integrating personality traits.
Orca comprises four stages, beginning with personality trait inference, which leverages LLMs to infer a user's Big Five personality trait reports and scores.
Our experiments demonstrate that our proposed model achieves superior performance on this benchmark.
arXiv Detail & Related papers (2024-11-15T07:35:47Z) - Can Large Language Models do Analytical Reasoning? [45.69642663863077]
This paper explores the analytical reasoning abilities of cutting-edge Large Language Models on sports data.
We find that GPT-4 stands out in effectiveness, followed by Claude-2.1, with GPT-3.5, Gemini-Pro, and Llama-2-70b lagging behind.
To our surprise, we observe that most models, including GPT-4, struggle to accurately count the total scores for NBA quarters despite showing strong performance in counting NFL quarter scores.
arXiv Detail & Related papers (2024-03-06T20:22:08Z) - SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in Speech [0.0]
This study ventures into predicting age, gender, and emotion from vocal cues, a field with vast applications.
Exploring deep learning models for these predictions involves comparing the single-output, multi-output, and sequential models highlighted in this paper.
The experiments suggest that multi-output models perform comparably to individual models, efficiently capturing the intricate relationships between variables and speech inputs, all while achieving improved runtime.
arXiv Detail & Related papers (2024-03-01T11:28:37Z) - MONAL: Model Autophagy Analysis for Modeling Human-AI Interactions [11.972017738888825]
We propose Model Autophagy Analysis (MONAL) to explain the self-consumption behavior of large models.
MONAL employs two distinct autophagous loops to elucidate the suppression of human-generated information in the exchange between human and AI systems.
We evaluate the capacities of generated models as both creators and disseminators of information.
arXiv Detail & Related papers (2024-02-17T13:02:54Z) - ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark our dataset against several state-of-the-art language-only and multimodal models and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z) - CLEVRER-Humans: Describing Physical and Causal Events the Human Way [51.68416979907198]
We present the CLEVRER-Humans benchmark, a video dataset for causal judgment of physical events with human labels. We employ two techniques to improve data collection efficiency: first, a novel iterative event cloze task to elicit a new representation of events in videos, which we term Causal Event Graphs (CEGs); second, a data augmentation technique based on neural language generative models.
arXiv Detail & Related papers (2023-10-05T16:09:48Z) - Estimating the Personality of White-Box Language Models [0.589889361990138]
Large-scale language models, which are trained on large corpora of text, are being used in a wide range of applications everywhere.
Existing research shows that these models can and do capture human biases.
Many of these biases, especially those that could potentially cause harm, have been well investigated.
However, studies that infer and change human personality traits inherited by these models have been scarce or non-existent.
arXiv Detail & Related papers (2022-04-25T23:53:53Z) - StyleGAN-Human: A Data-Centric Odyssey of Human Generation [96.7080874757475]
This work takes a data-centric perspective and investigates multiple critical aspects in "data engineering".
We collect and annotate a large-scale human image dataset with over 230K samples capturing diverse poses and textures.
We rigorously investigate three essential factors in data engineering for StyleGAN-based human generation, namely data size, data distribution, and data alignment.
arXiv Detail & Related papers (2022-04-25T17:55:08Z) - MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z) - Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling [81.33107307509718]
We propose a topic adaptive storyteller to model the ability of inter-topic generalization.
We also propose a prototype encoding structure to model the ability of intra-topic derivation.
Experimental results show that topic adaptation and prototype encoding structure mutually bring benefit to the few-shot model.
arXiv Detail & Related papers (2020-08-11T03:55:11Z) - Modeling Shared Responses in Neuroimaging Studies through MultiView ICA [94.31804763196116]
Group studies involving large cohorts of subjects are important to draw general conclusions about brain functional organization.
We propose a novel MultiView Independent Component Analysis model for group studies, where data from each subject are modeled as a linear combination of shared independent sources plus noise.
We demonstrate the usefulness of our approach first on fMRI data, where our model demonstrates improved sensitivity in identifying common sources among subjects.
arXiv Detail & Related papers (2020-06-11T17:29:53Z)
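The MultiView ICA entry above describes a simple generative model: each subject's data is a subject-specific linear mixing of shared independent sources plus noise. A minimal simulation of that model, with illustrative (assumed) dimensions and noise level, looks like:

```python
import numpy as np

# Sketch of the MultiView ICA generative model: for subject i,
#   X_i = A_i @ S + N_i
# where S holds shared independent sources, A_i is a subject-specific
# mixing matrix, and N_i is subject-specific noise. The dimensions and
# noise scale below are illustrative assumptions, not the paper's values.

rng = np.random.default_rng(0)
n_sources, n_samples, n_subjects = 3, 1000, 5

# Shared sources are drawn non-Gaussian (Laplace), as ICA requires.
S = rng.laplace(size=(n_sources, n_samples))

subject_data = []
for _ in range(n_subjects):
    A_i = rng.normal(size=(n_sources, n_sources))        # per-subject mixing
    N_i = 0.1 * rng.normal(size=(n_sources, n_samples))  # per-subject noise
    subject_data.append(A_i @ S + N_i)

print(len(subject_data), subject_data[0].shape)
```

Fitting the model then amounts to recovering S (and each A_i) from the observed `subject_data` alone, which is what the proposed MultiView ICA estimator does.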
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.