Individual Text Corpora Predict Openness, Interests, Knowledge and Level of Education
- URL: http://arxiv.org/abs/2404.00165v1
- Date: Fri, 29 Mar 2024 21:44:24 GMT
- Title: Individual Text Corpora Predict Openness, Interests, Knowledge and Level of Education
- Authors: Markus J. Hofmann, Markus T. Jansen, Christoph Wigbels, Benny Briesemeister, Arthur M. Jacobs
- Abstract summary: The personality dimension of openness to experience can be predicted from individual Google search histories.
Individual text corpora (ICs) were generated from 214 participants, with a mean of 5 million word tokens per participant.
- Score: 0.5825410941577593
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Here we examine whether the personality dimension of openness to experience can be predicted from individual Google search histories. By web scraping, individual text corpora (ICs) were generated from 214 participants, with a mean of 5 million word tokens per participant. We trained word2vec models and used the similarities of each IC to label words derived from a lexical approach to personality. These IC-label-word similarities were used as predictive features in neural models. For training and validation, we relied on 179 participants and held out a test sample of 35 participants. A grid search over the number of predictive features, hidden units and boost factor was performed. As the model selection criterion, we used R² in the validation samples penalized by the absolute R² difference between training and validation. The selected neural model explained 35% of the openness variance in the test sample, while an ensemble model with the same architecture often provided slightly more stable predictions for intellectual interests, knowledge in humanities and level of education. Finally, a learning curve analysis suggested that around 500 training participants are required for generalizable predictions. We discuss ICs as a complement to, or replacement of, survey-based psychodiagnostics.
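The abstract outlines a concrete pipeline, sketched minimally below. The feature definition (cosine similarity between each label word and the centroid of a participant's word2vec space), the placeholder label words, and all helper names are assumptions for illustration, not the authors' exact implementation; `selection_score` follows the stated criterion of validation R² penalized by the absolute train-validation gap.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics import r2_score

LABEL_WORDS = ["curious", "imaginative", "artistic"]  # hypothetical lexical markers

def ic_features(sentences, label_words=LABEL_WORDS):
    """Train word2vec on one individual corpus (IC) and return the cosine
    similarity of each label word to the centroid of the IC's vocabulary."""
    model = Word2Vec(sentences, vector_size=100, min_count=5, workers=4)
    centroid = model.wv.vectors.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    feats = []
    for word in label_words:
        if word in model.wv:
            vec = model.wv[word] / np.linalg.norm(model.wv[word])
            feats.append(float(vec @ centroid))
        else:
            feats.append(0.0)  # label word absent from this IC
    return np.array(feats)

def selection_score(model, X_tr, y_tr, X_va, y_va):
    """Selection criterion from the abstract: validation R^2 penalized by
    the absolute train-validation R^2 gap."""
    model.fit(X_tr, y_tr)
    r2_tr = r2_score(y_tr, model.predict(X_tr))
    r2_va = r2_score(y_va, model.predict(X_va))
    return r2_va - abs(r2_tr - r2_va)
```

In a grid search, each candidate architecture would be ranked by `selection_score`, so that models which overfit the training sample are penalized even when their validation R² is high.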
Related papers
- Toward Corpus Size Requirements for Training and Evaluating Depression Risk Models Using Spoken Language [7.6109649792432315]
This controlled study illustrates how variations in test and training set sizes impact model performance.
Results show that test sizes below 1K samples gave noisy results, even for larger training set sizes.
Training set sizes of at least 2K were needed for stable results.
arXiv Detail & Related papers (2024-12-31T19:32:25Z)
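A toy simulation (not from the paper above) of why small test sets give noisy metrics: an accuracy estimate over n samples has standard error of roughly sqrt(p(1-p)/n), so its spread shrinks only slowly with test size.

```python
import numpy as np

rng = np.random.default_rng(0)
true_acc = 0.75  # assumed true accuracy of the evaluated model
for n in (100, 500, 1000, 2000, 5000):
    # simulate 1000 independent evaluations on a test set of size n
    accs = rng.binomial(n, true_acc, size=1000) / n
    print(f"test size {n:>5}: metric std = {accs.std():.4f}")
```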
- QuRating: Selecting High-Quality Data for Training Language Models [64.83332850645074]
We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality.
In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value.
We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B-token training corpus with quality ratings for each of the four criteria.
arXiv Detail & Related papers (2024-02-15T06:36:07Z)
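A minimal sketch of the pairwise-to-scalar idea described above, using a Bradley-Terry-style logistic objective; the linear `Rater` and its embedding inputs are illustrative stand-ins, not the QuRater implementation.

```python
import torch
import torch.nn as nn

class Rater(nn.Module):
    """Maps a text embedding to a scalar quality score."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # stand-in for a fine-tuned LM head

    def forward(self, x):
        return self.score(x).squeeze(-1)

def pairwise_loss(rater, x_preferred, x_other):
    """Bradley-Terry objective: P(preferred beats other) = sigmoid(score gap);
    minimizing softplus(-gap) maximizes its log-likelihood."""
    gap = rater(x_preferred) - rater(x_other)
    return nn.functional.softplus(-gap).mean()
```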
- Is my Data in your AI Model? Membership Inference Test with Application to Face Images [18.402616111394842]
This article introduces the Membership Inference Test (MINT), a novel approach that aims to empirically assess whether given data was used during the training of AI/ML models.
We propose two MINT architectures designed to learn the distinct activation patterns that emerge when an Audited Model is exposed to data used during its training process.
Experiments are carried out using six publicly available databases, comprising over 22 million face images in total.
arXiv Detail & Related papers (2024-02-14T15:09:01Z)
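A hedged sketch of the MINT idea above: train a small classifier on the audited model's internal activations to flag training-set membership. The `MintHead` name, the architecture, and the hook-based activation capture are assumptions, not the paper's two MINT architectures.

```python
import torch
import torch.nn as nn

class MintHead(nn.Module):
    """Binary classifier over an audited model's activations:
    a positive logit suggests the sample was seen during training."""
    def __init__(self, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, activations):
        return self.net(activations).squeeze(-1)

# Activations can be captured from the audited model with a forward hook,
# e.g. layer.register_forward_hook(lambda m, i, o: store.append(o.detach())),
# then flattened and fed to MintHead together with member/non-member labels.
```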
- A Predictive Model of Digital Information Engagement: Forecasting User Engagement With English Words by Incorporating Cognitive Biases, Computational Linguistics and Natural Language Processing [3.09766013093045]
This study introduces and empirically tests a novel predictive model for digital information engagement (IE).
The READ model integrates key cognitive biases with computational linguistics and natural language processing to develop a multidimensional perspective on information engagement.
The READ model's potential extends across various domains, including business, education, government, and healthcare.
arXiv Detail & Related papers (2023-07-26T20:58:47Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
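An illustrative combination of the two ingredients named above: selective prediction that abstains below a confidence threshold, and an active-learning step that queries the lowest-margin samples. The threshold, the margin criterion, and the function names are assumptions, not the ASPEST algorithm itself.

```python
import numpy as np

def selective_predict(probs, threshold=0.7):
    """Predict the argmax class, or abstain (-1) when confidence is low."""
    conf = probs.max(axis=1)
    return np.where(conf >= threshold, probs.argmax(axis=1), -1)

def query_indices(probs, budget=10):
    """Active-learning step: request labels for the smallest-margin samples."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    return np.argsort(margin)[:budget]
```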
- Scaling Laws for Generative Mixed-Modal Language Models [103.25737824352949]
We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them.
Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws.
We also identify four empirical phenomena observed during training, such as emergent coordinate-ascent-style training that naturally alternates between modalities.
arXiv Detail & Related papers (2023-01-10T00:20:06Z)
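A purely illustrative shape for such a law: uni-modal power-law terms plus an additive interaction term for a modality pair. All constants and the interaction form below are invented for illustration; see the paper for the fitted functional form.

```python
def unimodal_loss(N, D, E=1.0, A=100.0, B=200.0, alpha=0.3, beta=0.3):
    """Standard uni-modal form: irreducible loss plus power-law terms in
    model size N and data size D (constants are made up)."""
    return E + A / N**alpha + B / D**beta

def mixed_modal_loss(N, D_i, D_j, C=5.0, gamma=0.2):
    """Toy additive interaction term on top of the uni-modal law, standing in
    for the paper's synergy/competition term between modalities i and j."""
    return unimodal_loss(N, D_i + D_j) + C / (N * (D_i + D_j)) ** gamma
```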
- Conformal prediction for the design problem [72.14982816083297]
In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
arXiv Detail & Related papers (2022-02-08T02:59:12Z)
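For reference, a minimal split conformal regression sketch with the standard exchangeability guarantee; the paper's contribution is handling the feedback covariate shift that arises when predictions choose the next test points, which this plain baseline does not address.

```python
import numpy as np

def split_conformal_interval(predict, X_cal, y_cal, X_new, alpha=0.1):
    """Split conformal regression: calibrate a residual quantile on held-out
    data, then widen point predictions by it for ~(1 - alpha) coverage."""
    scores = np.abs(y_cal - predict(X_cal))          # calibration residuals
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level)
    mu = predict(X_new)
    return mu - q, mu + q
```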
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Plinko: A Theory-Free Behavioral Measure of Priors for Statistical Learning and Mental Model Updating [62.997667081978825]
We present three experiments using "Plinko", a behavioral task in which participants estimate distributions of ball drops over all available outcomes.
We show that participant priors cluster around prototypical probability distributions and that prior cluster membership may indicate learning ability.
We verify that individual participant priors are reliable representations and that learning is not impeded when faced with a physically implausible ball drop distribution.
arXiv Detail & Related papers (2021-07-23T22:27:30Z)
- A framework for predicting, interpreting, and improving Learning Outcomes [0.0]
We develop an Embibe Score Quotient model (ESQ) to predict test scores based on observed academic, behavioral and test-taking features of a student.
ESQ can be used to predict the future scoring potential of a student as well as offer personalized learning nudges.
arXiv Detail & Related papers (2020-10-06T11:22:27Z)
- On the Predictive Power of Neural Language Models for Human Real-Time Comprehension Behavior [29.260666424382446]
We test over two dozen models on how well their next-word expectations predict human reading time on naturalistic text corpora.
We evaluate how features of these models determine their psychometric predictive power, or ability to predict human reading behavior.
For any given perplexity, deep Transformer models and n-gram models show superior psychometric predictive power over LSTM or structurally supervised neural models.
arXiv Detail & Related papers (2020-06-02T19:47:01Z)
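A hedged sketch of the evaluation idea above: compute per-word surprisal from a causal language model and use it to predict reading times. The GPT-2 checkpoint and the regression step are illustrative choices, not the paper's full model set.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_surprisals(text):
    """Per-token surprisal (in nats) of each token given its left context."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    return -logp.gather(-1, ids[:, 1:, None]).squeeze(-1).squeeze(0)

# Reading times can then be regressed on these surprisals (plus controls such
# as word length and frequency) to measure psychometric predictive power.
```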
- Personality Assessment from Text for Machine Commonsense Reasoning [15.348792748868643]
PerSense is a framework to estimate human personality traits based on expressed texts.
Our goal is to demonstrate the feasibility of using machine learning algorithms on personality trait data.
arXiv Detail & Related papers (2020-04-15T07:30:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.