Documents Are People and Words Are Items: A Psychometric Approach to Textual Data with Contextual Embeddings
- URL: http://arxiv.org/abs/2509.08920v1
- Date: Wed, 10 Sep 2025 18:31:37 GMT
- Title: Documents Are People and Words Are Items: A Psychometric Approach to Textual Data with Contextual Embeddings
- Authors: Jinsong Chen,
- Abstract summary: This research introduces a novel psychometric method for analyzing textual data using large language models.<n>By leveraging contextual embeddings, we transform textual data into response data suitable for psychometric analysis.
- Score: 2.1494179586067537
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This research introduces a novel psychometric method for analyzing textual data using large language models. By leveraging contextual embeddings to create contextual scores, we transform textual data into response data suitable for psychometric analysis. Treating documents as individuals and words as items, this approach provides a natural psychometric interpretation under the assumption that certain keywords, whose contextual meanings vary significantly across documents, can effectively differentiate documents within a corpus. The modeling process comprises two stages: obtaining contextual scores and performing psychometric analysis. In the first stage, we utilize natural language processing techniques and encoder based transformer models to identify common keywords and generate contextual scores. In the second stage, we employ various types of factor analysis, including exploratory and bifactor models, to extract and define latent factors, determine factor correlations, and identify the most significant words associated with each factor. Applied to the Wiki STEM corpus, our experimental results demonstrate the method's potential to uncover latent knowledge dimensions and patterns within textual data. This approach not only enhances the psychometric analysis of textual data but also holds promise for applications in fields rich in textual information, such as education, psychology, and law.
Related papers
- Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results [0.0]
This paper examines how discrepancies in tokenisation affect the representation of language data and the validity of analytical findings.<n>The research presents methods for ensuring that digital texts are accurately represented in corpora.
arXiv Detail & Related papers (2025-07-02T14:46:26Z) - The Text Classification Pipeline: Starting Shallow going Deeper [4.97309503788908]
The past decade has seen deep learning revolutionize text classification.<n>English is the predominant language of focus, despite studies involving Arabic, Chinese, Hindi, and others.<n>This work integrates traditional and contemporary text mining methodologies, fostering a holistic understanding of text classification.
arXiv Detail & Related papers (2024-12-30T23:01:19Z) - BookWorm: A Dataset for Character Description and Analysis [59.186325346763184]
We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation.
We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses.
Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks.
arXiv Detail & Related papers (2024-10-14T10:55:58Z) - An Information-Theoretic Approach for Detecting Edits in AI-Generated Text [7.013432243663526]
We propose a method to determine whether a given article was written entirely by a generative language model or perhaps contains edits by a different author, possibly a human.
We demonstrate the effectiveness of the method in detecting edits through extensive evaluations using real data.
Our analysis raises several interesting research questions at the intersection of information theory and data science.
arXiv Detail & Related papers (2023-08-24T12:49:21Z) - Textual Entailment Recognition with Semantic Features from Empirical
Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the true value of the hypothesis follows the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z) - An Informational Space Based Semantic Analysis for Scientific Texts [62.997667081978825]
This paper introduces computational methods for semantic analysis and the quantifying the meaning of short scientific texts.
The representation of scientific-specific meaning is standardised by replacing the situation representations, rather than psychological properties.
The research in this paper conducts the base for the geometric representation of the meaning of texts.
arXiv Detail & Related papers (2022-05-31T11:19:32Z) - TextEssence: A Tool for Interactive Analysis of Semantic Shifts Between
Corpora [14.844685568451833]
We introduce TextEssence, an interactive system designed to enable comparative analysis of corpora using embeddings.
TextEssence includes visual, neighbor-based, and similarity-based modes of embedding analysis in a lightweight, web-based interface.
arXiv Detail & Related papers (2021-03-19T21:26:28Z) - Improving Machine Reading Comprehension with Contextualized Commonsense
Knowledge [62.46091695615262]
We aim to extract commonsense knowledge to improve machine reading comprehension.
We propose to represent relations implicitly by situating structured knowledge in a context.
We employ a teacher-student paradigm to inject multiple types of contextualized knowledge into a student machine reader.
arXiv Detail & Related papers (2020-09-12T17:20:01Z) - A Survey on Text Classification: From Shallow to Deep Learning [83.47804123133719]
The last decade has seen a surge of research in this area due to the unprecedented success of deep learning.
This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021.
We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification.
arXiv Detail & Related papers (2020-08-02T00:09:03Z) - A computational model implementing subjectivity with the 'Room Theory'.
The case of detecting Emotion from Text [68.8204255655161]
This work introduces a new method to consider subjectivity and general context dependency in text analysis.
By using similarity measure between words, we are able to extract the relative relevance of the elements in the benchmark.
This method could be applied to all the cases where evaluating subjectivity is relevant to understand the relative value or meaning of a text.
arXiv Detail & Related papers (2020-05-12T21:26:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.