A Framework for Authorial Clustering of Shorter Texts in Latent Semantic
Spaces
- URL: http://arxiv.org/abs/2011.15038v1
- Date: Mon, 30 Nov 2020 17:39:44 GMT
- Title: A Framework for Authorial Clustering of Shorter Texts in Latent Semantic
Spaces
- Authors: Rafi Trad, Myra Spiliopoulou
- Abstract summary: Authorial clustering involves grouping documents written by the same author or team of authors without any prior positive examples of an author's writing style or thematic preferences.
We propose a high-level framework which utilizes a compact data representation in a latent feature space derived with non-parametric topic modeling.
We report on experiments with 120 collections in three languages and two genres and show that the topic-based latent feature space provides a promising level of performance.
- Score: 4.18804572788063
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Authorial clustering involves the grouping of documents written by the same
author or team of authors without any prior positive examples of an author's
writing style or thematic preferences. For authorial clustering on shorter
texts (paragraph-length texts that are typically shorter than conventional
documents), the document representation is particularly important: very
high-dimensional feature spaces lead to data sparsity and suffer from serious
consequences like the curse of dimensionality, while feature selection may lead
to information loss. We propose a high-level framework which utilizes a compact
data representation in a latent feature space derived with non-parametric topic
modeling. Authorial clusters are identified thereafter in two scenarios: (a)
fully unsupervised and (b) semi-supervised where a small number of shorter
texts are known to belong to the same author (must-link constraints) or not
(cannot-link constraints). We report on experiments with 120 collections in
three languages and two genres and show that the topic-based latent feature
space provides a promising level of performance while reducing the
dimensionality by a factor of 1500 compared to state-of-the-art methods. We
also demonstrate that, while prior knowledge of the precise number of authors
(i.e. authorial clusters) contributes little additional quality, even limited
knowledge of constraints on authorial cluster membership leads to clear
performance improvements on this difficult task. Thorough experimentation with
standard metrics indicates that ample room for improvement remains for
authorial clustering, especially with shorter texts.
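The proposed pipeline (documents mapped into a compact latent topic space, then clustered) can be sketched roughly as follows. This is only a minimal illustration, not the authors' implementation: sklearn's parametric LDA stands in for the non-parametric topic model, and plain k-means with a post-hoc constraint check stands in for the two clustering scenarios.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

texts = [
    "the storm rolled over the grey harbour at dawn",
    "grey clouds gathered above the quiet harbour",
    "the model minimises a regularised training loss",
    "training converges once the loss stops decreasing",
]

# Compact latent representation instead of a sparse bag-of-words space
counts = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # shape: (n_docs, n_topics)

# Scenario (a): fully unsupervised authorial clustering
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_topics)

# Scenario (b): must-link / cannot-link constraints; a constrained
# algorithm (e.g. COP-KMeans) would enforce these during clustering,
# here they are only verified post hoc.
must_link = [(0, 1), (2, 3)]
cannot_link = [(0, 2)]
ok = all(labels[i] == labels[j] for i, j in must_link) and \
     all(labels[i] != labels[j] for i, j in cannot_link)
print(labels, ok)
```

The latent space has as many dimensions as topics, which is what yields the large dimensionality reduction over raw term-frequency features.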
Related papers
- SpeciaLex: A Benchmark for In-Context Specialized Lexicon Learning [4.1205832766381985]
SpeciaLex is a benchmark for evaluating a language model's ability to follow specialized lexicon-based constraints.
We present an empirical evaluation of 15 open and closed-source LLMs and discuss insights on how factors such as model scale, openness, setup, and recency affect performance upon evaluating with the benchmark.
arXiv Detail & Related papers (2024-07-18T08:56:02Z) - From Text Segmentation to Smart Chaptering: A Novel Benchmark for
Structuring Video Transcriptions [63.11097464396147]
We introduce YTSeg, a novel benchmark focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce MiniSeg, an efficient hierarchical segmentation model that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z) - Prompting Large Language Models for Topic Modeling [10.31712610860913]
We propose PromptTopic, a novel topic modeling approach that harnesses the advanced language understanding of large language models (LLMs).
It involves extracting topics at the sentence level from individual documents, then aggregating and condensing these topics into a predefined quantity, ultimately providing coherent topics for texts of varying lengths.
We benchmark PromptTopic against the state-of-the-art baselines on three vastly diverse datasets, establishing its proficiency in discovering meaningful topics.
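The two-step recipe described above (per-sentence topic extraction, then aggregation into a predefined number of topics) can be sketched as follows. This is only an illustrative sketch: `ask_llm` is a hypothetical placeholder, stubbed here with keyword rules, standing in for a real LLM prompt such as "Name the topic of this sentence in one word."

```python
from collections import Counter

def ask_llm(sentence):
    # Stub standing in for an LLM call; a real system would prompt a
    # model for the sentence's topic instead of matching keywords.
    for keyword, topic in [("goal", "sports"), ("election", "politics"),
                           ("stock", "finance")]:
        if keyword in sentence.lower():
            return topic
    return "misc"

def prompt_topic(documents, n_topics):
    # 1) extract a candidate topic for every sentence of every document
    candidates = [ask_llm(s) for doc in documents for s in doc.split(". ")]
    # 2) aggregate and condense into the predefined quantity of topics
    return [t for t, _ in Counter(candidates).most_common(n_topics)]

docs = ["The election results moved the stock market. Stocks fell sharply",
        "A late goal decided the match"]
top = prompt_topic(docs, 2)
print(top)
```

Because topics are extracted sentence by sentence, the same recipe applies to texts of varying lengths, which is the property the paper highlights.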
arXiv Detail & Related papers (2023-12-15T11:15:05Z) - Coherent Entity Disambiguation via Modeling Topic and Categorical
Dependency [87.16283281290053]
Previous entity disambiguation (ED) methods adopt a discriminative paradigm, where prediction is made based on matching scores between mention context and candidate entities.
We propose CoherentED, an ED system equipped with novel designs aimed at enhancing the coherence of entity predictions.
We achieve new state-of-the-art results on popular ED benchmarks, with an average improvement of 1.3 F1 points.
arXiv Detail & Related papers (2023-11-06T16:40:13Z) - Unsupervised Summarization with Customized Granularities [76.26899748972423]
We propose the first unsupervised multi-granularity summarization framework, GranuSum.
By inputting different numbers of events, GranuSum is capable of producing multi-granular summaries in an unsupervised manner.
arXiv Detail & Related papers (2022-01-29T05:56:35Z) - Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as or better than traditional approaches to problems arising in short texts.
arXiv Detail & Related papers (2021-06-15T20:55:55Z) - Relation Clustering in Narrative Knowledge Graphs [71.98234178455398]
Relational sentences in the original text are embedded (with SBERT) and clustered in order to merge semantically similar relations.
Preliminary tests show that such clustering can successfully detect similar relations and provides a valuable preprocessing step for semi-supervised approaches.
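The embed-then-cluster step can be sketched as follows. As a dependency-light stand-in, TF-IDF vectors replace the SBERT embeddings used by the paper; agglomerative clustering then merges similar relational sentences.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

relations = [
    "Alice is the mother of Bob",
    "Alice is Bob's mother",
    "Carol works for the ministry",
    "Carol is employed by the ministry",
]

# Stand-in embeddings; the paper would use SBERT sentence vectors here
vectors = TfidfVectorizer().fit_transform(relations).toarray()

# Sentences landing in the same cluster are treated as the same relation
merged = AgglomerativeClustering(n_clusters=2).fit_predict(vectors)
print(merged)
```

With real SBERT embeddings, paraphrases with little lexical overlap would also collapse into one cluster, which TF-IDF cannot capture.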
arXiv Detail & Related papers (2020-11-27T10:43:04Z) - Predicting Themes within Complex Unstructured Texts: A Case Study on
Safeguarding Reports [66.39150945184683]
We focus on the problem of automatically identifying the main themes in a safeguarding report using supervised classification approaches.
Our results show the potential of deep learning models to simulate subject-expert behaviour even for complex tasks with limited labelled data.
arXiv Detail & Related papers (2020-10-27T19:48:23Z) - Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical
Supervision from Extractive Summaries [46.183289748907804]
We propose SOE, a pipelined system that summarizes, outlines, and elaborates for long text generation.
SOE produces long texts with significantly better quality, along with faster convergence speed.
arXiv Detail & Related papers (2020-10-14T13:22:20Z) - BATS: A Spectral Biclustering Approach to Single Document Topic Modeling
and Segmentation [17.003488045214972]
Existing topic modeling and text segmentation methodologies generally require large datasets for training, limiting their capabilities when only small collections of text are available.
In developing a methodology to handle single documents, we face two major challenges.
First is sparse information: with access to only one document, we cannot train traditional topic models or deep learning algorithms.
Second is significant noise: a considerable portion of words in any single document will produce only noise and not help discern topics or segments.
arXiv Detail & Related papers (2020-08-05T16:34:33Z)
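Spectral biclustering, the technique named in the BATS title, simultaneously groups the rows (e.g. sentences) and columns (words) of a single document's count matrix, which is one way to work around the sparse-information and noise problems described above. A minimal sketch with sklearn's generic `SpectralBiclustering` on a toy matrix with two planted blocks (this illustrates biclustering in general, not the BATS method itself):

```python
import numpy as np
from sklearn.cluster import SpectralBiclustering

# Toy sentence-by-word count matrix with two planted topic blocks
rng = np.random.default_rng(0)
planted = np.kron(np.eye(2), np.ones((4, 4))) * 5 + rng.random((8, 8))

model = SpectralBiclustering(n_clusters=2, random_state=0).fit(planted)
print(model.row_labels_)     # sentence segments
print(model.column_labels_)  # word/topic groups
```

Row clusters correspond to document segments and column clusters to topic vocabularies, recovered jointly from the single matrix.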
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.