CommunityFish: A Poisson-based Document Scaling With Hierarchical
Clustering
- URL: http://arxiv.org/abs/2308.14873v1
- Date: Mon, 28 Aug 2023 19:52:18 GMT
- Title: CommunityFish: A Poisson-based Document Scaling With Hierarchical
Clustering
- Authors: Sami Diaf
- Abstract summary: This paper introduces CommunityFish as an augmented version of Wordfish based on a hierarchical clustering, namely the Louvain algorithm, on the word space to yield communities as semantic and independent n-grams emerging from the corpus.
This strategy emphasizes the interpretability of the results, since communities have a non-overlapping structure, hence a crucial informative power in discriminating parties or speakers, in addition to allowing a faster execution of the Poisson scaling model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document scaling has been a key component in text-as-data applications for
social scientists and a major field of interest for political researchers, who
aim at uncovering differences between speakers or parties with the help of
different probabilistic and non-probabilistic approaches. Yet, most of these
techniques are either built upon the agnostically bag-of-word hypothesis or use
prior information borrowed from external sources that might embed the results
with a significant bias. If the corpus has long been considered as a collection
of documents, it can also be seen as a dense network of connected words whose
structure could be clustered to differentiate independent groups of words,
based on their co-occurrences in documents, known as communities. This paper
introduces CommunityFish as an augmented version of Wordfish based on a
hierarchical clustering, namely the Louvain algorithm, on the word space to
yield communities as semantic and independent n-grams emerging from the corpus
and use them as an input to Wordfish method, instead of considering the word
space. This strategy emphasizes the interpretability of the results, since
communities have a non-overlapping structure, hence a crucial informative power
in discriminating parties or speakers, in addition to allowing a faster
execution of the Poisson scaling model. Aside from yielding communities,
assumed to be subtopic proxies, the application of this technique outperforms
the classic Wordfish model by highlighting historical developments in the U.S.
State of the Union addresses and was found to replicate the prevailing
political stance in Germany when using the corpus of parties' legislative
manifestos.
Related papers
- Verified authors shape X/Twitter discursive communities [0.24999074238880484]
We show that the core of ideological/discursive communities on X/Twitter can be effectively identified by uncovering the most informative interactions.
The analysis is performed considering three X/Twitter datasets related to the main political events of 2022 in Italy.
arXiv Detail & Related papers (2024-05-08T09:04:46Z) - Tracing the Genealogies of Ideas with Large Language Model Embeddings [0.0]
I present a novel method to detect intellectual influence across a large corpus.
I apply this method to vectorize sentences from a corpus of roughly 400,000 nonfiction books and academic publications from the 19th century.
arXiv Detail & Related papers (2024-01-13T18:42:27Z) - Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z) - Optimizing text representations to capture (dis)similarity between
political parties [1.2891210250935146]
We look at the problem of modeling pairwise similarities between political parties.
Our research question is what level of structural information is necessary to create robust text representation.
We evaluate our models on the manifestos of German parties for the 2021 federal election.
arXiv Detail & Related papers (2022-10-21T14:24:57Z) - Keywords and Instances: A Hierarchical Contrastive Learning Framework
Unifying Hybrid Granularities for Text Generation [59.01297461453444]
We propose a hierarchical contrastive learning mechanism, which can unify hybrid granularities semantic meaning in the input text.
Experiments demonstrate that our model outperforms competitive baselines on paraphrasing, dialogue generation, and storytelling tasks.
arXiv Detail & Related papers (2022-05-26T13:26:03Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - LexSubCon: Integrating Knowledge from Lexical Resources into Contextual
Embeddings for Lexical Substitution [76.615287796753]
We introduce LexSubCon, an end-to-end lexical substitution framework based on contextual embedding models.
This is achieved by combining contextual information with knowledge from structured lexical resources.
Our experiments show that LexSubCon outperforms previous state-of-the-art methods on LS07 and CoInCo benchmark datasets.
arXiv Detail & Related papers (2021-07-11T21:25:56Z) - Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as -- or better -- than traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z) - Data Expansion using Back Translation and Paraphrasing for Hate Speech
Detection [1.192436948211501]
We present a new deep learning-based method that fuses a Back Translation method, and a Paraphrasing technique for data augmentation.
We evaluate our proposal on five publicly available datasets; namely, AskFm corpus, Formspring dataset, Warner and Waseem dataset, Olid, and Wikipedia toxic comments dataset.
arXiv Detail & Related papers (2021-05-25T09:52:42Z) - Unsupervised Key-phrase Extraction and Clustering for Classification
Scheme in Scientific Publications [0.0]
We investigate possible ways of automating parts of the Systematic Mapping (SM) and Systematic Review (SR) process.
Key-phrases are extracted from scientific documents using unsupervised methods, which are then used to construct the corresponding Classification Scheme.
We also explore how clustering can be used to group related key-phrases.
arXiv Detail & Related papers (2021-01-25T10:17:33Z) - Named Entity Recognition for Social Media Texts with Semantic
Augmentation [70.44281443975554]
Existing approaches for named entity recognition suffer from data sparsity problems when conducted on short and informal texts.
We propose a neural-based approach to NER for social media texts where both local (from running text) and augmented semantics are taken into account.
arXiv Detail & Related papers (2020-10-29T10:06:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.