Related papers: CommunityFish: A Poisson-based Document Scaling With Hierarchical Clustering

CommunityFish: A Poisson-based Document Scaling With Hierarchical Clustering

URL: http://arxiv.org/abs/2308.14873v1
Date: Mon, 28 Aug 2023 19:52:18 GMT
Title: CommunityFish: A Poisson-based Document Scaling With Hierarchical Clustering
Authors: Sami Diaf
Abstract summary: This paper introduces CommunityFish as an augmented version of Wordfish based on a hierarchical clustering, namely the Louvain algorithm, on the word space to yield communities as semantic and independent n-grams emerging from the corpus. This strategy emphasizes the interpretability of the results, since communities have a non-overlapping structure, hence a crucial informative power in discriminating parties or speakers, in addition to allowing a faster execution of the Poisson scaling model.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Document scaling has been a key component in text-as-data applications for social scientists and a major field of interest for political researchers, who aim at uncovering differences between speakers or parties with the help of different probabilistic and non-probabilistic approaches. Yet, most of these techniques are either built upon the agnostically bag-of-word hypothesis or use prior information borrowed from external sources that might embed the results with a significant bias. If the corpus has long been considered as a collection of documents, it can also be seen as a dense network of connected words whose structure could be clustered to differentiate independent groups of words, based on their co-occurrences in documents, known as communities. This paper introduces CommunityFish as an augmented version of Wordfish based on a hierarchical clustering, namely the Louvain algorithm, on the word space to yield communities as semantic and independent n-grams emerging from the corpus and use them as an input to Wordfish method, instead of considering the word space. This strategy emphasizes the interpretability of the results, since communities have a non-overlapping structure, hence a crucial informative power in discriminating parties or speakers, in addition to allowing a faster execution of the Poisson scaling model. Aside from yielding communities, assumed to be subtopic proxies, the application of this technique outperforms the classic Wordfish model by highlighting historical developments in the U.S. State of the Union addresses and was found to replicate the prevailing political stance in Germany when using the corpus of parties' legislative manifestos.

Related papers

Deep networks learn to parse uniform-depth context-free languages from local statistics [12.183764229746926]
Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning.<n>We introduce a class of context-free grammars (PCFGs) in which both the degree of ambiguity and the correlation structure across scales can be controlled.<n>We propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.
arXiv Detail & Related papers (2026-01-31T17:35:06Z)
Verified authors shape X/Twitter discursive communities [0.24999074238880484]
We show that the core of ideological/discursive communities on X/Twitter can be effectively identified by uncovering the most informative interactions. The analysis is performed considering three X/Twitter datasets related to the main political events of 2022 in Italy.
arXiv Detail & Related papers (2024-05-08T09:04:46Z)
Tracing the Genealogies of Ideas with Large Language Model Embeddings [0.0]
I present a novel method to detect intellectual influence across a large corpus. I apply this method to vectorize sentences from a corpus of roughly 400,000 nonfiction books and academic publications from the 19th century.
arXiv Detail & Related papers (2024-01-13T18:42:27Z)
Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases. Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding. This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
Optimizing text representations to capture (dis)similarity between political parties [1.2891210250935146]
We look at the problem of modeling pairwise similarities between political parties. Our research question is what level of structural information is necessary to create robust text representation. We evaluate our models on the manifestos of German parties for the 2021 federal election.
arXiv Detail & Related papers (2022-10-21T14:24:57Z)
Keywords and Instances: A Hierarchical Contrastive Learning Framework Unifying Hybrid Granularities for Text Generation [59.01297461453444]
We propose a hierarchical contrastive learning mechanism, which can unify hybrid granularities semantic meaning in the input text. Experiments demonstrate that our model outperforms competitive baselines on paraphrasing, dialogue generation, and storytelling tasks.
arXiv Detail & Related papers (2022-05-26T13:26:03Z)
Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation. This paper aims to address the issue with a mask-and-predict strategy. We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions. Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
LexSubCon: Integrating Knowledge from Lexical Resources into Contextual Embeddings for Lexical Substitution [76.615287796753]
We introduce LexSubCon, an end-to-end lexical substitution framework based on contextual embedding models. This is achieved by combining contextual information with knowledge from structured lexical resources. Our experiments show that LexSubCon outperforms previous state-of-the-art methods on LS07 and CoInCo benchmark datasets.
arXiv Detail & Related papers (2021-07-11T21:25:56Z)
Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document. We also simultaneously cluster users, removing the need for post-hoc cluster estimation. Our method performs as well as -- or better -- than traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
Data Expansion using Back Translation and Paraphrasing for Hate Speech Detection [1.192436948211501]
We present a new deep learning-based method that fuses a Back Translation method, and a Paraphrasing technique for data augmentation. We evaluate our proposal on five publicly available datasets; namely, AskFm corpus, Formspring dataset, Warner and Waseem dataset, Olid, and Wikipedia toxic comments dataset.
arXiv Detail & Related papers (2021-05-25T09:52:42Z)
Unsupervised Key-phrase Extraction and Clustering for Classification Scheme in Scientific Publications [0.0]
We investigate possible ways of automating parts of the Systematic Mapping (SM) and Systematic Review (SR) process. Key-phrases are extracted from scientific documents using unsupervised methods, which are then used to construct the corresponding Classification Scheme. We also explore how clustering can be used to group related key-phrases.
arXiv Detail & Related papers (2021-01-25T10:17:33Z)
Named Entity Recognition for Social Media Texts with Semantic Augmentation [70.44281443975554]
Existing approaches for named entity recognition suffer from data sparsity problems when conducted on short and informal texts. We propose a neural-based approach to NER for social media texts where both local (from running text) and augmented semantics are taken into account.
arXiv Detail & Related papers (2020-10-29T10:06:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.