WMDecompose: A Framework for Leveraging the Interpretable Properties of
Word Mover's Distance in Sociocultural Analysis
- URL: http://arxiv.org/abs/2110.07330v1
- Date: Thu, 14 Oct 2021 13:04:38 GMT
- Title: WMDecompose: A Framework for Leveraging the Interpretable Properties of
Word Mover's Distance in Sociocultural Analysis
- Authors: Mikael Brunila and Jack LaViolette
- Abstract summary: One popular model that balances complexity and legibility is Word Mover's Distance (WMD).
We introduce WMDecompose: a model and Python library that decomposes document-level distances into their constituent word-level distances, and subsequently clusters words to induce thematic elements.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the increasing popularity of NLP in the humanities and social
sciences, advances in model performance and complexity have been accompanied by
concerns about interpretability and explanatory power for sociocultural
analysis. One popular model that balances complexity and legibility is Word
Mover's Distance (WMD). Ostensibly adopted for its interpretability, WMD has
nonetheless been used and further developed in ways which frequently discard
its most interpretable aspect: namely, the word-level distances required for
translating a set of words into another set of words. To address this apparent
gap, we introduce WMDecompose: a model and Python library that 1) decomposes
document-level distances into their constituent word-level distances, and 2)
subsequently clusters words to induce thematic elements, such that useful
lexical information is retained and summarized for analysis. To illustrate its
potential in a social scientific context, we apply it to a longitudinal social
media corpus to explore the interrelationship between conspiracy theories and
conservative American discourses. Finally, because of the full WMD model's high
time-complexity, we additionally suggest a method of sampling document pairs
from large datasets in a reproducible way, with tight bounds that prevent
extrapolation of unreliable results due to poor sampling practices.
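To make the decomposition step concrete, here is a minimal sketch of word-level WMD decomposition built on the POT optimal transport library. It illustrates the idea only and is not the WMDecompose API; the `embed` lookup and the uniform word weights are assumptions made for brevity.

```python
# Minimal sketch of word-level WMD decomposition, in the spirit of
# WMDecompose. Not the library's actual API: `embed` is a hypothetical
# word -> vector mapping, and word weights are uniform for simplicity.
import numpy as np
import ot  # POT: Python Optimal Transport (pip install POT)

def wmd_decomposed(src_words, tgt_words, embed):
    """Total WMD between two documents plus each source word's share of it."""
    a = np.full(len(src_words), 1.0 / len(src_words))  # uniform nBOW weights
    b = np.full(len(tgt_words), 1.0 / len(tgt_words))
    X = np.array([embed[w] for w in src_words])
    Y = np.array([embed[w] for w in tgt_words])
    # Cost matrix: Euclidean distance between every pair of word embeddings.
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    # Optimal transport plan: how much of each source word "travels" where.
    T = ot.emd(a, b, C)
    # The document-level distance is the sum of word-level flow * cost terms;
    # keeping the per-word terms is what preserves interpretability.
    word_costs = (T * C).sum(axis=1)
    return word_costs.sum(), dict(zip(src_words, word_costs))
```

The per-word costs retained here are exactly what standard WMD usage discards by reporting only their sum; clustering the source words (for example, on their embeddings) would then yield the thematic elements described above.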
Related papers
- Paired Completion: Flexible Quantification of Issue-framing at Scale with LLMs [0.41436032949434404]
We develop and rigorously evaluate new detection methods for issue framing and narrative analysis within large text datasets.
We show that issue framing can be reliably and efficiently detected in large corpora with only a few examples of either perspective on a given issue.
arXiv Detail & Related papers (2024-08-19T07:14:15Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task.
arXiv Detail & Related papers (2023-03-30T12:24:25Z)
- Transition-based Abstract Meaning Representation Parsing with Contextual Embeddings [0.0]
We study a way of combining two of the most successful routes to the meaning of language, statistical language models and symbolic semantics formalisms, in the task of semantic parsing.
We explore the utility of incorporating pretrained context-aware word embeddings, such as BERT and RoBERTa, in the problem of parsing.
arXiv Detail & Related papers (2022-06-13T15:05:24Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show the proposed neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
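The mask-and-predict strategy can be sketched with an off-the-shelf masked language model. The following is an illustrative reconstruction, not the paper's implementation; the model choice and the KL-based comparison are assumptions.

```python
# Illustrative mask-and-predict comparison: mask a shared word in two
# contexts and compare the MLM's predicted distributions at that position.
# Model choice and the use of KL divergence are assumptions for the sketch.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def masked_distribution(sentence, target_word):
    """Predicted vocabulary distribution at the masked word's position."""
    masked = sentence.replace(target_word, tok.mask_token, 1)
    inputs = tok(masked, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits
    pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
    return torch.softmax(logits[0, pos], dim=-1)

# Divergence between two contexts' predictions for one shared word.
p = masked_distribution("The movie was great fun.", "great")
q = masked_distribution("The movie was not great at all.", "great")
kl = torch.sum(p * (p.log() - q.log()))  # one possible divergence measure
```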
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- Did the Cat Drink the Coffee? Challenging Transformers with Generalized Event Knowledge [59.22170796793179]
Transformer Language Models (TLMs) were tested on a benchmark for the dynamic estimation of thematic fit.
Our results show that TLMs can reach performance comparable to that achieved by a structured distributional model (SDM).
However, additional analysis consistently suggests that TLMs do not capture important aspects of event knowledge.
arXiv Detail & Related papers (2021-07-22T20:52:26Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- Semi-Supervised Joint Estimation of Word and Document Readability [6.34044741105807]
We propose to jointly estimate word and document difficulty through a graph convolutional network (GCN).
Our experimental results reveal that the GCN-based method can achieve higher accuracy than strong baselines, and stays robust even with a smaller amount of labeled data.
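As a rough illustration of the graph-convolutional idea (not the paper's architecture), a single GCN layer propagates signals between connected word and document nodes. The toy graph, features, and dimensions below are invented for the example.

```python
# Minimal single GCN layer (Kipf & Welling style) over a toy word-document
# graph; purely illustrative, not the paper's model.
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution: ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Toy graph: nodes 0-1 are documents, nodes 2-4 are words; an edge links
# a document to each word it contains.
A = np.array([[0, 0, 1, 1, 0],
              [0, 0, 0, 1, 1],
              [1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 1, 0, 0, 0]], dtype=float)
H = np.random.rand(5, 8)  # initial node features (e.g., word/doc signals)
W = np.random.rand(8, 4)  # layer weights (learned in a real model)
H1 = gcn_layer(A, H, W)   # difficulty-relevant information mixes across nodes
```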
arXiv Detail & Related papers (2021-04-27T10:56:47Z)
- Pareto Probing: Trading Off Accuracy for Complexity [87.09294772742737]
We argue for a probe metric that reflects the fundamental trade-off between probe complexity and performance.
Our experiments with dependency parsing reveal a wide gap in syntactic knowledge between contextual and non-contextual representations.
arXiv Detail & Related papers (2020-10-05T17:27:31Z)
- Mechanisms for Handling Nested Dependencies in Neural-Network Language Models and Humans [75.15855405318855]
We studied whether a modern artificial neural network trained with "deep learning" methods mimics a central aspect of human sentence processing.
Although the network was solely trained to predict the next word in a large corpus, analysis showed the emergence of specialized units that successfully handled local and long-distance syntactic agreement.
We tested the model's predictions in a behavioral experiment where humans detected violations in number agreement in sentences with systematic variations in the singular/plural status of multiple nouns.
arXiv Detail & Related papers (2020-06-19T12:00:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.