Visual Exploration of Stopword Probabilities in Topic Models
- URL: http://arxiv.org/abs/2501.10137v1
- Date: Fri, 17 Jan 2025 11:59:56 GMT
- Title: Visual Exploration of Stopword Probabilities in Topic Models
- Authors: Shuangjiang Xue, Pierre Le Bras, David A. Robb, Mike J. Chantler, Stefano Padilla
- Abstract summary: Stopword removal is a critical stage in many Machine Learning methods.
Inappropriately chosen or hastily omitted stopwords not only lead to suboptimal performance but also significantly affect the quality of models.
This paper proposes a novel extraction method that provides a corpus-specific probabilistic estimation of stopword likelihood.
- Score: 1.9107347888374506
- License:
- Abstract: Stopword removal is a critical stage in many Machine Learning methods but often receives little consideration; handled carelessly, it interferes with model visualizations and disrupts user confidence. Inappropriately chosen or hastily omitted stopwords not only lead to suboptimal performance but also significantly affect the quality of models, thus reducing the willingness of practitioners and stakeholders to rely on the output visualizations. This paper proposes a novel extraction method that provides a corpus-specific probabilistic estimation of stopword likelihood and an interactive visualization system to support their analysis. We evaluated our approach and interface using real-world data, a commonly used Machine Learning method (Topic Modelling), and a comprehensive qualitative experiment probing user confidence. The results show that our system increases user confidence in the credibility of topic models by (1) returning reasonable probabilities, (2) generating an appropriate and representative extension of common stopword lists, and (3) providing an adjustable threshold for estimating and analyzing stopwords visually. Finally, we discuss insights, recommendations, and best practices to support practitioners while improving the output of Machine Learning methods and topic model visualizations with robust stopword analysis and removal.
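The abstract does not spell out the estimator, but the core idea (a corpus-specific stopword-likelihood score per term, filtered by an adjustable threshold) can be illustrated with a minimal sketch. The heuristic below, which combines a term's document-frequency ratio with the evenness (normalized entropy) of its spread across documents, is an assumption for illustration only, not the paper's actual method; the function names `stopword_probabilities` and `extend_stopword_list` are hypothetical.

```python
# Minimal sketch of a corpus-specific stopword-likelihood heuristic.
# NOTE: an illustrative assumption, not the estimator from the paper.
# A term scores higher when it appears in many documents AND is spread
# evenly across them -- two hallmarks of stopwords.

import math
from collections import Counter

def stopword_probabilities(docs: list[list[str]]) -> dict[str, float]:
    """Return a score in [0, 1] per term, interpretable as stopword likelihood."""
    n_docs = len(docs)
    doc_freq: Counter[str] = Counter()          # number of documents containing the term
    term_doc_counts: dict[str, list[int]] = {}  # per-document occurrence counts

    for doc in docs:
        counts = Counter(doc)
        for term, c in counts.items():
            doc_freq[term] += 1
            term_doc_counts.setdefault(term, []).append(c)

    scores = {}
    for term, counts in term_doc_counts.items():
        df = doc_freq[term] / n_docs                      # document-frequency ratio
        total = sum(counts)
        probs = [c / total for c in counts]
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        max_entropy = math.log(n_docs) if n_docs > 1 else 1.0
        evenness = entropy / max_entropy                  # 1.0 = perfectly even spread
        scores[term] = df * evenness                      # both high => stopword-like
    return scores

def extend_stopword_list(scores: dict[str, float],
                         base_list: set[str],
                         threshold: float = 0.5) -> set[str]:
    """Mimic the adjustable threshold: add high-scoring terms to a base list."""
    return base_list | {t for t, s in scores.items() if s >= threshold}
```

Lowering `threshold` extends the base list more aggressively, mirroring the adjustable threshold the paper exposes for visual stopword analysis.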
Related papers
- Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage.
Models may behave unreliably due to poorly explored failure modes.
Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
- Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening.
Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training.
We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z)
- Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z)
- On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [49.60774626839712]
Multimodal generative models have sparked critical discussions on their fairness, reliability, and potential for misuse.
We propose an evaluation framework designed to assess model reliability through their responses to perturbations in the embedding space.
Our method lays the groundwork for detecting unreliable, bias-injected models and retrieval of bias provenance.
arXiv Detail & Related papers (2024-11-21T09:46:55Z)
- Towards Unifying Interpretability and Control: Evaluation via Intervention [25.4582941170387]
We argue that intervention is a fundamental goal of interpretability and introduce success criteria to evaluate how well methods can control model behavior through interventions.
We extend four popular interpretability methods (sparse autoencoders, logit lens, tuned lens, and probing) into an abstract encoder-decoder framework.
We introduce two new evaluation metrics: intervention success rate and coherence-intervention tradeoff, designed to measure the accuracy of explanations and their utility in controlling model behavior.
arXiv Detail & Related papers (2024-11-07T04:52:18Z)
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
- Intuitively Assessing ML Model Reliability through Example-Based Explanations and Editing Model Inputs [19.09848738521126]
Interpretability methods aim to help users build trust in and understand the capabilities of machine learning models.
We present two interface modules to facilitate a more intuitive assessment of model reliability.
arXiv Detail & Related papers (2021-02-17T02:41:32Z)
- Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
We propose PRoFILE, a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z)
- A Semiparametric Approach to Interpretable Machine Learning [9.87381939016363]
Black box models in machine learning have demonstrated excellent predictive performance in complex problems and high-dimensional settings.
Their lack of transparency and interpretability restricts the applicability of such models in critical decision-making processes.
We propose a novel approach to trading off interpretability and performance in prediction models using ideas from semiparametric statistics.
arXiv Detail & Related papers (2020-06-08T16:38:15Z)
- Calibrating Healthcare AI: Towards Reliable and Interpretable Deep Predictive Models [41.58945927669956]
We argue that reliability and interpretability are not necessarily disparate objectives and propose to utilize prediction calibration to meet both.
Our approach comprises a calibration-driven learning method, which is also used to design an interpretability technique based on counterfactual reasoning.
arXiv Detail & Related papers (2020-04-27T22:15:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.