SAP-sLDA: An Interpretable Interface for Exploring Unstructured Text
- URL: http://arxiv.org/abs/2308.01420v1
- Date: Fri, 28 Jul 2023 05:43:39 GMT
- Title: SAP-sLDA: An Interpretable Interface for Exploring Unstructured Text
- Authors: Charumathi Badrinath, Weiwei Pan, Finale Doshi-Velez
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A common way to explore text corpora is through low-dimensional projections
of the documents, where one hopes that thematically similar documents will be
clustered together in the projected space. However, popular algorithms for
dimensionality reduction of text corpora, like Latent Dirichlet Allocation
(LDA), often produce projections that do not capture human notions of document
similarity. We propose a semi-supervised human-in-the-loop LDA-based method for
learning topics that preserve semantically meaningful relationships between
documents in low-dimensional projections. On synthetic corpora, our method
yields more interpretable projections than baseline methods with only a
fraction of labels provided. On a real corpus, we obtain qualitatively similar
results.
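As a concrete illustration of the kind of baseline the paper improves on, here is a minimal sketch (not the authors' code) of projecting a corpus into topic space with vanilla, unsupervised LDA via scikit-learn; the toy corpus and parameters are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: two animal documents and two finance documents.
docs = [
    "the cat sat on the mat with the dog",
    "dogs and cats are popular pets",
    "stocks fell as markets slid lower",
    "investors traded shares on the stock market",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row is a document's topic-proportion vector; this low-dimensional
# representation is what gets plotted and, ideally, clusters similar documents.
proj = lda.fit_transform(counts)
print(proj.shape)  # (4, 2); each row sums to 1
```

Whether thematically similar documents actually land near each other in this projected space is exactly the failure mode that the paper's semi-supervised, human-provided labels are meant to correct.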
Related papers
- Mining Asymmetric Intertextuality
Asymmetric intertextuality refers to one-sided relationships between texts.
We propose a scalable and adaptive approach for mining asymmetric intertextuality.
Our system handles intertextuality at various levels, from direct quotations to paraphrasing and cross-document influence.
arXiv Detail & Related papers (2024-10-19T16:12:22Z)
- Contextual Document Embeddings
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- Hypergraph based Understanding for Document Semantic Entity Recognition
We build a novel hypergraph attention document semantic entity recognition framework, HGA, which uses hypergraph attention to focus on entity boundaries and entity categories at the same time.
Our results on FUNSD, CORD, XFUNDIE show that our method can effectively improve the performance of semantic entity recognition tasks.
arXiv Detail & Related papers (2024-07-09T14:35:49Z)
- Description-Based Text Similarity
We identify the need to search for texts based on abstract descriptions of their content.
We propose an alternative model that significantly improves retrieval performance when used in standard nearest-neighbor search.
arXiv Detail & Related papers (2023-05-21T17:14:31Z)
- Specialized Document Embeddings for Aspect-based Similarity of Research Papers
We treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces.
We represent a document not as a single generic embedding but as multiple specialized embeddings.
Our approach mitigates potential risks arising from implicit biases by making them explicit.
arXiv Detail & Related papers (2022-03-28T07:35:26Z)
- Contextualized Semantic Distance between Highly Overlapped Texts
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper addresses the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions.
Experiments on Semantic Textual Similarity show the resulting neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- Author Clustering and Topic Estimation for Short Texts
We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as -- or better than -- traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- Efficient Clustering from Distributions over Topics
We present an approach that uses the output of a topic modeling algorithm over a document collection to identify smaller subsets of documents within which the similarity function can be computed.
This approach has yielded promising results in identifying similar documents in the domain of scientific publications.
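The subsetting idea above can be sketched as follows; this is an illustrative reconstruction, not the paper's implementation, with hypothetical topic distributions and a hand-rolled Jensen-Shannon distance:

```python
import numpy as np

def js_distance(p, q):
    # Jensen-Shannon distance between two topic distributions (base-2 logs).
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Hypothetical per-document topic distributions (rows sum to 1),
# e.g. as produced by a topic model on four documents with three topics.
theta = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
])

# Group documents by dominant topic and compare only within each group,
# avoiding an all-pairs similarity computation over the whole collection.
dominant = theta.argmax(axis=1)
for topic in np.unique(dominant):
    idx = np.where(dominant == topic)[0]
    for i in idx:
        for j in idx:
            if i < j:
                print(topic, i, j, round(js_distance(theta[i], theta[j]), 3))
```

The pairwise comparisons shrink from all document pairs to pairs within each dominant-topic group, which is the efficiency gain the summary describes.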
arXiv Detail & Related papers (2020-12-15T10:52:19Z)
- A Topological Method for Comparing Document Semantics
We propose a novel algorithm for comparing semantic similarity between two documents.
Our experiments are conducted on a document dataset with human judges' results.
Our algorithm produces highly human-consistent results and outperforms most state-of-the-art methods, though it ties with NLTK.
arXiv Detail & Related papers (2020-12-08T04:21:40Z)
- A Comparative Study on Structural and Semantic Properties of Sentence Embeddings
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
- Document Network Projection in Pretrained Word Embedding Space
We present Regularized Linear Embedding (RLE), a novel method that projects a collection of linked documents into a pretrained word embedding space.
We leverage a matrix of pairwise similarities that provides complementary information (e.g., the network proximity of two documents in a citation graph).
The document representations can help to solve many information retrieval tasks, such as recommendation, classification and clustering.
arXiv Detail & Related papers (2020-01-16T10:16:37Z)
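A stripped-down version of the projection idea above (omitting RLE's similarity-matrix regularization) simply places each document at the centroid of its word vectors in a pretrained embedding space; the tiny embedding table below is invented for illustration:

```python
import numpy as np

# Toy "pretrained" word embeddings (in practice, e.g., GloVe or word2vec vectors).
emb = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.9, 0.1]),
    "stock": np.array([0.0, 1.0]),
    "market": np.array([0.1, 0.9]),
}

def embed_doc(tokens):
    # Project a document into the word embedding space as the mean of its
    # known word vectors; RLE additionally incorporates pairwise similarities
    # such as citation-graph proximity, which this sketch omits.
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

d_animals = embed_doc(["cat", "dog"])
d_finance = embed_doc(["stock", "market"])
print(d_animals, d_finance)  # thematically close documents get nearby centroids
```

Retrieval tasks such as recommendation, classification, and clustering can then operate on these document vectors with ordinary vector-space similarity.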
This list is automatically generated from the titles and abstracts of the papers in this site.