Efficient Clustering from Distributions over Topics
- URL: http://arxiv.org/abs/2012.08206v1
- Date: Tue, 15 Dec 2020 10:52:19 GMT
- Title: Efficient Clustering from Distributions over Topics
- Authors: Carlos Badenes-Olmedo, Jose-Luis Redondo García, Oscar Corcho
- Abstract summary: We present an approach that relies on the results of a topic modeling algorithm over documents in a collection as a means to identify smaller subsets of documents where the similarity function can be computed.
This approach has proved to obtain promising results when identifying similar documents in the domain of scientific publications.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There are many scenarios where we may want to find pairs of textually similar
documents in a large corpus (e.g. a researcher doing literature review, or an
R&D project manager analyzing project proposals). Programmatically discovering
those connections can help experts achieve those goals, but brute-force
pairwise comparisons are computationally infeasible when the document corpus
is very large. Some algorithms in the literature divide the
search space into regions containing potentially similar documents, which are
later processed separately from the rest in order to reduce the number of pairs
compared. However, such unsupervised methods still incur high
time costs. In this paper, we present an approach that relies on the
results of a topic modeling algorithm over the documents in a collection, as a
means to identify smaller subsets of documents where the similarity function
can then be computed. This approach has proved to obtain promising results when
identifying similar documents in the domain of scientific publications. We have
compared our approach against state-of-the-art clustering techniques and with
different configurations for the topic modeling algorithm. Results suggest that
our approach outperforms (> 0.5) the other analyzed techniques in terms of
efficiency.
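
The idea in the abstract, using topic-model output to restrict where the similarity function is computed, can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify how subsets are formed, so the sketch assumes a simple rule (group documents by their top-`k` most probable topics) and assumes Jensen-Shannon distance as the similarity function. The names `top_topics`, `candidate_pairs`, `similar_pairs`, the threshold, and the toy `doc_topics` data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import log2, sqrt


def top_topics(theta, k=1):
    """Bucket key: indices of the k most probable topics, as a sorted tuple."""
    ranked = sorted(range(len(theta)), key=lambda t: theta[t], reverse=True)
    return tuple(sorted(ranked[:k]))


def js_similarity(p, q):
    """1 - Jensen-Shannon distance (base 2); close to 1 for similar distributions."""
    def kl(a, b):
        # Terms with a[i] == 0 contribute nothing to the KL divergence.
        return sum(x * log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return 1.0 - sqrt(min(max(jsd, 0.0), 1.0))


def candidate_pairs(doc_topics, k=1):
    """Yield only pairs of documents that fall into the same topic bucket."""
    buckets = defaultdict(list)
    for doc_id, theta in doc_topics.items():
        buckets[top_topics(theta, k)].append(doc_id)
    for members in buckets.values():
        yield from combinations(sorted(members), 2)


def similar_pairs(doc_topics, k=1, threshold=0.8):
    """Compute similarity only inside buckets and keep pairs above the threshold."""
    result = []
    for a, b in candidate_pairs(doc_topics, k):
        score = js_similarity(doc_topics[a], doc_topics[b])
        if score >= threshold:
            result.append((a, b, score))
    return result


# Toy document-topic distributions (made up for illustration, 3 topics).
doc_topics = {
    "d1": [0.7, 0.2, 0.1],
    "d2": [0.6, 0.3, 0.1],
    "d3": [0.1, 0.1, 0.8],
    "d4": [0.2, 0.1, 0.7],
    "d5": [0.1, 0.8, 0.1],
}

n = len(doc_topics)
brute_force = n * (n - 1) // 2               # all pairs: 10
restricted = list(candidate_pairs(doc_topics))  # same-bucket pairs only: 2
print(brute_force, len(restricted))
print(similar_pairs(doc_topics))
```

With five documents, brute force compares 10 pairs, while the bucketed version compares only the 2 pairs that share a dominant topic. The trade-off in any such scheme is recall: similar documents that land in different buckets are never compared, which is why the choice of `k` (or of a richer grouping rule) matters.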
Related papers
- Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$^2$).
GR$^2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training.
Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$^2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z)
- SparseCL: Sparse Contrastive Learning for Contradiction Retrieval [87.02936971689817]
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query.
Existing methods such as similarity search and cross-encoder models exhibit significant limitations.
We introduce SparseCL that leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences.
arXiv Detail & Related papers (2024-06-15T21:57:03Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Relation-aware Ensemble Learning for Knowledge Graph Embedding [68.94900786314666]
We propose to learn an ensemble by leveraging existing methods in a relation-aware manner.
Exploring these semantics using relation-aware ensemble leads to a much larger search space than general ensemble methods.
We propose a divide-search-combine algorithm RelEns-DSC that searches the relation-wise ensemble weights independently.
arXiv Detail & Related papers (2023-10-13T07:40:12Z)
- SAP-sLDA: An Interpretable Interface for Exploring Unstructured Text [28.36260646471421]
We propose a semi-supervised human-in-the-loop LDA-based method for learning topics that preserve semantically meaningful relationships between documents in low-dimensional projections.
On synthetic corpora, our method yields more interpretable projections than baseline methods with only a fraction of labels provided.
arXiv Detail & Related papers (2023-07-28T05:43:39Z)
- A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z)
- Document Provenance and Authentication through Authorship Classification [5.2545206693029884]
We propose an ensemble-based text-processing framework for the classification of single and multi-authored documents.
The proposed framework incorporates several state-of-the-art text classification algorithms.
The framework is evaluated on a large-scale benchmark dataset.
arXiv Detail & Related papers (2023-03-02T12:26:03Z)
- Contextualization for the Organization of Text Documents Streams [0.0]
We present several experiments with some stream analysis methods to explore streams of text documents.
We use only dynamic algorithms to explore, analyze, and organize the flux of text documents.
arXiv Detail & Related papers (2022-05-30T22:25:40Z)
- Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity [11.157086694203201]
We present a new scientific document similarity model based on matching fine-grained aspects.
Our model is trained using co-citation contexts that describe related paper aspects as a novel form of textual supervision.
arXiv Detail & Related papers (2021-11-16T11:12:30Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- A Topological Method for Comparing Document Semantics [0.0]
We propose a novel algorithm for comparing semantic similarity between two documents.
Our experiments are conducted on a document dataset with human judges' results.
Our algorithm can produce highly human-consistent results, and also beats most state-of-the-art methods, although it ties with NLTK.
arXiv Detail & Related papers (2020-12-08T04:21:40Z)