Multilayer Networks for Text Analysis with Multiple Data Types
- URL: http://arxiv.org/abs/2106.15821v1
- Date: Wed, 30 Jun 2021 05:47:39 GMT
- Title: Multilayer Networks for Text Analysis with Multiple Data Types
- Authors: Charles C. Hyland, Yuanming Tao, Lamiae Azizi, Martin Gerlach, Tiago P. Peixoto, and Eduardo G. Altmann
- Abstract summary: We propose a novel framework based on Multilayer Networks and Block Models.
We show that taking into account multiple types of information provides a more nuanced view on topic- and document-clusters.
- Score: 0.21108097398435335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We are interested in the widespread problem of clustering documents and
finding topics in large collections of written documents in the presence of
metadata and hyperlinks. To tackle the challenge of accounting for these
different types of datasets, we propose a novel framework based on Multilayer
Networks and Stochastic Block Models. The main innovation of our approach over
other techniques is that it applies the same non-parametric probabilistic
framework to the different sources of datasets simultaneously. The key
difference to other multilayer complex networks is the strong unbalance between
the layers, with the average degree of different node types scaling differently
with system size. We show that the latter observation is due to generic
properties of text, such as Heaps' law, and strongly affects the inference of
communities. We present and discuss the performance of our method in different
datasets (hundreds of Wikipedia documents, thousands of scientific papers, and
thousands of E-mails) showing that taking into account multiple types of
information provides a more nuanced view on topic- and document-clusters and
increases the ability to predict missing links.
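The abstract's remark that average degrees of different node types scale differently with system size can be illustrated with a small simulation. The sketch below is not the authors' method. It assumes a synthetic corpus of fixed-length documents whose words are drawn from a Zipf-like distribution, and it treats each word token as one edge in a document-word network. The function name `degree_stats` and all parameter values are hypothetical. Under these assumptions the vocabulary grows sublinearly with the number of tokens (Heaps' law), so the average word degree keeps growing while the average document degree stays fixed.

```python
import random


def degree_stats(num_docs, doc_length=200, zipf_exponent=1.2,
                 vocab_pool=20_000, seed=0):
    """Sample a synthetic corpus and return average degrees in the
    document-word network (one edge per word token; an assumption)."""
    rng = random.Random(seed)
    # Zipf-like probabilities over a finite pool of candidate word ids.
    ranks = range(1, vocab_pool + 1)
    weights = [r ** -zipf_exponent for r in ranks]
    total_tokens = num_docs * doc_length
    tokens = rng.choices(ranks, weights=weights, k=total_tokens)
    vocab_size = len(set(tokens))  # number of distinct words observed
    return {
        "tokens": total_tokens,
        "vocab": vocab_size,
        # Every document has doc_length tokens, so this stays constant.
        "avg_doc_degree": total_tokens / num_docs,
        # Tokens per distinct word; grows when the vocabulary is sublinear.
        "avg_word_degree": total_tokens / vocab_size,
    }


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, degree_stats(n))
```

Running the loop for increasing corpus sizes shows the two node types drifting apart: documents keep the same degree while words become ever more connected, which is the kind of imbalance between node types that the abstract attributes to generic properties of text.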
Related papers
- Flexible inference in heterogeneous and attributed multilayer networks [21.349513661012498]
We develop a probabilistic generative model to perform inference in multilayer networks with arbitrary types of information.
We demonstrate its ability to unveil a variety of patterns in a social support network among villagers in rural India.
arXiv Detail & Related papers (2024-05-31T15:21:59Z)
- Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents [31.434507306952458]
We propose KNN-former, which incorporates a new kind of spatial bias in attention calculation based on the K-nearest-neighbor (KNN) graph of document entities.
We also use combinatorial matching to address the one-to-one mapping property that exists in many documents.
Our method is highly efficient compared to existing approaches in terms of the number of trainable parameters.
arXiv Detail & Related papers (2024-05-08T10:10:38Z)
- Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction [61.998789448260005]
We propose to identify the typical structure of documents within a collection.
We abstract over arbitrary header paraphrases, and ground each topic to respective document locations.
We develop an unsupervised graph-based method which leverages both inter- and intra-document similarities.
arXiv Detail & Related papers (2024-02-21T16:22:21Z)
- ProSiT! Latent Variable Discovery with PROgressive SImilarity Thresholds [35.09631990817093]
ProSiT is a deterministic and interpretable method that finds the optimal number of latent dimensions.
In most settings, ProSiT matches or outperforms the other methods in terms of topic coherence and distinctiveness.
arXiv Detail & Related papers (2022-10-26T14:52:44Z)
- Large-Scale Multi-Document Summarization with Information Extraction and Compression [31.601707033466766]
We develop an abstractive summarization framework independent of labeled data for multiple heterogeneous documents.
Our framework processes documents telling different stories instead of documents on the same topic.
Our experiments demonstrate that our framework outperforms current state-of-the-art methods in this more generic setting.
arXiv Detail & Related papers (2022-05-01T19:49:15Z)
- Sawtooth Factorial Topic Embeddings Guided Gamma Belief Network [49.458250193768826]
We propose sawtooth factorial topic embedding guided GBN, a deep generative model of documents.
Both the words and topics are represented as embedding vectors of the same dimension.
Our models outperform other neural topic models on extracting deeper interpretable topics.
arXiv Detail & Related papers (2021-06-30T10:14:57Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos [69.61522804742427]
This paper proposes a self-supervised training framework that learns a common multimodal embedding space.
We extend the concept of instance-level contrastive learning with a multimodal clustering step to capture semantic similarities across modalities.
The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains.
arXiv Detail & Related papers (2021-04-26T15:55:01Z)
- Clustering multilayer graphs with missing nodes [4.007017852999008]
Clustering is a fundamental problem in network analysis where the goal is to regroup nodes with similar connectivity profiles.
We propose a new framework that allows for layers to be defined on different sets of nodes.
arXiv Detail & Related papers (2021-03-04T18:56:59Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- A Multi-Semantic Metapath Model for Large Scale Heterogeneous Network Representation Learning [52.83948119677194]
We propose a multi-semantic metapath (MSM) model for large scale heterogeneous representation learning.
Specifically, we generate multi-semantic metapath-based random walks to construct the heterogeneous neighborhood to handle the unbalanced distributions.
We conduct systematic evaluations of the proposed framework on two challenging datasets: Amazon and Alibaba.
arXiv Detail & Related papers (2020-07-19T22:50:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.