Topological Data Analysis in Text Classification: Extracting Features
with Additive Information
- URL: http://arxiv.org/abs/2003.13138v1
- Date: Sun, 29 Mar 2020 21:02:09 GMT
- Authors: Shafie Gholizadeh, Ketki Savle, Armin Seyeditabari and Wlodek Zadrozny
- Abstract summary: Topological Data Analysis has been explored extensively on high dimensional numeric data, but it remains challenging to apply to text.
Topological features carry some exclusive information not captured by conventional text mining methods.
Adding topological features to the conventional features in ensemble models improves the classification results.
- Score: 2.1410799064827226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the strength of Topological Data Analysis has been explored in many
studies on high dimensional numeric data, it is still a challenging task to
apply it to text. As the primary goal in topological data analysis is to define
and quantify the shapes in numeric data, defining shapes in the text is much
more challenging, even though the geometries of vector spaces and conceptual
spaces are clearly relevant for information retrieval and semantics. In this
paper, we examine two different methods of extraction of topological features
from text, using as the underlying representations of words the two most
popular methods, namely word embeddings and TF-IDF vectors. To extract
topological features from the word embedding space, we interpret the embedding
of a text document as a high dimensional time series, and we analyze the topology
of the underlying graph where the vertices correspond to different embedding
dimensions. For topological data analysis with the TF-IDF representations, we
analyze the topology of the graph whose vertices come from the TF-IDF vectors
of different blocks in the textual document. In both cases, we apply
homological persistence to reveal the geometric structures under different
distance resolutions. Our results show that these topological features carry
some exclusive information that is not captured by conventional text mining
methods. In our experiments, we observe that adding topological features to the
conventional features in ensemble models improves the classification results
(up to 5%). On the other hand, as expected, topological features by themselves
may not be sufficient for effective classification. It remains an open problem
whether TDA features from word embeddings might be sufficient, as they seem
to perform within a few points of the top results obtained with a linear
support vector classifier.
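
The word-embedding route described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the distance between two embedding dimensions is 1 - |Pearson correlation| of their time series, and that only 0-dimensional persistence (connected components) is computed. The function names `pearson`, `embedding_dimension_distances`, and `h0_death_times` are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def embedding_dimension_distances(doc_vectors):
    """Distance matrix between embedding dimensions.

    doc_vectors holds one word vector per token, in reading order (T x D).
    Each of the D dimensions is read as a time series of length T.
    Assumption: distance = 1 - |correlation|; the paper may use another measure.
    """
    series = list(zip(*doc_vectors))  # D series, each of length T
    d = len(series)
    return [
        [0.0 if i == j else 1.0 - abs(pearson(series[i], series[j])) for j in range(d)]
        for i in range(d)
    ]


def h0_death_times(dist):
    """Death times of connected components in the Vietoris-Rips filtration.

    Every vertex is born at scale 0. A component dies when it merges with
    another one, which happens exactly at the edge weights of a minimum
    spanning tree (Kruskal's algorithm with union-find). The one component
    that never dies is not reported.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    deaths = []
    edges = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    for weight, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(weight)
    return deaths


# Toy document: 4 tokens, 3 embedding dimensions (made-up numbers).
doc = [[0.0, 0.0, 1.0], [1.0, 2.0, 0.0], [2.0, 4.0, 1.0], [3.0, 6.0, 0.0]]
deaths = h0_death_times(embedding_dimension_distances(doc))
print(deaths)
```

Summary statistics of the death times (for example their mean or maximum) could then be appended to conventional features before training a classifier. Under the same assumptions, `h0_death_times` could also be applied to a distance matrix between TF-IDF vectors of text blocks, which is the second representation the abstract mentions.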
Related papers
- Topograph: An efficient Graph-Based Framework for Strictly Topology Preserving Image Segmentation [78.54656076915565]
Topological correctness plays a critical role in many image segmentation tasks.
Most networks are trained using pixel-wise loss functions, such as Dice, neglecting topological accuracy.
We propose a novel, graph-based framework for topologically accurate image segmentation.
(2024-11-05T16:20:14Z)
- Dissecting embedding method: learning higher-order structures from data [0.0]
Geometric deep learning methods for data learning often include a set of assumptions on the geometry of the feature space.
These assumptions, together with the data being discrete and finite, can cause some generalisations, which are likely to create wrong interpretations of the data and model outputs.
(2024-10-14T08:19:39Z)
- Improving embedding of graphs with missing data by soft manifolds [51.425411400683565]
The reliability of graph embeddings depends on how much the geometry of the continuous space matches the graph structure.
We introduce a new class of manifold, named soft manifold, that can solve this situation.
Using soft manifold for graph embedding, we can provide continuous spaces to pursue any task in data analysis over complex datasets.
(2023-11-29T12:48:33Z)
- On topological data analysis for structural dynamics: an introduction to persistent homology [0.0]
Topological data analysis is a method of quantifying the shape of data over a range of length scales.
Persistent homology is a method of quantifying the shape of data over a range of length scales.
(2022-09-12T10:39:38Z)
- Hierarchical Heterogeneous Graph Representation Learning for Short Text Classification [60.233529926965836]
We propose a new method called SHINE, which is based on a graph neural network (GNN), for short text classification.
First, we model the short text dataset as a hierarchical heterogeneous graph consisting of word-level component graphs.
Then, we dynamically learn a short document graph that facilitates effective label propagation among similar short texts.
(2021-10-30T05:33:05Z)
- Contrastive analysis for scatter plot-based representations of dimensionality reduction [0.0]
This paper introduces a methodology to explore multidimensional datasets and interpret clusters' formation.
We also introduce a bipartite graph to visually interpret and explore the relationship between the statistical variables used to understand how the attributes influenced cluster formation.
(2021-01-26T01:16:31Z)
- Learning the Implicit Semantic Representation on Graph-Structured Data [57.670106959061634]
Existing representation learning methods in graph convolutional networks are mainly designed by describing the neighborhood of each node as a perceptual whole.
We propose Semantic Graph Convolutional Networks (SGCN), which explore the implicit semantics by learning latent semantic-paths in graphs.
arXiv Detail & Related papers (2021-01-16T16:18:43Z) - Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach for the automatic extraction of domain-specific words called the hyperplane-based approach.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z) - Argumentative Topology: Finding Loop(holes) in Logic [3.977669302067367]
Topological Word Embeddings uses mathematical techniques in dynamical system analysis and data driven shape extraction.
We show that using a topological delay embedding we are able to capture and extract a different, shape-based notion of logic.
arXiv Detail & Related papers (2020-11-17T21:23:58Z) - A Comparative Study on Structural and Semantic Properties of Sentence
Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z) - A Novel Method of Extracting Topological Features from Word Embeddings [2.4063592468412267]
We introduce a novel algorithm to extract topological features from the word embedding representation of text.
We show that the topological features we define may outperform conventional text mining features.
(2020-03-29T16:55:23Z)