Unsupervised Sentence-embeddings by Manifold Approximation and
Projection
- URL: http://arxiv.org/abs/2102.03795v1
- Date: Sun, 7 Feb 2021 13:27:58 GMT
- Title: Unsupervised Sentence-embeddings by Manifold Approximation and
Projection
- Authors: Subhradeep Kayal
- Abstract summary: We propose a novel technique to generate sentence-embeddings in an unsupervised fashion by projecting the sentences onto a fixed-dimensional manifold.
We test our approach, which we term EMAP or Embeddings by Manifold Approximation and Projection, on six publicly available text-classification datasets of varying size and complexity.
- Score: 3.04585143845864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The concept of unsupervised universal sentence encoders has gained traction
recently, wherein pre-trained models generate effective task-agnostic
fixed-dimensional representations for phrases, sentences and paragraphs. Such
methods are of varying complexity, from simple weighted-averages of word
vectors to complex language-models based on bidirectional transformers. In this
work we propose a novel technique to generate sentence-embeddings in an
unsupervised fashion by projecting the sentences onto a fixed-dimensional
manifold with the objective of preserving local neighbourhoods in the original
space. To delineate such neighbourhoods we experiment with several set-distance
metrics, including the recently proposed Word Mover's distance, while the
fixed-dimensional projection is achieved by employing a scalable and efficient
manifold approximation method rooted in topological data analysis. We test our
approach, which we term EMAP or Embeddings by Manifold Approximation and
Projection, on six publicly available text-classification datasets of varying
size and complexity. Empirical results show that our method consistently
performs similar to or better than several alternative state-of-the-art
approaches.
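The pipeline described in the abstract — a set-distance between sentences (viewed as sets of word vectors) followed by a manifold projection from the precomputed distances — can be sketched as follows. This is an illustrative toy, not the authors' implementation: the word vectors are random stand-ins, the set distance is a cheap relaxed nearest-neighbour variant of Word Mover's distance, and scikit-learn's MDS stands in for the scalable manifold-approximation step (the paper uses a method from topological data analysis).

```python
# Toy sketch of the EMAP pipeline (not the authors' implementation).
# Random vectors stand in for pre-trained word embeddings; a relaxed
# nearest-neighbour distance stands in for Word Mover's distance; and
# sklearn's MDS stands in for the manifold-approximation projection.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
words = "the cat sat on mat a dog ran in park stocks fell sharply today".split()
vocab = {w: rng.normal(size=50) for w in words}

sentences = [
    "the cat sat on the mat",
    "a dog ran in the park",
    "stocks fell sharply today",
]

def relaxed_wmd(s1, s2):
    """Average distance from each word in s1 to its nearest word in s2,
    symmetrized -- a cheap proxy for the optimal-transport-based WMD."""
    a = np.stack([vocab[w] for w in s1.split()])
    b = np.stack([vocab[w] for w in s2.split()])
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

# Pairwise set-distance matrix between sentences.
n = len(sentences)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = relaxed_wmd(sentences[i], sentences[j])

# Project onto a fixed-dimensional space from the precomputed distances.
emb = MDS(n_components=2, dissimilarity="precomputed",
          random_state=0).fit_transform(dist)
print(emb.shape)  # (3, 2)
```

Any embedding method that accepts a precomputed dissimilarity matrix can be dropped in at the last step; the key idea is that sentence geometry is defined entirely by the chosen set-distance, never by averaging word vectors.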
Related papers
- Lines of Thought in Large Language Models [3.281128493853064]
Large Language Models achieve next-token prediction by transporting a vectorized piece of text across an accompanying embedding space.
We aim to characterize the statistical properties of ensembles of these 'lines of thought'.

We find it remarkable that the vast complexity of such large models can be reduced to a much simpler form, and we reflect on implications.
arXiv Detail & Related papers (2024-10-02T13:31:06Z)
- Fast and Scalable Semi-Supervised Learning for Multi-View Subspace Clustering [13.638434337947302]
FSSMSC addresses the high computational complexity common to existing approaches.
The method generates a consensus anchor graph across all views, representing each data point as a sparse linear combination of chosen landmarks.
The effectiveness and efficiency of FSSMSC are validated through extensive experiments on multiple benchmark datasets of varying scales.
arXiv Detail & Related papers (2024-08-11T06:54:00Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- Explaining text classifiers through progressive neighborhood approximation with realistic samples [19.26084350822197]
The importance of neighborhood construction in local explanation methods has been highlighted in the literature.
Several attempts have been made to improve neighborhood quality for high-dimensional data, for example, texts, by adopting generative models.
We propose a progressive approximation approach that refines the neighborhood of a to-be-explained decision with a careful two-stage approach.
arXiv Detail & Related papers (2023-02-11T11:42:39Z)
- Manifold Hypothesis in Data Analysis: Double Geometrically-Probabilistic Approach to Manifold Dimension Estimation [92.81218653234669]
We present a new approach to manifold-hypothesis checking and to estimating the underlying manifold dimension.
Our geometrical method is a modification, for sparse data, of the well-known box-counting algorithm for computing the Minkowski dimension.
Experiments on real datasets show that the suggested approach based on two methods combination is powerful and effective.
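The classical box-counting algorithm that this paper modifies for sparse data is simple enough to sketch: cover the point cloud with grid boxes of side eps, count the occupied boxes N(eps) at several scales, and read the dimension off the slope of log N(eps) against log(1/eps). The sketch below is the textbook version only, not the paper's sparse-data modification; the circle example and scale choices are illustrative assumptions.

```python
# Classical box-counting estimate of the Minkowski (fractal) dimension of
# a point cloud -- the textbook algorithm, not the paper's modification.
import numpy as np

rng = np.random.default_rng(1)
# Sample points on a 1-D manifold (a unit circle) embedded in 2-D.
t = rng.uniform(0, 2 * np.pi, 5000)
pts = np.column_stack([np.cos(t), np.sin(t)])

def box_count_dimension(x, scales):
    """Slope of log N(eps) vs log(1/eps), where N(eps) is the number of
    grid boxes of side eps containing at least one point."""
    x = x - x.min(axis=0)  # shift into the positive orthant
    counts = []
    for eps in scales:
        # Assign each point to a grid cell, then count distinct cells.
        boxes = np.unique(np.floor(x / eps), axis=0)
        counts.append(len(boxes))
    slope, _ = np.polyfit(np.log(1.0 / scales), np.log(counts), 1)
    return slope

scales = np.array([0.05, 0.1, 0.2, 0.4])
dim = box_count_dimension(pts, scales)
print(round(dim, 2))  # close to 1.0, the intrinsic dimension of a circle
```

For sparse samples the occupied-box counts at small scales undercount the true covering number, which is precisely the regime the paper's modification targets.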
arXiv Detail & Related papers (2021-07-08T15:35:54Z)
- Improving Metric Dimensionality Reduction with Distributed Topology [68.8204255655161]
DIPOLE is a dimensionality-reduction post-processing step that corrects an initial embedding by minimizing a loss functional with both a local, metric term and a global, topological term.
We observe that DIPOLE outperforms popular methods like UMAP, t-SNE, and Isomap on a number of popular datasets.
arXiv Detail & Related papers (2021-06-14T17:19:44Z)
- Out-of-Manifold Regularization in Contextual Embedding Space for Text Classification [22.931314501371805]
We propose a new approach to finding and regularizing the remainder of the space, referred to as out-of-manifold.
We synthesize the out-of-manifold embeddings based on two embeddings obtained from actually-observed words.
A discriminator is trained to detect whether an input embedding is located inside the manifold or not, and simultaneously, a generator is optimized to produce new embeddings that can be easily identified as out-of-manifold.
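The core mechanism — synthesizing candidate embeddings from pairs of observed embeddings and testing whether they fall off the data manifold — can be illustrated without the paper's adversarial training. In this toy, an affine combination with a coefficient outside [0, 1] extrapolates beyond the segment joining two observed embeddings, and distance to the nearest observed embedding serves as a crude stand-in for the learned discriminator; the threshold and random embeddings are illustrative assumptions.

```python
# Toy illustration (not the paper's generator/discriminator training):
# synthesize a candidate from two observed embeddings, then use distance
# to the nearest observed embedding as a crude "out-of-manifold" test.
import numpy as np

rng = np.random.default_rng(2)
observed = rng.normal(size=(100, 16))  # stand-in contextual embeddings

def synthesize(u, v, alpha):
    """Affine combination of two observed embeddings; alpha outside
    [0, 1] extrapolates beyond the segment joining them."""
    return alpha * u + (1.0 - alpha) * v

def looks_out_of_manifold(e, ref, threshold):
    # Flag e when no observed embedding lies within the threshold radius.
    return np.linalg.norm(ref - e, axis=1).min() > threshold

inside = synthesize(observed[0], observed[1], 0.5)   # midpoint
outside = synthesize(observed[0], observed[1], 5.0)  # far extrapolation
print(looks_out_of_manifold(inside, observed, 5.0),
      looks_out_of_manifold(outside, observed, 5.0))
# expected: the far extrapolation is flagged, the midpoint is not
```

The paper replaces both hand-coded pieces with learned ones: a generator that proposes the synthetic embeddings and a discriminator that scores their manifold membership.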
arXiv Detail & Related papers (2021-05-14T10:17:59Z)
- Deep Shells: Unsupervised Shape Correspondence with Optimal Transport [52.646396621449]
We propose a novel unsupervised learning approach to 3D shape correspondence.
We show that the proposed method significantly improves over the state-of-the-art on multiple datasets.
arXiv Detail & Related papers (2020-10-28T22:24:07Z)
- Closed-Form Factorization of Latent Semantics in GANs [65.42778970898534]
A rich set of interpretable dimensions has been shown to emerge in the latent space of the Generative Adversarial Networks (GANs) trained for synthesizing images.
In this work, we examine the internal representation learned by GANs to reveal the underlying variation factors in an unsupervised manner.
We propose a closed-form factorization algorithm for latent semantic discovery by directly decomposing the pre-trained weights.
arXiv Detail & Related papers (2020-07-13T18:05:36Z)
- Manifold Learning via Manifold Deflation [105.7418091051558]
Dimensionality-reduction methods provide a valuable means to visualize and interpret high-dimensional data.
However, many popular methods can fail dramatically, even on simple two-dimensional manifolds.
This paper presents an embedding method for a novel, incremental tangent space estimator that incorporates global structure as coordinates.
Empirically, we show our algorithm recovers novel and interesting embeddings on real-world and synthetic datasets.
arXiv Detail & Related papers (2020-07-07T10:04:28Z)
- Learning Flat Latent Manifolds with VAEs [16.725880610265378]
We propose an extension to the framework of variational auto-encoders, where the Euclidean metric is a proxy for the similarity between data points.
We replace the compact prior typically used in variational auto-encoders with a recently presented, more expressive hierarchical one.
We evaluate our method on a range of data-sets, including a video-tracking benchmark.
arXiv Detail & Related papers (2020-02-12T09:54:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.