On Geodesic Distances and Contextual Embedding Compression for Text
Classification
- URL: http://arxiv.org/abs/2104.11295v1
- Date: Thu, 22 Apr 2021 19:30:06 GMT
- Title: On Geodesic Distances and Contextual Embedding Compression for Text
Classification
- Authors: Rishi Jha and Kai Mihata
- Abstract summary: In some memory-constrained settings, it can be advantageous to have smaller contextual embeddings.
We investigate the efficacy of projecting contextual embedding data onto a manifold and using nonlinear dimensionality reduction techniques to compress these embeddings.
In particular, we propose a novel post-processing approach, applying a combination of Isomap and PCA.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In some memory-constrained settings like IoT devices and over-the-network
data pipelines, it can be advantageous to have smaller contextual embeddings.
We investigate the efficacy of projecting contextual embedding data (BERT) onto
a manifold and using nonlinear dimensionality reduction techniques to compress
these embeddings. In particular, we propose a novel post-processing approach
that applies a combination of Isomap and PCA. We find that geodesic distance
estimates, i.e., approximations of shortest paths on a Riemannian manifold,
derived from Isomap's k-Nearest Neighbors graph bolster the performance of the
compressed embeddings to the point of being comparable to the original BERT
embeddings. On one dataset, we
find that despite a 12-fold dimensionality reduction, the compressed embeddings
performed within 0.1% of the original BERT embeddings on a downstream
classification task. In addition, we find that this approach works particularly
well on tasks reliant on syntactic data when compared with linear
dimensionality reduction. These results show promise for a novel geometric
approach to achieving lower-dimensional text embeddings from existing
transformers and pave the way for data-specific and application-specific
embedding compression.
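The exact pipeline is not fully specified in this abstract, so the following is a minimal sketch, assuming scikit-learn's PCA and Isomap, random 768-dimensional vectors as stand-ins for BERT embeddings, and an illustrative PCA-then-Isomap ordering; all hyperparameters here are our assumptions, not the authors' settings.

```python
# Hedged sketch: compress 768-d BERT-like embeddings roughly 12x, then
# probe the compressed vectors with a linear classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))   # stand-in for BERT embeddings
y = rng.integers(0, 2, size=1000)  # stand-in classification labels

# PCA first reduces noise and the cost of the k-NN graph; Isomap's k-NN
# graph then supplies geodesic (shortest-path) distance estimates that
# shape the final low-dimensional embedding.
X_pca = PCA(n_components=128, random_state=0).fit_transform(X)
X_iso = Isomap(n_neighbors=10, n_components=64).fit_transform(X_pca)  # 768 -> 64

X_tr, X_te, y_tr, y_te = train_test_split(X_iso, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy on compressed embeddings:", clf.score(X_te, y_te))
```

On random stand-in data the probe scores near chance; the snippet only illustrates the shape of the pipeline and the roughly 12-fold size reduction.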
Related papers
- Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images [60.42768987736088]
We introduce a benchmark that equitably evaluates methodologies across both distillation and pruning literatures.
Our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, even randomly selected subsets can achieve surprisingly competitive performance.
We propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively.
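As a point of reference for that finding, the random-subset baseline is simple to state; the sketch below is ours (the function name and keep ratio are hypothetical, not the benchmark's API):

```python
import numpy as np

def random_prune(images, labels, keep_ratio=0.1, seed=0):
    """Random-subset baseline for dataset compression: keep a uniform
    random fraction of the training set, leaving labels untouched."""
    n = len(images)
    idx = np.random.default_rng(seed).choice(
        n, size=int(n * keep_ratio), replace=False)
    return images[idx], labels[idx]
```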
arXiv Detail & Related papers (2025-02-10T13:11:40Z)
- Point Cloud Compression with Bits-back Coding [32.9521748764196]
This paper uses a deep learning-based probabilistic model to estimate the Shannon entropy of point cloud data.
Once the entropy of the point cloud dataset is estimated, we use the learned CVAE model to compress the geometric attributes of the point clouds.
The novelty of our method lies in using bits-back coding together with the CVAE's learned latent variable model to compress the point cloud data.
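The CVAE and bits-back machinery do not fit in a short snippet, but the underlying accounting does: under a probabilistic model, a symbol with probability p costs about -log2(p) bits. A toy illustration (not the paper's model):

```python
import numpy as np

# Code-length accounting under a (stand-in) learned model: a symbol with
# model probability p costs about -log2(p) bits, so the expected rate is
# the model's cross-entropy with the data distribution.
probs = np.array([0.5, 0.25, 0.125, 0.125])  # stand-in model probabilities
rate = float(np.sum(probs * -np.log2(probs)))
print(f"expected rate: {rate:.3f} bits/symbol")  # 1.750
```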
arXiv Detail & Related papers (2024-10-09T06:34:48Z)
- Hierarchical Features Matter: A Deep Exploration of GAN Priors for Improved Dataset Distillation [51.44054828384487]
We propose a novel parameterization method dubbed Hierarchical Generative Latent Distillation (H-GLaD).
This method systematically explores the hierarchical layers within generative adversarial networks (GANs).
In addition, we introduce a novel class-relevant feature distance metric to alleviate the computational burden associated with synthetic dataset evaluation.
arXiv Detail & Related papers (2024-06-09T09:15:54Z)
- CBMAP: Clustering-based manifold approximation and projection for dimensionality reduction [0.0]
Dimensionality reduction methods are employed to map high-dimensional data into lower-dimensional spaces while retaining its essential structure.
This study introduces a clustering-based approach, namely CBMAP, for dimensionality reduction.
CBMAP aims to preserve both global and local structures, ensuring that clusters in lower-dimensional spaces closely resemble those in high-dimensional spaces.
arXiv Detail & Related papers (2024-04-27T15:44:21Z)
- Deep Manifold Graph Auto-Encoder for Attributed Graph Embedding [51.75091298017941]
This paper proposes a novel Deep Manifold (Variational) Graph Auto-Encoder (DMVGAE/DMGAE) for attributed graph data.
The proposed method surpasses state-of-the-art baseline algorithms by a significant margin on different downstream tasks across popular datasets.
arXiv Detail & Related papers (2024-01-12T17:57:07Z)
- Dataset Condensation with Latent Space Knowledge Factorization and Sharing [73.31614936678571]
We introduce a novel approach for solving the dataset condensation problem by exploiting the regularity in a given dataset.
Instead of condensing the dataset directly in the original input space, we assume a generative process of the dataset with a set of learnable codes.
We experimentally show that our method achieves new state-of-the-art records by significant margins on various benchmark datasets.
arXiv Detail & Related papers (2022-08-21T18:14:08Z)
- Hierarchical Nearest Neighbor Graph Embedding for Efficient Dimensionality Reduction [25.67957712837716]
We introduce a novel method based on a hierarchy built on 1-nearest neighbor graphs in the original space.
The proposal is an optimization-free projection that is competitive with the latest versions of t-SNE and UMAP.
In the paper, we argue for the soundness of the proposed method and evaluate it on a diverse collection of datasets with sizes varying from 1K to 11M samples and dimensions from 28 to 16K.
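To make the starting point concrete, here is a small sketch of the first level of such a hierarchy: a directed 1-nearest-neighbor graph whose weakly connected components form the initial clusters. The recursive aggregation and the projection step of the actual method are omitted, and the dataset is just a convenient stand-in.

```python
from scipy.sparse.csgraph import connected_components
from sklearn.datasets import load_digits
from sklearn.neighbors import kneighbors_graph

X, _ = load_digits(return_X_y=True)  # 1797 points, 64 dims (stand-in data)

# Directed 1-NN graph: one edge from each point to its nearest neighbor.
g = kneighbors_graph(X, n_neighbors=1, mode="connectivity")

# Weakly connected components of the 1-NN graph give first-level clusters,
# which a hierarchical method can then collapse and re-link recursively.
n_comp, labels = connected_components(g, directed=True, connection="weak")
print(f"{X.shape[0]} points collapse into {n_comp} first-level clusters")
```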
arXiv Detail & Related papers (2022-03-24T11:41:16Z)
- Topology-Preserving Dimensionality Reduction via Interleaving Optimization [10.097180927318703]
We show how optimization seeking to minimize the interleaving distance can be incorporated into dimensionality reduction algorithms.
We demonstrate the utility of this framework to data visualization.
arXiv Detail & Related papers (2022-01-31T06:11:17Z)
- Deep Recursive Embedding for High-Dimensional Data [9.611123249318126]
We propose to combine deep neural networks (DNN) with mathematics-guided embedding rules for high-dimensional data embedding.
We introduce a generic deep embedding network (DEN) framework, which is able to learn a parametric mapping from high-dimensional space to low-dimensional space.
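A minimal sketch of such a parametric mapping, assuming PyTorch; the architecture, the regression loss, and the use of precomputed low-dimensional targets are our illustrative choices, not DEN's actual embedding rules.

```python
import torch
from torch import nn

# Parametric embedding map f: R^784 -> R^2. Once trained, new points are
# embedded with a single forward pass, unlike non-parametric t-SNE/UMAP.
class EmbeddingNet(nn.Module):
    def __init__(self, d_in=784, d_out=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, d_out),
        )

    def forward(self, x):
        return self.net(x)

X = torch.randn(512, 784)  # stand-in high-dimensional data
Y = torch.randn(512, 2)    # stand-in low-dimensional targets (e.g., t-SNE)

model = EmbeddingNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):       # regress the network onto the targets
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), Y)
    loss.backward()
    opt.step()
print("final MSE:", float(loss))
```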
arXiv Detail & Related papers (2021-10-31T23:22:33Z)
- MuSCLE: Multi Sweep Compression of LiDAR using Deep Entropy Models [78.93424358827528]
We present a novel compression algorithm that reduces the storage requirements of LiDAR sensor data streams.
Our method significantly reduces the joint geometry and intensity bitrate compared with prior state-of-the-art LiDAR compression methods.
arXiv Detail & Related papers (2020-11-15T17:41:14Z)
- Manifold Learning via Manifold Deflation [105.7418091051558]
Dimensionality reduction methods provide a valuable means to visualize and interpret high-dimensional data.
However, many popular methods can fail dramatically, even on simple two-dimensional manifolds.
This paper presents an embedding method built on a novel, incremental tangent space estimator that incorporates global structure as coordinates.
Empirically, we show our algorithm recovers novel and interesting embeddings on real-world and synthetic datasets.
arXiv Detail & Related papers (2020-07-07T10:04:28Z)
- Optimizing Vessel Trajectory Compression [71.42030830910227]
In previous work we introduced a trajectory detection module that can provide summarized representations of vessel trajectories by consuming AIS positional messages online.
This methodology can provide reliable trajectory synopses with little deviation from the original course by discarding at least 70% of the raw data as redundant.
However, such trajectory compression is very sensitive to parametrization.
We take into account the type of each vessel in order to provide a suitable configuration that can yield improved trajectory synopses.
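The module above operates online on AIS streams; as a rough offline analogue of discarding redundant points, here is a standard Ramer-Douglas-Peucker simplification sketch (our substitute illustration, not the paper's algorithm; the tolerance eps plays the role of the sensitive parametrization).

```python
import numpy as np

def rdp(points, eps):
    """Ramer-Douglas-Peucker: recursively drop points that lie within
    eps of the chord between a segment's endpoints."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    norm = np.linalg.norm(chord)
    diff = points - start
    if norm == 0.0:  # degenerate segment: endpoints coincide
        dist = np.linalg.norm(diff, axis=1)
    else:            # perpendicular distance to the start-end line
        dist = np.abs(chord[0] * diff[:, 1] - chord[1] * diff[:, 0]) / norm
    i = int(np.argmax(dist))
    if dist[i] > eps:  # keep the farthest point, recurse on both halves
        left, right = rdp(points[: i + 1], eps), rdp(points[i:], eps)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])

t = np.linspace(0, 10, 200)
track = np.column_stack([t, np.sin(t)])  # stand-in vessel track
print(len(track), "->", len(rdp(track, eps=0.05)), "points")
```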
arXiv Detail & Related papers (2020-05-11T20:38:56Z)