TexShape: Information Theoretic Sentence Embedding for Language Models
- URL: http://arxiv.org/abs/2402.05132v2
- Date: Sat, 11 May 2024 20:03:26 GMT
- Title: TexShape: Information Theoretic Sentence Embedding for Language Models
- Authors: Kaan Kale, Homa Esfahanizadeh, Noel Elias, Oguzhan Baser, Muriel Medard, Sriram Vishwanath
- Abstract summary: This paper addresses the challenge of encoding sentences into optimized representations through the lens of information theory.
We use empirical estimates of mutual information based on the Donsker-Varadhan definition of the Kullback-Leibler divergence.
Our experiments demonstrate significant advancements in preserving maximal targeted information and minimal sensitive information under adverse compression ratios.
- Score: 5.265661844206274
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the exponential growth in data volume and the emergence of data-intensive applications, particularly in the field of machine learning, concerns related to resource utilization, privacy, and fairness have become paramount. This paper focuses on the textual domain of data and addresses challenges regarding encoding sentences to their optimized representations through the lens of information theory. In particular, we use empirical estimates of mutual information based on the Donsker-Varadhan definition of the Kullback-Leibler divergence. Our approach leverages this estimation to train an information-theoretic sentence embedding, called TexShape, for (task-based) data compression or for filtering out sensitive information, enhancing privacy and fairness. In this study, we employ a benchmark language model for initial text representation, complemented by neural networks for information-theoretic compression and mutual information estimation. Our experiments demonstrate significant advancements in preserving maximal targeted information and minimal sensitive information under adverse compression ratios, in terms of the predictive accuracy of downstream models trained on the compressed data.
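To make the estimator concrete: the Donsker-Varadhan definition yields the lower bound I(X; Z) >= E_joint[T(X, Z)] - log E_marginal[e^{T(X, Z)}] for any critic function T, and MINE-style estimators maximize this bound with a neural critic. Below is a minimal PyTorch sketch of such an estimator; the critic architecture and all names are illustrative assumptions, not the paper's actual code.

```python
import math
import torch
import torch.nn as nn

class DVCritic(nn.Module):
    """Critic T(x, z) for the Donsker-Varadhan bound (hypothetical architecture)."""
    def __init__(self, x_dim: int, z_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def dv_mi_lower_bound(critic: nn.Module, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Donsker-Varadhan bound: I(X;Z) >= E_joint[T] - log E_marginal[exp(T)].

    Joint samples are the paired rows (x_i, z_i); marginal samples pair
    x with a shuffled z, which breaks the dependence between the two.
    """
    joint = critic(x, z).mean()
    z_perm = z[torch.randperm(z.size(0))]
    marginal = torch.logsumexp(critic(x, z_perm), dim=0) - math.log(z.size(0))
    return joint - marginal

# Sketch of use: x = initial sentence embedding, z = compressed embedding.
# Ascend the bound w.r.t. the critic, then use the estimate as a term to
# maximize (targeted info) or minimize (sensitive info) when training the encoder.
```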
Related papers
- Maintaining Informative Coherence: Migrating Hallucinations in Large Language Models via Absorbing Markov Chains [6.920249042435973]
Large Language Models (LLMs) are powerful tools for text generation, translation, and summarization.
LLMs often suffer from hallucinations: instances where they fail to maintain the fidelity and coherence of contextual information.
We propose a novel decoding strategy that leverages absorbing Markov chains to quantify the significance of contextual information.
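For background on the machinery this entry invokes: for an absorbing Markov chain written in canonical form, the fundamental matrix N = (I - Q)^{-1} gives expected visit counts to transient states, and B = NR gives absorption probabilities. A minimal NumPy sketch of this textbook computation (toy numbers; not the paper's decoding code):

```python
import numpy as np

# Canonical form of an absorbing Markov chain:
#   P = [[Q, R],
#        [0, I]]
# Q: transient -> transient, R: transient -> absorbing (toy numbers).
Q = np.array([[0.5, 0.2],
              [0.3, 0.4]])
R = np.array([[0.3, 0.0],
              [0.0, 0.3]])

# Fundamental matrix: N[i, j] = expected visits to transient state j from i.
N = np.linalg.inv(np.eye(Q.shape[0]) - Q)

# Absorption probabilities: B[i, j] = P(absorbed in j | start in i).
B = N @ R
print(B)  # each row sums to 1
```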
arXiv Detail & Related papers (2024-10-27T04:51:18Z)
- Enhancing AI-based Generation of Software Exploits with Contextual Information [9.327315119028809]
The study employs a dataset comprising real shellcodes to evaluate the models across various scenarios.
The experiments are designed to assess the models' resilience against incomplete descriptions, their proficiency in leveraging context for enhanced accuracy, and their ability to discern irrelevant information.
The models demonstrate an ability to filter out unnecessary context, maintaining high levels of accuracy in the generation of offensive security code.
arXiv Detail & Related papers (2024-08-05T11:52:34Z)
- Capturing Pertinent Symbolic Features for Enhanced Content-Based Misinformation Detection [0.0]
The detection of misleading content presents a significant hurdle due to its extreme linguistic and domain variability.
This paper analyzes the linguistic attributes that characterize this phenomenon and how representative some of the most popular misinformation datasets are of such features.
We demonstrate that the appropriate use of pertinent symbolic knowledge in combination with neural language models is helpful in detecting misleading content.
arXiv Detail & Related papers (2024-01-29T16:42:34Z)
- RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE).
It encodes the text corpus into a latent space, capturing current and future information from both source and target text.
Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Toward a Geometrical Understanding of Self-supervised Contrastive Learning [55.83778629498769]
Self-supervised learning (SSL) is one of the premier techniques to create data representations that are actionable for transfer learning in the absence of human annotations.
Mainstream SSL techniques rely on a specific deep neural network architecture with two cascaded neural networks: the encoder and the projector.
In this paper, we investigate how the strength of the data augmentation policies affects the data embedding.
arXiv Detail & Related papers (2022-05-13T23:24:48Z)
- Compressed Predictive Information Coding [6.220929746808418]
We develop a novel information-theoretic framework, Compressed Predictive Information Coding (CPIC), to extract useful representations from dynamic data.
We derive variational bounds on the CPIC loss, which induce the latent space to capture maximally predictive information.
We demonstrate that CPIC is able to recover the latent space of noisy dynamical systems with low signal-to-noise ratios.
arXiv Detail & Related papers (2022-03-03T22:47:58Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Generative Counterfactuals for Neural Networks via Attribute-Informed Perturbation [51.29486247405601]
We design a framework to generate counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation (AIP).
By utilizing generative models conditioned with different attributes, counterfactuals with desired labels can be obtained effectively and efficiently.
Experimental results on real-world texts and images demonstrate the effectiveness, sample quality, and efficiency of our designed framework.
arXiv Detail & Related papers (2021-01-18T08:37:13Z)
- Representation Learning for Sequence Data with Deep Autoencoding Predictive Components [96.42805872177067]
We propose a self-supervised representation learning method for sequence data, based on the intuition that useful representations of sequence data should exhibit a simple structure in the latent space.
We encourage this latent structure by maximizing an estimate of predictive information of latent feature sequences, which is the mutual information between past and future windows at each time step.
We demonstrate that our method recovers the latent space of noisy dynamical systems, extracts predictive features for forecasting tasks, and improves automatic speech recognition when used to pretrain the encoder on large amounts of unlabeled data.
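As a concrete stand-in for that past-future mutual information: under a simplifying Gaussian assumption it reduces to log-determinants of covariance blocks, I = 0.5 (logdet S_past + logdet S_future - logdet S_joint). The sketch below computes this Gaussian estimate; it is an illustrative substitute, not the authors' estimator.

```python
import torch

def gaussian_predictive_information(z: torch.Tensor, window: int) -> torch.Tensor:
    """Gaussian estimate of I(past; future) for a latent sequence z of shape (T, d).

    Past/future are length-`window` slices flattened to vectors; T should be
    much larger than window * d for the covariance to be well conditioned.
    """
    T, d = z.shape
    pairs = []
    for t in range(window, T - window + 1):
        past = z[t - window:t].reshape(-1)      # (window * d,)
        future = z[t:t + window].reshape(-1)    # (window * d,)
        pairs.append(torch.cat([past, future]))
    X = torch.stack(pairs)                      # (N, 2 * window * d)
    X = X - X.mean(dim=0, keepdim=True)
    cov = (X.T @ X) / (X.shape[0] - 1)
    cov = cov + 1e-4 * torch.eye(cov.shape[0])  # ridge for numerical stability
    k = window * d
    return 0.5 * (torch.logdet(cov[:k, :k])
                  + torch.logdet(cov[k:, k:])
                  - torch.logdet(cov))
```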
arXiv Detail & Related papers (2020-10-07T03:34:01Z)
- Learning Optimal Representations with the Decodable Information Bottleneck [43.30367159353152]
In machine learning, our goal is not compression but rather generalization, which is intimately linked to the predictive family or decoder of interest.
We propose the Decodable Information Bottleneck (DIB) that considers information retention and compression from the perspective of the desired predictive family.
As a result, DIB gives rise to representations that are optimal in terms of expected test performance and can be estimated with guarantees.
arXiv Detail & Related papers (2020-09-27T08:33:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.