Factor Augmented Supervised Learning with Text Embeddings
- URL: http://arxiv.org/abs/2508.06548v1
- Date: Wed, 06 Aug 2025 01:44:47 GMT
- Title: Factor Augmented Supervised Learning with Text Embeddings
- Authors: Zhanye Luo, Yuefeng Han, Xiufan Yu
- Abstract summary: AutoEncoder-Augmented Learning with Text (AEALT) is a supervised, factor-augmented framework that incorporates dimension reduction directly into pre-trained large language models (LLMs). AEALT outperforms conventional deep-learning approaches that rely on raw embeddings. We validate its broad applicability with extensive experiments on classification, anomaly detection, and prediction tasks.
- Score: 3.0040661953201475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) generate text embeddings from text data, producing vector representations that capture the semantic meaning and contextual relationships of words. However, the high dimensionality of these embeddings often impedes efficiency and drives up computational cost in downstream tasks. To address this, we propose AutoEncoder-Augmented Learning with Text (AEALT), a supervised, factor-augmented framework that incorporates dimension reduction directly into pre-trained LLM workflows. First, we extract embeddings from text documents; next, we pass them through a supervised augmented autoencoder to learn low-dimensional, task-relevant latent factors. By modeling the nonlinear structure of complex embeddings, AEALT outperforms conventional deep-learning approaches that rely on raw embeddings. We validate its broad applicability with extensive experiments on classification, anomaly detection, and prediction tasks using multiple real-world public datasets. Numerical results demonstrate that AEALT yields substantial gains over both vanilla embeddings and several standard dimension reduction methods.
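The two-step pipeline described in the abstract (extract embeddings, then learn low-dimensional, task-relevant latent factors with a supervised autoencoder) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single tanh encoder layer, the linear decoder and prediction head, the loss weight `lam`, and the synthetic data from the hypothetical helper `make_toy_embeddings` (a stand-in for real LLM embeddings) are all assumptions made for illustration.

```python
import numpy as np


def make_toy_embeddings(n=200, dim=32, n_factors=3, seed=0):
    """Synthetic stand-in for LLM embeddings (assumption, not real text data):
    a few hidden factors drive both the embedding X and the target y."""
    rng = np.random.default_rng(seed)
    factors = rng.normal(size=(n, n_factors))
    loadings = rng.normal(size=(n_factors, dim))
    X = np.tanh(factors @ loadings) + 0.1 * rng.normal(size=(n, dim))
    y = factors[:, 0] + 0.5 * factors[:, 1] + 0.1 * rng.normal(size=n)
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each dimension
    return X, y - y.mean()                    # center the target


def encode(X, params):
    """Map embeddings to low-dimensional latent factors."""
    return np.tanh(X @ params["W_enc"])


def train_supervised_autoencoder(X, y, latent_dim=3, lam=1.0,
                                 lr=0.01, epochs=500, seed=0):
    """Gradient descent on: reconstruction MSE + lam * prediction MSE."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    params = {
        "W_enc": rng.normal(scale=0.1, size=(d, latent_dim)),
        "W_dec": rng.normal(scale=0.1, size=(latent_dim, d)),
        "w_out": rng.normal(scale=0.1, size=latent_dim),
    }
    history = []
    for _ in range(epochs):
        Z = encode(X, params)              # latent factors, shape (n, k)
        r_err = Z @ params["W_dec"] - X    # reconstruction error
        p_err = Z @ params["w_out"] - y    # supervised prediction error
        history.append(np.mean(r_err ** 2) + lam * np.mean(p_err ** 2))

        # Backpropagate both loss terms through the shared encoder.
        g_xhat = 2.0 * r_err / (n * d)
        g_yhat = 2.0 * lam * p_err / n
        g_Z = g_xhat @ params["W_dec"].T + np.outer(g_yhat, params["w_out"])
        g_pre = g_Z * (1.0 - Z ** 2)       # derivative of tanh
        grads = {
            "W_enc": X.T @ g_pre,
            "W_dec": Z.T @ g_xhat,
            "w_out": Z.T @ g_yhat,
        }
        for name in params:
            params[name] -= lr * grads[name]
    return params, history


if __name__ == "__main__":
    X, y = make_toy_embeddings()
    params, history = train_supervised_autoencoder(X, y)
    Z = encode(X, params)
    print("latent shape:", Z.shape)
    print("loss: %.3f -> %.3f" % (history[0], history[-1]))
```

The latent factors `Z` would then replace the raw embeddings as inputs to a downstream classifier, anomaly detector, or regressor. Setting `lam` to zero reduces the sketch to a plain unsupervised autoencoder, which is one way to see what the supervised term contributes.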
Related papers
- Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions [62.02112656288921]
Reinforcement learning (RL) has been widely explored for adapting MCAs to various human-AI interaction scenarios. We learn a compact latent action space for RL fine-tuning instead. We leverage both paired image-text data and text-only data to construct the latent action space.
arXiv Detail & Related papers (2026-01-12T13:13:24Z) - Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning [6.549601823162279]
Large Language Models (LLMs) have become a cornerstone in Natural Language Processing (NLP). We explore several adaptation strategies for pre-trained, decoder-only LLMs.
arXiv Detail & Related papers (2025-07-30T14:49:30Z) - Robust Detection of LLM-Generated Text: A Comparative Analysis [0.276240219662896]
Large language models can be widely integrated into many aspects of life, and their output can quickly fill all network resources.
It becomes increasingly important to develop powerful detectors for the generated text.
This detector is essential to prevent the potential misuse of these technologies and to protect areas such as social media from the negative effects.
arXiv Detail & Related papers (2024-11-09T18:27:15Z) - FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, securing up to 85% of model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z) - CELA: Cost-Efficient Language Model Alignment for CTR Prediction [70.65910069412944]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems. Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs). We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z) - Text2Data: Low-Resource Data Generation with Textual Control [100.5970757736845]
Text2Data is a novel approach that utilizes unlabeled data to understand the underlying data distribution. It undergoes finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z) - Measuring Distributional Shifts in Text: The Advantage of Language Model-Based Embeddings [11.393822909537796]
An essential part of monitoring machine learning models in production is measuring input and output data drift.
Recent advancements in large language models (LLMs) indicate their effectiveness in capturing semantic relationships.
We propose a clustering-based algorithm for measuring distributional shifts in text data by exploiting such embeddings.
arXiv Detail & Related papers (2023-12-04T20:46:48Z) - Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE [62.13435256279566]
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We show that it enables these layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
arXiv Detail & Related papers (2023-10-28T04:07:58Z) - Composable Text Controls in Latent Space with ODEs [97.12426987887021]
This paper proposes a new efficient approach for composable text operations in the compact latent space of text.
By connecting pretrained LMs to the latent space through efficient adaption, we then decode the sampled vectors into desired text sequences.
Experiments show that composing those operators within our approach manages to generate or edit high-quality text.
arXiv Detail & Related papers (2022-08-01T06:51:45Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z) - DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations [4.36561468436181]
We present DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations.
Our approach closes the performance gap between unsupervised and supervised pretraining for universal sentence encoders.
Our code and pretrained models are publicly available and can be easily adapted to new domains or used to embed unseen text.
arXiv Detail & Related papers (2020-06-05T20:00:28Z)