inversedMixup: Data Augmentation via Inverting Mixed Embeddings
- URL: http://arxiv.org/abs/2601.21543v1
- Date: Thu, 29 Jan 2026 11:00:50 GMT
- Title: inversedMixup: Data Augmentation via Inverting Mixed Embeddings
- Authors: Fanshuang Kong, Richong Zhang, Qiyu Sun, Zhijie Nie, Ting Deng, Chunming Hu
- Abstract summary: Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. InversedMixup combines the controllability of Mixup with the interpretability of LLM-based generation.
- Score: 45.12897360336728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. However, since it operates at the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based augmentation methods produce sentences via prompts at the token level, yielding readable outputs but offering limited control over the generation process. Inspired by recent advances in LLM inversion, which reconstructs natural language from embeddings and helps bridge the gap between the latent embedding space and the discrete token space, we propose inversedMixup, a unified framework that combines the controllability of Mixup with the interpretability of LLM-based generation. Specifically, inversedMixup adopts a three-stage training procedure to align the output embedding space of a task-specific model with the input embedding space of an LLM. Once aligned, inversedMixup can reconstruct mixed embeddings with a controllable mixing ratio into human-interpretable augmented sentences, thereby improving augmentation performance. Additionally, inversedMixup provides the first empirical evidence of the manifold intrusion phenomenon in text Mixup and introduces a simple yet effective strategy to mitigate it. Extensive experiments demonstrate the effectiveness and generalizability of our approach in both few-shot and fully supervised scenarios.
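To make the mixing step concrete, below is a minimal NumPy sketch of the embedding-level interpolation that inversedMixup builds on. The function name `mixup_embeddings` and the toy dimensions are illustrative; the inversion of the mixed embedding back into a sentence (the paper's three-stage alignment) is only indicated as a comment.

```python
import numpy as np

def mixup_embeddings(e_a, e_b, y_a, y_b, lam):
    """Linearly interpolate two sentence embeddings and their one-hot labels."""
    e_mix = lam * e_a + (1.0 - lam) * e_b
    y_mix = lam * y_a + (1.0 - lam) * y_b
    return e_mix, y_mix

# Toy usage: two 768-d embeddings, 3-class one-hot labels, mixing ratio 0.7.
rng = np.random.default_rng(0)
e_a, e_b = rng.normal(size=768), rng.normal(size=768)
y_a, y_b = np.eye(3)[0], np.eye(3)[2]
e_mix, y_mix = mixup_embeddings(e_a, e_b, y_a, y_b, lam=0.7)
# inversedMixup would then decode e_mix into a readable sentence via an
# inversion model aligned with the LLM's input embedding space (not shown).
```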
Related papers
- Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference [58.189320101488725]
DLLMs promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependencies. We propose ReMix, a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state.
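The summary does not spell out how the Continuous Mixing State is formed, so the sketch below is an assumption: it interpolates the [MASK] embedding with the probability-weighted token embedding of the current prediction. The name `continuous_mixing_state` and its signature are hypothetical.

```python
import numpy as np

def continuous_mixing_state(mask_emb, pred_probs, token_embs, alpha):
    """Hypothetical mixed state: interpolate the [MASK] embedding with the
    probability-weighted (expected) token embedding of the current prediction."""
    expected_emb = pred_probs @ token_embs   # (V,) @ (V, d) -> (d,)
    return alpha * mask_emb + (1.0 - alpha) * expected_emb
```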
arXiv Detail & Related papers (2026-02-26T11:08:11Z)
- Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment [84.74716380180428]
We propose AutoRegEmbed, a contrastive learning method built on embedding conditional probability distributions. We show that our method significantly outperforms traditional contrastive learning approaches.
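AutoRegEmbed departs from plain contrastive learning, but for reference, here is the standard InfoNCE objective it is compared against; this is the generic baseline, not the paper's method.

```python
import numpy as np

def info_nce(queries, positives, temperature=0.05):
    """Standard InfoNCE loss over a batch: queries[i] should match positives[i]."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = q @ p.T / temperature                        # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # -log p(positive | query)
```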
arXiv Detail & Related papers (2025-02-17T03:36:25Z)
- KL-geodesics flow matching with a novel sampling scheme [4.347494885647007]
Non-autoregressive language models generate all tokens simultaneously, offering potential speed advantages over traditional autoregressive models. We investigate a conditional flow matching approach for text generation.
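As a rough illustration of flow matching, the sketch below uses the common straight-line (Euclidean) interpolation; the paper instead follows KL-geodesics, so treat this only as the baseline construction.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Straight-line flow matching: at x_t the model's regression target is
    the constant velocity x1 - x0 along the interpolation path."""
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity
```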
arXiv Detail & Related papers (2024-11-25T17:15:41Z)
- Bridging the Gap between Different Vocabularies for LLM Ensemble [10.669552498083709]
Vocabulary discrepancies among various large language models (LLMs) have constrained previous studies.
We propose a novel method to Ensemble LLMs via Vocabulary Alignment (EVA).
EVA bridges the lexical gap among various LLMs, enabling meticulous ensemble at each generation step.
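The summary gives no implementation detail, so the following sketch assumes EVA-style ensembling can be approximated by mapping each model's next-token distribution into a shared pivot vocabulary through a row-stochastic alignment matrix and averaging; `ensemble_step` and the alignment matrices are illustrative placeholders.

```python
import numpy as np

def ensemble_step(dists, align_mats):
    """Map each model's next-token distribution into a shared pivot vocabulary
    and average; align_mats[i] is a (V_i, V_pivot) row-stochastic matrix."""
    mapped = [d @ A for d, A in zip(dists, align_mats)]
    return np.mean(mapped, axis=0)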
arXiv Detail & Related papers (2024-04-15T06:28:20Z)
- Fast Semisupervised Unmixing Using Nonconvex Optimization [80.11512905623417]
We introduce a novel convex model for semisupervised/library-based unmixing.
We demonstrate the efficacy of alternating optimization methods for sparse unmixing.
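For orientation, here is the standard convex sparse library-based unmixing baseline solved with plain ISTA; the paper's nonconvex model and solver are more elaborate, so this only shows the problem setup.

```python
import numpy as np

def ista_unmix(y, D, lam=0.1, steps=200):
    """Sparse library-based unmixing baseline:
    min_a 0.5 * ||y - D a||^2 + lam * ||a||_1, solved with plain ISTA."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(steps):
        a = a - (D.T @ (D @ a - y)) / L      # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft threshold
    return a
```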
arXiv Detail & Related papers (2024-01-23T10:07:41Z)
- ShuffleMix: Improving Representations via Channel-Wise Shuffle of Interpolated Hidden States [41.628854241259226]
This paper introduces a novel concept of ShuffleMix -- Shuffle of Mixed hidden features.
Our ShuffleMix method favors a simple linear shuffle of randomly selected feature channels for feature mixup in-between supervision.
Compared to its direct competitor, the proposed ShuffleMix can gain superior generalization.
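A minimal sketch of channel-wise shuffling between two samples' hidden states, assuming a hard per-channel swap; ShuffleMix's actual linear shuffle may differ in detail, and `shuffle_mix` is an illustrative name.

```python
import numpy as np

def shuffle_mix(h_a, h_b, p=0.5, rng=None):
    """Channel-wise shuffle of two hidden states: each feature channel is
    taken from h_b with probability p, otherwise kept from h_a."""
    rng = rng or np.random.default_rng()
    mask = rng.random(h_a.shape[-1]) < p     # per-channel swap mask
    return np.where(mask, h_b, h_a)
```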
arXiv Detail & Related papers (2023-05-30T01:53:34Z)
- Harnessing Hard Mixed Samples with Decoupled Regularizer [69.98746081734441]
Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data.
In this paper, we propose an efficient mixup objective function with a decoupled regularizer, named Decoupled Mixup (DM).
DM can adaptively utilize hard mixed samples to mine discriminative features without losing the original smoothness of mixup.
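For context, the sketch below implements the vanilla mixup cross-entropy objective that DM extends; the decoupled regularizer itself is not described in the summary and is therefore omitted.

```python
import numpy as np

def mixup_ce(logits, y_a, y_b, lam):
    """Vanilla mixup objective: a lambda-weighted sum of cross-entropies
    against both source labels; DM adds a decoupled regularizer on top."""
    z = logits - logits.max(axis=-1, keepdims=True)            # stability
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))   # log-softmax
    idx = np.arange(len(logits))
    return -(lam * logp[idx, y_a] + (1.0 - lam) * logp[idx, y_b]).mean()
```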
arXiv Detail & Related papers (2022-03-21T07:12:18Z)
- Modifications of FastICA in Convolutive Blind Source Separation [5.770800671793959]
Convolutive blind source separation (BSS) is intended to recover the unknown components from their convolutive mixtures.
The spatial-temporal prewhitening stage and the para-unitary filters constraint are difficult to implement in a convolutive context.
We propose several modifications of FastICA to alleviate these difficulties.
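As background, here is the classic one-unit FastICA fixed-point update (tanh nonlinearity) for the instantaneous case; the paper's convolutive modifications are not reproduced here.

```python
import numpy as np

def fastica_unit(X, iters=100, rng=None):
    """One-unit FastICA on prewhitened data X (d x n), tanh nonlinearity:
    w <- E[x g(w.T x)] - E[g'(w.T x)] w, followed by renormalization."""
    rng = rng or np.random.default_rng(0)
    w = rng.normal(size=X.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(iters):
        wx = np.tanh(w @ X)
        w = (X * wx).mean(axis=1) - (1.0 - wx ** 2).mean() * w
        w /= np.linalg.norm(w)
    return w
```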
arXiv Detail & Related papers (2021-07-24T13:29:55Z)
- SMILE: Self-Distilled MIxup for Efficient Transfer LEarning [42.59451803498095]
In this work, we propose SMILE - Self-Distilled MIxup for Efficient Transfer LEarning.
With mixed images as inputs, SMILE regularizes the outputs of CNN feature extractors to learn from the mixed feature vectors of inputs.
The triple regularizer balances the mixup effects in both feature and label spaces while bounding the linearity in-between samples for pre-training tasks.
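A sketch of one plausible feature-space term of such a regularizer: penalizing the gap between the features of a mixed input and the mix of the original features. SMILE's full triple regularizer also acts in label space, so this is only illustrative.

```python
import numpy as np

def feature_mixup_penalty(f_mix, f_a, f_b, lam):
    """Penalize the gap between features of the mixed input (f_mix) and the
    lambda-mix of the two original features, one plausible SMILE-style term."""
    target = lam * f_a + (1.0 - lam) * f_b
    return np.mean((f_mix - target) ** 2)
```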
arXiv Detail & Related papers (2021-03-25T16:02:21Z)
- Modal Regression based Structured Low-rank Matrix Recovery for Multi-view Learning [70.57193072829288]
Low-rank Multi-view Subspace Learning (LMvSL) has shown great potential in cross-view classification in recent years.
Existing LMvSL-based methods are incapable of handling view discrepancy and discriminancy simultaneously.
We propose Structured Low-rank Matrix Recovery (SLMR), a unique method of effectively removing view discrepancy and improving discriminancy.
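As a pointer to the underlying machinery, here is singular value thresholding, the standard proximal step for nuclear-norm (low-rank) recovery; SLMR's modal-regression loss and structural constraints go beyond this basic step.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of the nuclear norm,
    the basic building block of low-rank matrix recovery."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```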
arXiv Detail & Related papers (2020-03-22T03:57:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.