Soft Clustering Anchors for Self-Supervised Speech Representation Learning in Joint Embedding Prediction Architectures
- URL: http://arxiv.org/abs/2602.09040v1
- Date: Fri, 30 Jan 2026 20:51:37 GMT
- Title: Soft Clustering Anchors for Self-Supervised Speech Representation Learning in Joint Embedding Prediction Architectures
- Authors: Georgios Ioannides, Adrian Kieback, Judah Goldfeder, Linsey Pang, Aman Chadha, Aaron Elkins, Yann LeCun, Ravid Shwartz-Ziv
- Abstract summary: Joint Embedding Predictive Architectures (JEPA) offer a promising approach to self-supervised speech representation learning, but suffer from representation collapse without explicit grounding. We propose GMM-Anchored JEPA, which fits a Gaussian Mixture Model once on log-mel spectrograms and uses its frozen soft posteriors as auxiliary targets throughout training. On ~50k hours of speech, GMM anchoring improves ASR (28.68% vs. 33.22% WER), emotion recognition (67.76% vs. 65.46%), and slot filling (64.7% vs. 59.1% F1) compared to a WavLM-style baseline with matched compute.
- Score: 45.74430728311433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Joint Embedding Predictive Architectures (JEPA) offer a promising approach to self-supervised speech representation learning, but suffer from representation collapse without explicit grounding. We propose GMM-Anchored JEPA, which fits a Gaussian Mixture Model once on log-mel spectrograms and uses its frozen soft posteriors as auxiliary targets throughout training. A decaying supervision schedule allows GMM regularization to dominate early training before gradually yielding to the JEPA objective. Unlike HuBERT and WavLM, which require iterative re-clustering, our approach clusters input features once with soft rather than hard assignments. On ~50k hours of speech, GMM anchoring improves ASR (28.68% vs. 33.22% WER), emotion recognition (67.76% vs. 65.46%), and slot filling (64.7% vs. 59.1% F1) compared to a WavLM-style baseline with matched compute. Cluster analysis shows GMM-anchored representations achieve up to 98% entropy compared to 31% for WavLM-style, indicating substantially more uniform cluster utilization. Code is made available at https://github.com/gioannides/clustering-anchored-jepa.
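The anchoring recipe described in the abstract (a GMM fit once on log-mel frames, frozen soft posteriors used as auxiliary targets, and a decaying supervision weight that yields to the JEPA objective) can be sketched as below. This is a minimal illustration under stated assumptions: the component count, diagonal covariance, linear decay shape, and all function names (e.g. `anchor_weight`) are assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of GMM soft-posterior anchoring, assuming log-mel frames of
# shape (num_frames, n_mels). All hyperparameters below are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_frozen_gmm(logmel_frames: np.ndarray, n_components: int = 512) -> GaussianMixture:
    """Fit the GMM once on log-mel frames; it is kept frozen for the rest of training."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag", max_iter=100)
    gmm.fit(logmel_frames)
    return gmm

def soft_targets(gmm: GaussianMixture, logmel_frames: np.ndarray) -> np.ndarray:
    """Soft posteriors p(component | frame) used as auxiliary targets (no hard argmax)."""
    return gmm.predict_proba(logmel_frames)  # shape: (num_frames, n_components)

def anchor_weight(step: int, total_steps: int, w0: float = 1.0) -> float:
    """Decaying supervision schedule (linear decay assumed): GMM regularization
    dominates early training and gradually yields to the JEPA objective."""
    return w0 * max(0.0, 1.0 - step / total_steps)

# Per training step (schematic): an auxiliary head predicts the frozen posteriors,
# and its loss is down-weighted over time:
#   loss = jepa_loss + anchor_weight(step, total_steps) * cross_entropy(pred_posteriors, soft_targets)
```

Because the targets come from a single frozen GMM over input features, this sketch involves no iterative re-clustering of learned representations, in contrast to the HuBERT/WavLM recipe the abstract compares against.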
Related papers
- Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models [0.0]
Gated Sparse Attention (GSA) is an architecture that realizes the benefits of both sparse and gated attention. GSA incorporates a gated lightning indexer with sigmoid activations that produce bounded, interpretable selection scores.
arXiv Detail & Related papers (2026-01-12T20:33:39Z) - Fortytwo: Swarm Inference with Peer-Ranked Consensus [36.94429692322632]
We present Fortytwo, a novel protocol that leverages swarm intelligence principles and distributed pairwise ranking consensus to achieve superior performance in AI inference. Using pairwise ranking with a custom Bradley-Terry-style aggregation model, we demonstrate that swarm inference substantially outperforms majority voting.
arXiv Detail & Related papers (2025-10-27T23:19:48Z) - An Enhanced Model-based Approach for Short Text Clustering [58.60681789677676]
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook. Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches. We propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts. Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance.
arXiv Detail & Related papers (2025-07-18T10:07:42Z) - GAIA: A Foundation Model for Operational Atmospheric Dynamics [0.83442357861662]
We introduce GAIA, a hybrid self-supervised model that fuses Masked Autoencoders (MAE) with self-distillation with no labels (DINO). GAIA learns disentangled representations that capture atmospheric dynamics rather than trivial diurnal patterns. When transferred to downstream tasks, GAIA consistently outperforms an MAE-only baseline.
arXiv Detail & Related papers (2025-05-15T05:07:09Z) - Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning [15.61345581743979]
We present Skywork R1V2, a next-generation multimodal reasoning model. At its core, R1V2 introduces a hybrid reinforcement learning paradigm.
arXiv Detail & Related papers (2025-04-23T12:24:10Z) - Scalable Reinforcement Post-Training Beyond Static Human Prompts: Evolving Alignment via Asymmetric Self-Play [52.3079697845254]
eva is the first method that allows language models to adaptively create training prompts in both offline and online RL post-training. We show eva can create effective RL curricula and is robust across ablations.
arXiv Detail & Related papers (2024-10-31T08:15:32Z) - Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309]
Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs.
In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks.
We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench.
arXiv Detail & Related papers (2024-05-28T06:10:12Z) - Advancing Vision Transformers with Group-Mix Attention [59.585623293856735]
Group-Mix Attention (GMA) is an advanced replacement for traditional self-attention.
GMA simultaneously captures token-to-token, token-to-group, and group-to-group correlations with various group sizes.
GroupMixFormer achieves state-of-the-art performance in image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-11-26T01:25:03Z) - Accurate Molecular-Orbital-Based Machine Learning Energies via Unsupervised Clustering of Chemical Space [0.0]
We introduce an unsupervised clustering algorithm to improve training efficiency and accuracy in predicting energies using molecular-orbital-based machine learning (MOB-ML).
This work determines clusters via the Gaussian mixture model (GMM) in an entirely automatic manner.
arXiv Detail & Related papers (2022-04-21T00:56:16Z) - Tight integration of neural- and clustering-based diarization through deep unfolding of infinite Gaussian mixture model [84.57667267657382]
This paper introduces a trainable clustering algorithm into the integration framework. Speaker embeddings are optimized during training so that they better fit iGMM clustering.
Experimental results show that the proposed approach outperforms the conventional approach in terms of diarization error rate.
arXiv Detail & Related papers (2022-02-14T07:45:21Z) - Cauchy-Schwarz Regularized Autoencoder [68.80569889599434]
Variational autoencoders (VAE) are a powerful and widely-used class of generative models.
We introduce a new constrained objective based on the Cauchy-Schwarz divergence, which can be computed analytically for GMMs.
Our objective improves upon variational auto-encoding models in density estimation, unsupervised clustering, semi-supervised learning, and face analysis.
arXiv Detail & Related papers (2021-01-06T17:36:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.