JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention
- URL: http://arxiv.org/abs/2512.07168v1
- Date: Mon, 08 Dec 2025 05:01:51 GMT
- Title: JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention
- Authors: Georgios Ioannides, Christos Constantinou, Aman Chadha, Aaron Elkins, Linsey Pang, Ravid Shwartz-Ziv, Yann LeCun
- Abstract summary: We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM). Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage 2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder.
- Score: 47.304088800992474
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage 2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5 Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
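The abstract does not spell out the Stage 2 tokenizer beyond "FSQ and a mixed-radix packing scheme", so the following is only a minimal sketch of how such a tokenizer can work. The per-dimension level counts, the tanh bounding, and the packing order are illustrative assumptions, not the paper's configuration; the point is simply that FSQ codes with different radices can be losslessly packed into, and unpacked from, a single integer token id.

```python
# Illustrative sketch only: the paper's exact FSQ level choices and packing
# order are not given in the abstract; the values below are assumptions.
import numpy as np

LEVELS = [8, 5, 5, 5]  # hypothetical per-dimension FSQ level counts

def fsq_quantize(z):
    """Finite Scalar Quantization: bound each latent dimension and round it
    to one of a small, fixed number of levels."""
    codes = []
    for zi, L in zip(np.asarray(z, dtype=np.float64), LEVELS):
        half = (L - 1) / 2.0
        # squash to (-1, 1), rescale to [0, L-1], round to the nearest level
        idx = np.round(np.tanh(zi) * half + half)
        codes.append(int(np.clip(idx, 0, L - 1)))
    return codes

def pack_mixed_radix(codes):
    """Pack per-dimension codes into one integer token, using the level
    counts as the radices of a mixed-radix number."""
    token, base = 0, 1
    for c, L in zip(codes, LEVELS):
        token += c * base
        base *= L
    return token  # token lies in [0, prod(LEVELS))

def unpack_mixed_radix(token):
    """Invert pack_mixed_radix, recovering the per-dimension codes."""
    codes = []
    for L in LEVELS:
        codes.append(token % L)
        token //= L
    return codes

# Example: one 4-dim latent frame -> FSQ codes -> one token id -> codes again
codes = fsq_quantize([0.3, -1.2, 0.0, 2.1])
token = pack_mixed_radix(codes)
assert unpack_mixed_radix(token) == codes
```

With these hypothetical levels the effective codebook size is 8 x 5 x 5 x 5 = 1000 ids. Note also that the reported rates are consistent with multiple such tokens per latent frame: 47.5 tokens/sec at a 2.5 Hz frame rate works out to 19 tokens per frame.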
Related papers
- GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model [35.12859489567766]
We present GenTSE, a two-stage decoder-only generative LM approach for TSE. Separating semantics and acoustics stabilizes decoding and yields more faithful, content-aligned target speech. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.
arXiv Detail & Related papers (2025-12-24T06:13:02Z) - AUREXA-SE: Audio-Visual Unified Representation Exchange Architecture with Cross-Attention and Squeezeformer for Speech Enhancement [14.435791985655117]
We propose AUREXA-SE (Audio-Visual Unified Representation Exchange Architecture with Cross-Attention and Squeezeformer for Speech Enhancement). AUREXA-SE jointly leverages raw audio waveforms and visual cues by employing a U-Net-based 1D convolutional encoder for audio. Central to the architecture is a novel bidirectional cross-attention mechanism, which facilitates deep contextual fusion between modalities.
arXiv Detail & Related papers (2025-10-06T19:05:35Z) - DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding [12.05169114091718]
DiffSoundStream is a solution that improves the efficiency of speech tokenization in non-streaming scenarios. Experiments show that at 50 tokens per second, DiffSoundStream achieves speech quality on par with a standard SoundStream model.
arXiv Detail & Related papers (2025-06-27T16:23:07Z) - A-JEPA: Joint-Embedding Predictive Architecture Can Listen [35.308323314848735]
We introduce Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension method for self-supervised learning from the audio spectrum.
A-JEPA encodes visible audio spectrogram patches with a curriculum masking strategy via a context encoder, and predicts the representations of regions sampled at well-designed locations (a generic sketch of this masked latent-prediction recipe appears after the related-papers list).
arXiv Detail & Related papers (2023-11-27T13:53:53Z) - Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The cross-domain A2A model is then adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - A Conformer Based Acoustic Model for Robust Automatic Speech Recognition [63.242128956046024]
The proposed model builds on a state-of-the-art recognition system using a bi-directional long short-term memory (BLSTM) model with utterance-wise dropout and iterative speaker adaptation.
The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling.
The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus.
arXiv Detail & Related papers (2022-03-01T20:17:31Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
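Both Stage 1 of the main paper and the A-JEPA entry above rely on JEPA-style masked prediction in latent space: a predictor regresses the target encoder's representations of masked regions from the context encoder's representations of the visible ones, with the target encoder usually kept as an EMA copy of the context encoder. The sketch below is a generic rendering of that recipe; the module sizes, masking scheme, loss, and EMA rate are assumptions and not either paper's architecture (in particular, it omits the DAAM gating).

```python
# Generic JEPA-style training step: predict latent representations of masked
# frames from visible ones. Shapes, modules, and the EMA rate are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(80, dim), nn.GELU(), nn.Linear(dim, dim))
    def forward(self, x):                      # x: (batch, time, 80), e.g. log-mel frames
        return self.net(x)

context_enc = TinyEncoder()
target_enc = copy.deepcopy(context_enc)       # EMA target encoder, not trained by backprop
for p in target_enc.parameters():
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
opt = torch.optim.AdamW(
    list(context_enc.parameters()) + list(predictor.parameters()), lr=1e-4)

def jepa_step(x, mask, ema=0.996):
    """One training step. mask: (batch, time) bool, True = masked/target frame."""
    with torch.no_grad():
        targets = target_enc(x)                       # latents of the full, unmasked input
    visible = x.masked_fill(mask.unsqueeze(-1), 0.0)  # hide the masked frames
    pred = predictor(context_enc(visible))            # predict latents from visible context
    loss = F.mse_loss(pred[mask], targets[mask])      # supervise only the masked frames
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                             # EMA update of the target encoder
        for pt, pc in zip(target_enc.parameters(), context_enc.parameters()):
            pt.mul_(ema).add_(pc, alpha=1 - ema)
    return loss.item()

x = torch.randn(4, 100, 80)                           # dummy batch of 100 frames
mask = torch.rand(4, 100) < 0.5                       # random 50% masking
print(jepa_step(x, mask))
```

Because the loss lives entirely in latent space, no waveform is reconstructed during this stage, which matches the abstract's point that Stage 1 is fully decoupled from waveform reconstruction.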