MATE: Matryoshka Audio-Text Embeddings for Open-Vocabulary Keyword Spotting
- URL: http://arxiv.org/abs/2601.14012v1
- Date: Tue, 20 Jan 2026 14:30:40 GMT
- Title: MATE: Matryoshka Audio-Text Embeddings for Open-Vocabulary Keyword Spotting
- Authors: Youngmoon Jung, Myunghun Jung, Joon-Young Yang, Yong-Hyeok Lee, Jaeyoung Roh, Hoon-Young Cho
- Abstract summary: Matryoshka Audio-Text Embeddings (MATE) is a dual-encoder framework that encodes multiple granularities within a single vector via nested sub-embeddings. MATE is trained with standard deep metric learning objectives for audio-text KWS, and is loss-agnostic. This is the first application of matryoshka-style embeddings to KWS, achieving state-of-the-art results on WSJ and LibriPhrase without any inference overhead.
- Score: 15.033299024460463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary keyword spotting (KWS) with text-based enrollment has emerged as a flexible alternative to fixed-phrase triggers. Prior utterance-level matching methods, from an embedding-learning standpoint, learn embeddings at a single fixed dimensionality. We depart from this design and propose Matryoshka Audio-Text Embeddings (MATE), a dual-encoder framework that encodes multiple embedding granularities within a single vector via nested sub-embeddings ("prefixes"). Specifically, we introduce a PCA-guided prefix alignment: PCA-compressed versions of the full text embedding for each prefix size serve as teacher targets to align both audio and text prefixes. This alignment concentrates salient keyword cues in lower-dimensional prefixes, while higher dimensions add detail. MATE is trained with standard deep metric learning objectives for audio-text KWS, and is loss-agnostic. To our knowledge, this is the first application of matryoshka-style embeddings to KWS, achieving state-of-the-art results on WSJ and LibriPhrase without any inference overhead.
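The PCA-guided prefix alignment described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the cosine-distance form of the alignment term, and the prefix sizes are all assumptions made for illustration, and the abstract does not specify how the teacher targets enter the loss. In real training the PCA targets would presumably be treated as fixed (no gradient), which NumPy does not model.

```python
import numpy as np

def pca_teacher_targets(text_emb, prefix_sizes):
    """Hypothetical helper: PCA-compress full text embeddings (N, D) to each prefix size d."""
    centered = text_emb - text_emb.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return {d: centered @ vt[:d].T for d in prefix_sizes}

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def prefix_alignment_loss(audio_emb, text_emb, prefix_sizes):
    """Assumed alignment term: mean cosine distance between each prefix and its PCA teacher."""
    targets = pca_teacher_targets(text_emb, prefix_sizes)
    loss = 0.0
    for d in prefix_sizes:
        teacher = l2_normalize(targets[d])
        # Both the audio prefix and the text prefix are pulled toward the same teacher.
        for student in (audio_emb[:, :d], text_emb[:, :d]):
            cos = np.sum(l2_normalize(student) * teacher, axis=1)
            loss += np.mean(1.0 - cos)
    return loss / (2 * len(prefix_sizes))

def prefix_score(audio_vec, text_vec, d):
    """Cosine similarity using only the first d dimensions (the nested prefix)."""
    a, t = audio_vec[:d], text_vec[:d]
    return float(a @ t / (np.linalg.norm(a) * np.linalg.norm(t) + 1e-12))

# Toy data: 32 keyword embeddings of dimension 16 (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
text = rng.normal(size=(32, 16))
audio = text + 0.1 * rng.normal(size=(32, 16))
prefixes = [4, 8, 16]
align = prefix_alignment_loss(audio, text, prefixes)
score_small = prefix_score(audio[0], text[0], 4)
score_full = prefix_score(audio[0], text[0], 16)
```

In a setup like this, the alignment term would be added to whichever metric-learning loss is in use, and at inference a prefix is obtained by slicing the stored vector, so no extra encoder pass is needed to trade accuracy against embedding size.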
Related papers
- Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword Spotting [8.401528952094413]
For text enrollment-based open-vocabulary keyword spotting (KWS), acoustic and text embeddings are typically compared at either the phoneme or utterance level. We optimize acoustic and text encoders using deep metric learning (DML), enabling direct comparison of multi-modal embeddings in a shared embedding space. We propose Modality Adversarial Learning (MAL), which reduces the domain gap in heterogeneous modality representations.
arXiv Detail & Related papers (2025-05-22T14:49:46Z) - Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search [64.15205542003056]
We introduce the Attention-Guided Alignment (AGA) framework, featuring two innovative components: Attention-Guided Mask (AGM) Modeling and Text Enrichment Module (TEM). AGA achieves new state-of-the-art results with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTP, respectively.
arXiv Detail & Related papers (2024-12-19T17:51:49Z) - LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D is a visual-language task that segments all points of the specified object from a 3D point cloud described by a query sentence.
We propose a novel Referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z) - Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source pure Python implementation of the Byte Pair algorithm.
It is used to train a high quality tokenizer on a basic laptop.
arXiv Detail & Related papers (2024-08-05T09:37:21Z) - CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting [6.856101216726412]
This paper introduces a novel approach for streaming open-vocabulary keyword spotting (KWS) with text-based keyword enrollment.
For every input frame, the proposed method finds the optimal alignment ending at the frame using connectionist temporal classification (CTC).
We then aggregate the frame-level acoustic embedding (AE) to obtain higher-level (i.e., character, word, or phrase) AE that aligns with the text embedding (TE) of the target keyword text.
arXiv Detail & Related papers (2024-06-12T06:44:40Z) - A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching to liberate the model from the reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z) - Aligning Speakers: Evaluating and Visualizing Text-based Diarization Using Efficient Multiple Sequence Alignment (Extended Version) [21.325463387256807]
Two new metrics are proposed, Text-based Diarization Error Rate and Diarization F1, which perform utterance- and word-level evaluations.
Our metrics encompass more types of errors compared to existing ones, allowing us to make a more comprehensive analysis in speaker diarization.
arXiv Detail & Related papers (2023-09-14T12:43:26Z) - Learning-to-Rank Meets Language: Boosting Language-Driven Ordering Alignment for Ordinal Classification [60.28913031192201]
We present a novel language-driven ordering alignment method for ordinal classification.
Recent developments in pre-trained vision-language models inspire us to leverage the rich ordinal priors in human language.
Experiments on three ordinal classification tasks, including facial age estimation, historical color image (HCI) classification, and aesthetic assessment, demonstrate its promising performance.
arXiv Detail & Related papers (2023-06-24T04:11:31Z) - Matching Latent Encoding for Audio-Text based Keyword Spotting [9.599402723927733]
We propose an audio-text-based end-to-end model architecture for flexible keyword spotting (KWS).
Our architecture uses a novel dynamic programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally partition the audio sequence into the same length as the word-based text sequence.
Experimental results show that our DSP is more effective than other partitioning schemes.
arXiv Detail & Related papers (2023-06-08T14:44:23Z) - Aligning Bag of Regions for Open-Vocabulary Object Detection [74.89762864838042]
We propose to align the embedding of bag of regions beyond individual regions.
The proposed method groups contextually interrelated regions as a bag.
Our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of open-vocabulary COCO and LVIS benchmarks.
arXiv Detail & Related papers (2023-02-27T17:39:21Z) - Knowing Where and What: Unified Word Block Pretraining for Document Understanding [11.46378901674016]
We propose UTel, a language model with Unified TExt and layout pre-training.
Specifically, we propose two pre-training tasks: Surrounding Word Prediction (SWP) for the layout learning, and Contrastive learning of Word Embeddings (CWE) for identifying different word blocks.
In this way, the joint training of Masked Layout-Language Modeling (MLLM) and two newly proposed tasks enables the interaction between semantic and spatial features in a unified way.
arXiv Detail & Related papers (2022-07-28T09:43:06Z)