Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning
- URL: http://arxiv.org/abs/2204.05148v2
- Date: Sat, 21 Oct 2023 10:15:36 GMT
- Title: Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning
- Authors: Robin Algayres, Adel Nabli, Benoit Sagot, Emmanuel Dupoux
- Abstract summary: We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective.
We show that when built on top of recent self-supervised audio representations, this method can be applied iteratively and yield competitive speech sequence embeddings (SSE).
- Score: 15.729812221628382
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a simple neural encoder architecture that can be trained using
an unsupervised contrastive learning objective which gets its positive samples
from data-augmented k-Nearest Neighbors search. We show that when built on top
of recent self-supervised audio representations, this method can be applied
iteratively and yield competitive SSE as evaluated on two tasks:
query-by-example of random sequences of speech, and spoken term discovery. On
both tasks our method pushes the state-of-the-art by a significant margin
across 5 different languages. Finally, we establish a benchmark on a
query-by-example task on the LibriSpeech dataset to monitor future improvements
in the field.
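The training objective described in the abstract can be sketched in a few lines. The following is a minimal illustrative sketch, not the authors' implementation: random vectors stand in for encoder outputs, positives come from a cosine-similarity k-nearest-neighbors search, and the loss is a standard InfoNCE contrastive objective; all function names and hyperparameters are assumptions.

```python
import numpy as np

def knn_positives(embeddings, k=1):
    """Return, for each embedding, the indices of its k nearest
    neighbors by cosine similarity (self-matches excluded)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # never pick an item as its own positive
    return np.argsort(-sim, axis=1)[:, :k]

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE contrastive loss: each anchor is attracted to its kNN
    positive and repelled from the other positives in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 16))  # stand-in for encoder outputs
pos_idx = knn_positives(embeddings, k=1)[:, 0]
loss = info_nce_loss(embeddings, embeddings[pos_idx])
```

In the iterative scheme the abstract describes, one would train the encoder on this loss, re-embed the data with the improved encoder, and rerun the kNN search to obtain better positives.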
Related papers
- On the Noise Robustness of In-Context Learning for Text Generation [41.59602454113563]
In this work, we show that, on text generation tasks, noisy annotations significantly hurt the performance of in-context learning.
To circumvent the issue, we propose a simple and effective approach called Local Perplexity Ranking (LPR).
LPR replaces the "noisy" candidates with their nearest neighbors that are more likely to be clean.
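The replacement step LPR performs can be illustrated with a toy sketch. This is only an illustration of the stated idea, not the paper's algorithm: demonstrations whose perplexity exceeds a threshold are treated as noisy and swapped for their nearest below-threshold neighbor in embedding space; the function name, the Euclidean metric, and the thresholding rule are all assumptions.

```python
import numpy as np

def local_perplexity_ranking(embeddings, perplexities, threshold):
    """Keep demonstrations with perplexity <= threshold; replace each
    remaining (likely noisy) one with its nearest clean neighbor."""
    clean = np.flatnonzero(perplexities <= threshold)
    selected = []
    for i in range(len(embeddings)):
        if perplexities[i] <= threshold:
            selected.append(i)
        else:
            dists = np.linalg.norm(embeddings[clean] - embeddings[i], axis=1)
            selected.append(int(clean[np.argmin(dists)]))
    return selected

embs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
ppl = np.array([1.0, 1.0, 10.0])
result = local_perplexity_ranking(embs, ppl, threshold=2.0)  # → [0, 1, 1]
```

Here the third demonstration is flagged as noisy (perplexity 10.0) and replaced by its nearest clean neighbor, index 1.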
arXiv Detail & Related papers (2024-05-27T15:22:58Z)
- Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos [63.94040814459116]
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence.
We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps.
We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
arXiv Detail & Related papers (2023-08-19T09:12:13Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- Learning Decoupled Retrieval Representation for Nearest Neighbour Neural Machine Translation [16.558519886325623]
kNN-MT successfully incorporates external corpus by retrieving word-level representations at test time.
In this work, we highlight that coupling the representations of these two tasks is sub-optimal for fine-grained retrieval.
We leverage supervised contrastive learning to learn the distinctive retrieval representation derived from the original context representation.
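The test-time retrieval step that kNN-MT builds on can be illustrated with a toy datastore. This is a simplified sketch of the general idea, not the paper's system: context vectors serve as keys, target tokens as values, and retrieval turns the distances of the k nearest keys into a token distribution; the class name, distance weighting, and parameters are illustrative assumptions.

```python
import numpy as np

class KNNDatastore:
    """Toy word-level datastore in the spirit of kNN-MT: each entry maps
    a context representation (key) to the target token it produced."""
    def __init__(self):
        self.keys, self.values = [], []

    def add(self, key, token):
        self.keys.append(key)
        self.values.append(token)

    def retrieve(self, query, k=2, temperature=1.0):
        """Return a token distribution from the k nearest keys,
        weighting each neighbor by a softmax over negative distances."""
        keys = np.stack(self.keys)
        dists = np.linalg.norm(keys - query, axis=1)
        idx = np.argsort(dists)[:k]
        weights = np.exp(-dists[idx] / temperature)
        weights /= weights.sum()
        probs = {}
        for i, w in zip(idx, weights):
            probs[self.values[i]] = probs.get(self.values[i], 0.0) + float(w)
        return probs

ds = KNNDatastore()
ds.add(np.array([0.0, 0.0]), "cat")
ds.add(np.array([0.1, 0.0]), "cat")
ds.add(np.array([5.0, 5.0]), "dog")
probs = ds.retrieve(np.array([0.0, 0.1]), k=2)
```

In a full kNN-MT system this retrieved distribution would be interpolated with the translation model's own prediction; the paper summarized above argues that learning a retrieval representation decoupled from the context representation improves this fine-grained lookup.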
arXiv Detail & Related papers (2022-09-19T03:19:38Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech [44.68649535280397]
We propose a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE)
SLUE consists of limited-size labeled training sets and corresponding evaluation sets.
We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets.
We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.
arXiv Detail & Related papers (2021-11-19T18:59:23Z)
- RETRONLU: Retrieval Augmented Task-Oriented Semantic Parsing [11.157958012672202]
We apply retrieval-based modeling ideas to the problem of multi-domain task-oriented semantic parsing.
Our approach, RetroNLU, extends a sequence-to-sequence model architecture with a retrieval component.
We analyze the nearest neighbor retrieval component's quality, model sensitivity and break down the performance for semantic parses of different utterance complexity.
arXiv Detail & Related papers (2021-09-21T19:30:30Z)
- Learning to Ask Conversational Questions by Optimizing Levenshtein Distance [83.53855889592734]
We introduce a Reinforcement Iterative Sequence Editing (RISE) framework that optimizes the minimum Levenshtein distance (MLD) through explicit editing actions.
RISE is able to pay attention to tokens that are related to conversational characteristics.
Experimental results on two benchmark datasets show that RISE significantly outperforms state-of-the-art methods.
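The quantity RISE optimizes is based on Levenshtein distance, which has a standard definition worth recalling; the sketch below is the textbook dynamic-programming computation, not anything specific to the RISE framework.

```python
def levenshtein(a, b):
    """Minimum number of single-element insertions, deletions, and
    substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # delete x
                           cur[j - 1] + 1,         # insert y
                           prev[j - 1] + (x != y)))  # substitute (or match)
        prev = cur
    return prev[-1]

distance = levenshtein("kitten", "sitting")  # → 3
```

RISE minimizes this distance between a generated question and the target by choosing explicit edit actions rather than regenerating the whole sequence.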
arXiv Detail & Related papers (2021-06-30T08:44:19Z)
- Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding [101.24748444126982]
Decomposable tasks are complex and comprise a hierarchy of sub-tasks.
Existing benchmarks, however, typically hold out examples for only the surface-level sub-task.
We propose a framework to construct robust test sets using coordinate ascent over sub-task specific utility functions.
arXiv Detail & Related papers (2021-06-29T02:53:59Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.