Context-aware Fine-tuning of Self-supervised Speech Models
- URL: http://arxiv.org/abs/2212.08542v2
- Date: Tue, 28 Mar 2023 21:20:11 GMT
- Title: Context-aware Fine-tuning of Self-supervised Speech Models
- Authors: Suwon Shon, Felix Wu, Kwangyoun Kim, Prashant Sridhar, Karen Livescu,
Shinji Watanabe
- Abstract summary: We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
- Score: 56.95389222319555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised pre-trained transformers have improved the state of the art
on a variety of speech tasks. Due to the quadratic time and space complexity of
self-attention, they usually operate at the level of relatively short (e.g.,
utterance) segments. In this paper, we study the use of context, i.e.,
surrounding segments, during fine-tuning and propose a new approach called
context-aware fine-tuning. We attach a context module on top of the last layer
of a pre-trained model to encode the whole segment into a context embedding
vector which is then used as an additional feature for the final prediction.
During the fine-tuning stage, we introduce an auxiliary loss that encourages
this context embedding vector to be similar to context vectors of surrounding
segments. This allows the model to make predictions without access to these
surrounding segments at inference time and requires only a tiny overhead
compared to standard fine-tuned models. We evaluate the proposed approach using
the SLUE and Libri-light benchmarks for several downstream tasks: automatic
speech recognition (ASR), named entity recognition (NER), and sentiment
analysis (SA). The results show that context-aware fine-tuning not only
outperforms a standard fine-tuning baseline but also rivals a strong context
injection baseline that uses neighboring speech segments during inference.
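As a concrete illustration of the method described in the abstract, below is a minimal PyTorch sketch. The mean-pool context module, the concatenation of the context embedding with frame features, and the cosine auxiliary objective are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of context-aware fine-tuning; shapes and objective are
# assumptions for illustration, not the paper's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAwareHead(nn.Module):
    """Context module attached on top of a pre-trained encoder's last layer."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        # Context module: here a simple mean-pool + projection over the segment.
        self.context_proj = nn.Linear(hidden_dim, hidden_dim)
        # Final predictor consumes frame features plus the context embedding.
        self.classifier = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, hidden_dim) from the pre-trained model's last layer.
        context = self.context_proj(frames.mean(dim=1))  # (batch, hidden_dim)
        expanded = context.unsqueeze(1).expand(-1, frames.size(1), -1)
        logits = self.classifier(torch.cat([frames, expanded], dim=-1))
        return logits, context


def context_similarity_loss(context: torch.Tensor,
                            neighbor_contexts: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss: pull the segment's context embedding toward the context
    embeddings of its surrounding segments (cosine distance assumed here)."""
    # context: (batch, hidden_dim); neighbor_contexts: (batch, k, hidden_dim)
    sims = F.cosine_similarity(context.unsqueeze(1), neighbor_contexts, dim=-1)
    return (1.0 - sims).mean()
```

Because the auxiliary loss is applied only during fine-tuning, inference needs nothing beyond the segment itself, which is what keeps the overhead tiny relative to a standard fine-tuned model.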
Related papers
- Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation [27.057810339120664]
We propose two techniques to improve context-aware ASR models.
On LibriSpeech, our techniques together reduce the rare word error rate by 60% and 25% relative to no biasing and shallow fusion, respectively.
On SPGISpeech and a real-world dataset ConEC, our techniques also yield good improvements over the baselines.
arXiv Detail & Related papers (2024-07-14T19:32:33Z)
- Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of context information generated by generative large language models (LLMs).
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z)
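A hedged sketch of the distillation idea in the entry above: during fine-tuning, the speech model's context embedding is pulled toward an embedding of the LLM-generated context text. The text encoder and the MSE objective are assumptions for illustration.

```python
# Illustrative only: distilling LLM-generated context into the speech model's
# context embedding during fine-tuning. Loss and weighting are assumptions.
import torch
import torch.nn.functional as F


def context_distillation_loss(speech_context: torch.Tensor,
                              llm_context: torch.Tensor,
                              weight: float = 1.0) -> torch.Tensor:
    # speech_context: (batch, hidden) from the speech model's context module.
    # llm_context:    (batch, hidden) embedding of LLM-generated context text.
    return weight * F.mse_loss(speech_context, llm_context)
```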
- Open-Vocabulary Segmentation with Semantic-Assisted Calibration [73.39366775301382]
We study open-vocabulary segmentation (OVS) through calibrating the in-vocabulary and domain-biased embedding space with the contextual prior of CLIP.
We present a Semantic-assisted CAlibration Network (SCAN) to achieve state-of-the-art performance on open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2023-12-07T07:00:09Z)
- Coherent Entity Disambiguation via Modeling Topic and Categorical Dependency [87.16283281290053]
Previous entity disambiguation (ED) methods adopt a discriminative paradigm, where prediction is made based on matching scores between mention context and candidate entities.
We propose CoherentED, an ED system equipped with novel designs aimed at enhancing the coherence of entity predictions.
We achieve new state-of-the-art results on popular ED benchmarks, with an average improvement of 1.3 F1 points.
arXiv Detail & Related papers (2023-11-06T16:40:13Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- Internal language model estimation through explicit context vector learning for attention-based encoder-decoder ASR [19.233720469733797]
We propose two novel approaches to estimate the biased internal language model (ILM) of Listen-Attend-Spell (LAS) models.
Experiments show that the ILMs estimated by the proposed methods achieve the lowest perplexity.
arXiv Detail & Related papers (2022-01-26T07:47:27Z)
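A hedged sketch of the general idea in the entry above: score text with the LAS decoder while the encoder-derived attention context is replaced by an explicitly learned vector. The decoder interface here is an assumed placeholder, not the paper's code.

```python
# Sketch under strong assumptions: the decoder is a callable taking
# (tokens, context) and returning logits; a learned constant context vector
# stands in for the acoustic attention context when estimating the ILM.
import torch
import torch.nn as nn


class LearnedContextILM(nn.Module):
    def __init__(self, decoder: nn.Module, context_dim: int):
        super().__init__()
        self.decoder = decoder  # pre-trained LAS decoder (assumed interface)
        self.context = nn.Parameter(torch.zeros(context_dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        ctx = self.context.expand(tokens.size(0), -1)
        return self.decoder(tokens, ctx)  # internal-LM logits over the vocabulary
```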
- With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition [95.99542238790038]
We propose a method that learns to attend to surrounding actions in order to improve recognition performance.
To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities.
We test our approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art performance.
arXiv Detail & Related papers (2021-11-01T15:27:35Z)
- A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings [32.59716743279858]
We look at representation learning at the short-time frame level.
Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models.
We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding, and a CAE to conventional MFCCs.
arXiv Detail & Related papers (2020-12-14T10:17:25Z)
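As a point of reference for the conventional baseline in the comparison above, frame-level MFCCs can be computed with torchaudio; the 25 ms / 10 ms framing and 13 coefficients below are common defaults, not necessarily the paper's exact settings.

```python
import torch
import torchaudio

# Conventional frame-level MFCC features (13 coefficients, 25 ms window,
# 10 ms hop at 16 kHz); these settings are assumed common defaults.
mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40},
)
waveform = torch.randn(1, 16000)  # one second of dummy audio
features = mfcc(waveform)         # (1, 13, num_frames)
```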
- Contextualized Translation of Automatically Segmented Speech [20.334746967390164]
We train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context.
Our solution is more robust to VAD-segmented input, outperforming both a strong base model and fine-tuning on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.
arXiv Detail & Related papers (2020-08-05T17:52:25Z)
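A minimal sketch of the "previous segment as context" variant from the entry above: concatenate the previous segment's audio before the current one and remember where the current segment starts. Boundary handling and any context separator are assumptions.

```python
import torch


# Illustrative only: prepend the previous segment's audio as context.
def with_previous_context(prev_wav: torch.Tensor, cur_wav: torch.Tensor):
    # Both tensors: (channels, samples). Returns the contextualized input
    # and the sample offset at which the current segment begins.
    combined = torch.cat([prev_wav, cur_wav], dim=-1)
    return combined, prev_wav.size(-1)
```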
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.