Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper
- URL: http://arxiv.org/abs/2406.05806v4
- Date: Mon, 16 Sep 2024 16:26:49 GMT
- Title: Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper
- Authors: Chih-Kai Yang, Kuan-Po Huang, Hung-yi Lee
- Abstract summary: This research explores how the information carried in prompts interacts with the high-performing speech recognition model Whisper.
Our results unexpectedly show that Whisper may not understand the textual prompts in a human-expected way.
It is also noted that English prompts generally outperform Mandarin ones on datasets of both languages.
- Score: 51.12146889808824
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This research explores how the information carried in prompts interacts with the high-performing speech recognition model Whisper. We compare its performance when prompted with correct information against prompts corrupted with incorrect information. Our results unexpectedly show that Whisper may not understand textual prompts in a human-expected way. Additionally, we find that performance improvement is not guaranteed even with stronger adherence to the topic information in textual prompts. It is also noted that English prompts generally outperform Mandarin ones on datasets of both languages, likely due to differences in the training data distributions for these languages, despite the mismatch with pre-training scenarios. Conversely, we discover that Whisper exhibits awareness of misleading information in language tokens by ignoring incorrect language tokens and focusing on the correct ones. In sum, we raise insightful questions about Whisper's prompt understanding and reveal its counter-intuitive behaviors. We encourage further studies.
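As a rough illustration of the comparison described in the abstract, the sketch below feeds Whisper a topic-matched prompt, a topic-corrupted prompt, and no prompt, then scores each transcript with WER; it also forces an incorrect language token on the same audio. This is a minimal sketch built on the open-source `whisper` and `jiwer` packages, not the authors' code; the audio file, reference transcript, and prompt texts are hypothetical placeholders.

```python
import whisper
import jiwer

model = whisper.load_model("small")

audio_path = "clip.wav"  # placeholder audio file
reference = "the patient was prescribed ibuprofen for the inflammation"  # placeholder ground truth

prompts = {
    "correct topic":   "A clinical conversation about medication and inflammation.",
    "corrupted topic": "A sports commentary about a basketball game.",
    "no prompt":       None,
}

# Textual prompts: compare correct vs. corrupted topic information via `initial_prompt`.
for name, prompt in prompts.items():
    result = model.transcribe(audio_path, language="en", initial_prompt=prompt)
    error = jiwer.wer(reference, result["text"].lower())
    print(f"{name:15s} WER = {error:.3f}")

# Language tokens: force an incorrect language token on the same English audio;
# the paper reports Whisper tends to ignore the wrong token and decode correctly.
wrong_lang = model.transcribe(audio_path, language="zh")
print("forced <|zh|> token:", wrong_lang["text"])
```

A real replication would normalize punctuation and casing before scoring and average WER over a full test set rather than a single clip.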
Related papers
- Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization [2.403252956256118]
This paper explores two novel approaches to enhance Whisper's multilingual speech recognition performance in Indian languages.
First, we propose prompt-tuning with language family information, which enhances Whisper's accuracy in linguistically similar languages.
Second, we introduce a novel tokenizer that reduces the number of generated tokens, thereby accelerating Whisper's inference speed.
arXiv Detail & Related papers (2024-12-27T18:32:24Z)
- Dissecting Paraphrases: The Impact of Prompt Syntax and supplementary Information on Knowledge Retrieval from Pretrained Language Models [8.588056811772693]
ConPARE-LAMA is a probe consisting of 34 million distinct prompts that facilitate comparison across minimal paraphrases.
ConPARE-LAMA enables insights into the independent impact of either syntactical form or semantic information of paraphrases on the knowledge retrieval performance of PLMs.
arXiv Detail & Related papers (2024-04-02T14:35:08Z)
- Do Pre-Trained Language Models Detect and Understand Semantic Underspecification? Ask the DUST! [4.1970767174840455]
We study whether pre-trained language models (LMs) correctly identify and interpret underspecified sentences.
Our experiments show that when interpreting underspecified sentences, LMs exhibit little uncertainty, contrary to what theoretical accounts of underspecification would predict.
arXiv Detail & Related papers (2024-02-19T19:49:29Z)
- Enhancing expressivity transfer in textless speech-to-speech translation [0.0]
Existing state-of-the-art systems fall short when it comes to capturing and transferring expressivity accurately across different languages.
This study presents a novel method that operates at the discrete speech unit level and leverages multilingual emotion embeddings.
We demonstrate how these embeddings can be used to effectively predict the pitch and duration of speech units in the target language.
arXiv Detail & Related papers (2023-10-11T08:07:22Z)
- Acoustic and linguistic representations for speech continuous emotion recognition in call center conversations [2.0653090022137697]
We explore the use of pre-trained speech representations as a form of transfer learning towards the AlloSat corpus.
Our experiments confirm the large gain in performance obtained with the use of pre-trained features.
Surprisingly, we found that the linguistic content is clearly the major contributor for the prediction of satisfaction.
arXiv Detail & Related papers (2023-10-06T10:22:51Z)
- Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization [61.60501633397704]
We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.
We design task-specific prompts, by either leveraging another large-scale model, or simply manipulating the special tokens in the default prompts.
Experiments show that our proposed prompts improve performance by 10% to 45% on the three zero-shot tasks, and even outperform SotA supervised models on some datasets.
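As a hedged sketch of what such special-token manipulation looks like in practice (using the open-source `whisper` package rather than that paper's released prompts, and with a placeholder audio file), swapping the task token in Whisper's decoder prefix switches the same checkpoint from transcription to any-to-English translation:

```python
import whisper

# Whisper's decoder is conditioned on a special-token prefix such as
# <|startoftranscript|><|de|><|transcribe|>; DecodingOptions sets those tokens.
model = whisper.load_model("small")
audio = whisper.pad_or_trim(whisper.load_audio("german_clip.wav"))  # placeholder file
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Default behaviour: transcribe the (German) audio in German.
asr = whisper.decode(model, mel, whisper.DecodingOptions(language="de", task="transcribe"))

# Swapping the task token makes the same model translate the audio into English.
st = whisper.decode(model, mel, whisper.DecodingOptions(language="de", task="translate"))

print("transcription:", asr.text)
print("translation:  ", st.text)
```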
arXiv Detail & Related papers (2023-05-18T16:32:58Z)
- Context-faithful Prompting for Large Language Models [51.194410884263135]
Large language models (LLMs) encode parametric knowledge about world facts.
Their reliance on parametric knowledge may cause them to overlook contextual cues, leading to incorrect predictions in context-sensitive NLP tasks.
We assess and enhance LLMs' contextual faithfulness in two aspects: knowledge conflict and prediction with abstention.
arXiv Detail & Related papers (2023-03-20T17:54:58Z)
- Prompting Large Language Model for Machine Translation: A Case Study [87.88120385000666]
We offer a systematic study on prompting strategies for machine translation.
We examine factors for prompt template and demonstration example selection.
We explore the use of monolingual data and the feasibility of cross-lingual, cross-domain, and sentence-to-document transfer learning.
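To make the template and demonstration-selection factors concrete, here is a hedged sketch of a few-shot translation prompt; the German-English demonstration pairs and the test sentence are invented placeholders, not that paper's data or templates:

```python
# Build a few-shot German-to-English translation prompt from demonstration pairs.
demonstrations = [
    ("Das Wetter ist heute schön.", "The weather is nice today."),
    ("Ich habe den Bericht gelesen.", "I have read the report."),
]
source_sentence = "Der Zug kommt um acht Uhr an."  # placeholder test input

template = "German: {src}\nEnglish: {tgt}\n"
prompt = "".join(template.format(src=s, tgt=t) for s, t in demonstrations)
prompt += f"German: {source_sentence}\nEnglish:"

print(prompt)  # this string would then be sent to an LLM completion endpoint
```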
arXiv Detail & Related papers (2023-01-17T18:32:06Z)
- What BERT Based Language Models Learn in Spoken Transcripts: An Empirical Study [6.696983725360809]
Language Models (LMs) have been ubiquitously leveraged in various tasks, including spoken language understanding (SLU).
In this work, we propose to dissect SLU into three representative properties: speaker-related (disfluency, pause, overtalk), channel (conversation-type, turn-tasks), and ASR (insertion, deletion, substitution).
We probe BERT-based language models (BERT, RoBERTa) trained on spoken transcripts to investigate their ability to understand these multifarious properties in the absence of any speech cues.
arXiv Detail & Related papers (2021-09-19T11:23:50Z)
- Self-Supervised Representations Improve End-to-End Speech Translation [57.641761472372814]
We show that self-supervised pre-trained features can consistently improve the translation performance.
Cross-lingual transfer allows extending to a variety of languages with little or no tuning.
arXiv Detail & Related papers (2020-06-22T10:28:38Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected, but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)