On the Audio Hallucinations in Large Audio-Video Language Models
- URL: http://arxiv.org/abs/2401.09774v1
- Date: Thu, 18 Jan 2024 07:50:07 GMT
- Title: On the Audio Hallucinations in Large Audio-Video Language Models
- Authors: Taichi Nishimura and Shota Nakada and Masayoshi Kondo
- Abstract summary: This paper introduces audio hallucinations and analyzes them in large audio-video language models.
We gather 1,000 sentences by asking the models about audio information and annotate whether each contains a hallucination.
We tackle the task of audio hallucination classification using pre-trained audio-text models in zero-shot and fine-tuning settings.
- Score: 2.303098021872002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large audio-video language models can generate descriptions for both video
and audio. However, they sometimes ignore audio content and produce audio
descriptions that rely solely on visual information. This paper refers to this as
audio hallucinations and analyzes them in large audio-video language models. We
gather 1,000 sentences by asking the models about audio information and annotate
whether each contains a hallucination. If a sentence is hallucinated, we also
categorize the type of hallucination. The results reveal that 332 sentences are
hallucinated, with distinct trends in nouns and verbs for each hallucination
type. Based on this analysis, we tackle the task of audio hallucination
classification using pre-trained audio-text models in zero-shot and fine-tuning
settings. The zero-shot models achieve higher performance (52.2% F1) than a
random baseline (40.3%), and the fine-tuned models reach 87.9% F1, outperforming
the zero-shot models.
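As a concrete illustration of the zero-shot setting described above, a pre-trained audio-text model can score how well a generated sentence matches the accompanying audio and flag low-scoring sentences as possible audio hallucinations. The sketch below is a minimal, hypothetical version of this idea using a CLAP-style checkpoint from Hugging Face; the checkpoint name, similarity threshold, and file path are assumptions, not details taken from the paper.

# Minimal zero-shot sketch (not the paper's exact pipeline): flag a generated
# audio description as a possible audio hallucination when its audio-text
# similarity under a pre-trained CLAP model falls below a threshold.
import librosa
import torch
from transformers import ClapModel, ClapProcessor

MODEL_NAME = "laion/clap-htsat-unfused"  # assumed checkpoint, not from the paper
model = ClapModel.from_pretrained(MODEL_NAME)
processor = ClapProcessor.from_pretrained(MODEL_NAME)

def audio_text_similarity(wav_path: str, sentence: str) -> float:
    # CLAP expects 48 kHz mono audio.
    audio, _ = librosa.load(wav_path, sr=48000, mono=True)
    inputs = processor(text=[sentence], audios=[audio],
                       sampling_rate=48000, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_audio holds the similarity between the clip and each text.
    return out.logits_per_audio[0, 0].item()

THRESHOLD = 5.0  # hypothetical cut-off; in practice tuned on annotated data
sentence = "A dog is barking while a car engine starts."
score = audio_text_similarity("clip.wav", sentence)  # "clip.wav" is a placeholder path
print("possible audio hallucination" if score < THRESHOLD else "audio-grounded")

In the fine-tuning setting, one plausible setup would instead train a binary classifier on the annotated sentences (for example, on top of the concatenated audio and text embeddings); the sketch above covers only the zero-shot case.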
Related papers
- Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio [15.878350948461646]
We investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference.
By inducing hallucinations with various types of sounds, we show that a set of hallucinations appears frequently.
We then study hallucinations caused by the augmentation of speech with such sounds.
arXiv Detail & Related papers (2025-01-20T10:14:52Z) - Knowledge Overshadowing Causes Amalgamated Hallucination in Large Language Models [65.32990889402927]
We coin this phenomenon "knowledge overshadowing".
We show that the hallucination rate grows with both the imbalance ratio and the length of dominant condition description.
We propose to use overshadowing conditions as a signal to catch hallucinations before they are produced.
arXiv Detail & Related papers (2024-07-10T20:37:42Z) - VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z) - On Large Language Models' Hallucination with Regard to Known Facts [74.96789694959894]
Large language models are successful in answering factoid questions but are also prone to hallucination.
We investigate the phenomenon of LLMs possessing correct answer knowledge yet still hallucinating from the perspective of inference dynamics.
Our study sheds light on the reasons for LLMs' hallucinations about facts they know and, more importantly, on accurately predicting when they are hallucinating.
arXiv Detail & Related papers (2024-03-29T06:48:30Z) - Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models [11.492702369437785]
Hallucinations are semantically unrelated to the source utterance, yet still fluent and coherent.
We show that commonly used metrics, such as word error rates, cannot differentiate between hallucinatory and non-hallucinatory models.
We devise a framework for identifying hallucinations by analysing their semantic connection with the ground truth and their fluency.
arXiv Detail & Related papers (2024-01-03T06:56:56Z) - Evaluating Hallucinations in Chinese Large Language Models [65.4771562909392]
We establish a benchmark named HalluQA (Chinese Hallucination Question-Answering) to measure the hallucination phenomenon in Chinese large language models.
We consider two types of hallucinations: imitative falsehoods and factual errors, and we construct adversarial samples based on GLM-130B and ChatGPT.
For evaluation, we design an automated evaluation method using GPT-4 to judge whether a model output is hallucinated (a generic sketch of such an LLM-judge call is given after this list).
arXiv Detail & Related papers (2023-10-05T07:57:09Z) - Reducing Hallucinations in Neural Machine Translation with Feature Attribution [54.46113444757899]
We present a case study focusing on model understanding and regularisation to reduce hallucinations in NMT.
We first use feature attribution methods to study the behaviour of an NMT model that produces hallucinations.
We then leverage these methods to propose a novel loss function that substantially helps reduce hallucinations and does not require retraining the model from scratch.
arXiv Detail & Related papers (2022-11-17T20:33:56Z) - Plausible May Not Be Faithful: Probing Object Hallucination in
Vision-Language Pre-training [66.0036211069513]
Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text.
We show that models achieving better scores on standard metrics could hallucinate objects more frequently.
Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
arXiv Detail & Related papers (2022-10-14T10:27:22Z) - Thinking Hallucination for Video Captioning [0.76146285961466]
In video captioning, there are two kinds of hallucination: object and action hallucination.
We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy.
Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets.
arXiv Detail & Related papers (2022-09-28T06:15:42Z) - On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models? [32.41234580068662]
We conduct a study on existing knowledge-grounded conversational benchmarks and several state-of-the-art models.
Standard benchmarks consist of >60% hallucinated responses, leading to models that not only hallucinate but even amplify hallucinations.
Our findings raise important questions on the quality of existing datasets and models trained using them.
arXiv Detail & Related papers (2022-04-17T05:15:24Z)
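The HalluQA entry above describes an automated evaluation in which GPT-4 judges whether a model output is hallucinated. The paper's actual prompt and decision criteria are not reproduced here; the following is a generic, hypothetical LLM-as-judge sketch using the OpenAI Python SDK, with the prompt wording, model name, and example inputs as assumptions.

# Hypothetical LLM-as-judge sketch (not HalluQA's actual prompt or criteria):
# ask GPT-4 whether a model answer states facts unsupported by a reference.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_hallucination(question: str, reference: str, answer: str) -> bool:
    prompt = (
        "You are checking a model answer for hallucination.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Reply with exactly 'HALLUCINATED' if the model answer states facts "
        "not supported by the reference, otherwise reply 'OK'."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().startswith("HALLUCINATED")

# Example call with made-up inputs.
print(judge_hallucination(
    "Who wrote 'Dream of the Red Chamber'?",
    "Cao Xueqin",
    "It was written by Lu Xun in 1921.",
))  # expected: True

A judge of this kind is usually spot-checked against a sample of human annotations before being used at scale.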