Improving Audio Caption Fluency with Automatic Error Correction
- URL: http://arxiv.org/abs/2306.10090v1
- Date: Fri, 16 Jun 2023 13:37:01 GMT
- Title: Improving Audio Caption Fluency with Automatic Error Correction
- Authors: Hanxue Zhang, Zeyu Xie, Xuenan Xu, Mengyue Wu, Kai Yu
- Abstract summary: We propose a new task of AAC error correction for post-processing AAC outputs.
We use observation-based rules to corrupt error-free captions, generating pseudo grammatically erroneous sentences.
We train a neural network-based model on the synthetic error dataset and apply the model to correct real errors in AAC outputs.
- Score: 23.157732462075547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated audio captioning (AAC) is an important cross-modality translation
task, aiming at generating descriptions for audio clips. However, captions
generated by previous AAC models have faced "false-repetition" errors due to
the training objective. In such scenarios, we propose a new task of AAC error
correction and hope to reduce such errors by post-processing AAC outputs. To
tackle this problem, we use observation-based rules to corrupt error-free
captions, generating pseudo grammatically erroneous sentences. Each pair of
corrupted and clean sentences can then be used for training. We train a neural
network-based model on the synthetic error dataset and apply the model to
correct real errors in AAC outputs. Results on two benchmark datasets indicate
that our approach significantly improves fluency while maintaining semantic
information.
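The corruption step described in the abstract can be sketched in Python. The abstract does not list the actual observation-based rules, so the two rules below (`repeat_word` and `repeat_phrase`) and the helper names `corrupt` and `make_pairs` are hypothetical illustrations of how "false-repetition" errors might be injected into clean captions to build (corrupted, clean) training pairs. This is a minimal sketch under those assumptions, not the authors' implementation.

```python
import random

# NOTE: both rules are hypothetical examples of "false-repetition" corruption;
# the paper's real rule set is not given in the abstract.

def repeat_word(tokens, rng):
    """Duplicate one randomly chosen token right after itself."""
    i = rng.randrange(len(tokens))
    return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:]

def repeat_phrase(tokens, rng, max_len=3):
    """Repeat a short span right after itself, sometimes joined by 'and'."""
    n = rng.randint(1, min(max_len, len(tokens)))
    start = rng.randrange(len(tokens) - n + 1)
    span = tokens[start:start + n]
    joiner = ["and"] if rng.random() < 0.5 else []
    return tokens[:start + n] + joiner + span + tokens[start + n:]

RULES = [repeat_word, repeat_phrase]

def corrupt(caption, rng):
    """Apply one randomly selected rule to a whitespace-tokenized caption."""
    tokens = caption.split()
    if not tokens:
        return caption
    rule = rng.choice(RULES)
    return " ".join(rule(tokens, rng))

def make_pairs(captions, seed=0):
    """Build (corrupted, clean) pairs for training a correction model."""
    rng = random.Random(seed)
    return [(corrupt(c, rng), c) for c in captions]

if __name__ == "__main__":
    clean = ["a dog barks while birds chirp", "rain falls on a roof"]
    for corrupted, original in make_pairs(clean):
        print(f"{corrupted!r} -> {original!r}")
```

A sequence-to-sequence model trained on such pairs would take the corrupted caption as input and the clean caption as target. The fixed seed only makes the synthetic data reproducible.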
Related papers
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel language modeling approach based on continuous-valued tokens for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- Parameter-tuning-free data entry error unlearning with adaptive selective synaptic dampening [51.34904967046097]
We introduce an extension to the selective synaptic dampening unlearning method that removes the need for parameter tuning.
We demonstrate the performance of this extension, adaptive selective synaptic dampening (ASSD) on various ResNet18 and Vision Transformer unlearning tasks.
The application of this approach is particularly compelling in industrial settings, such as supply chain management.
arXiv Detail & Related papers (2024-02-06T14:04:31Z)
- Optimized Tokenization for Transcribed Error Correction [10.297878672883973]
We show that the performance of correction models can be significantly increased by training solely using synthetic data.
Specifically, we show that synthetic data generated using the error distribution derived from a set of transcribed data outperforms the common approach of applying random perturbations.
arXiv Detail & Related papers (2023-10-16T12:14:21Z)
- Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection [54.20974251478516]
We propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting.
When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine utterances and fake utterances.
Our method can easily be generalized to related fields, like speech emotion recognition.
arXiv Detail & Related papers (2023-08-07T05:05:49Z)
- Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges Award at the Task6A of DCASE Challenge 2022.
arXiv Detail & Related papers (2023-04-06T07:58:27Z)
- ASR Error Detection via Audio-Transcript entailment [1.3750624267664155]
We propose an end-to-end approach for ASR error detection using audio-transcript entailment.
The proposed model utilizes an acoustic encoder and a linguistic encoder to model the speech and transcript respectively.
Our proposed model achieves classification error rates (CER) of 26.2% on all transcription errors and 23% on medical errors specifically, leading to improvements upon a strong baseline by 12% and 15.4%, respectively.
arXiv Detail & Related papers (2022-07-22T02:47:15Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- CL4AC: A Contrastive Loss for Audio Captioning [43.83939284740561]
We propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC).
In CL4AC, the self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and texts.
Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.
arXiv Detail & Related papers (2021-07-21T10:13:02Z)
- Empirical Error Modeling Improves Robustness of Noisy Neural Sequence Labeling [26.27504889360246]
We propose an empirical error generation approach that employs a sequence-to-sequence model trained to perform translation from error-free to erroneous text.
To overcome the data sparsity problem, which is exacerbated by imperfect textual input, we learned noisy language model-based embeddings.
Our approach outperformed the baseline noise generation and error correction techniques on the erroneous sequence labeling data sets.
arXiv Detail & Related papers (2021-05-25T12:15:45Z)
- A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction [54.569707226277735]
Existing approaches for grammatical error correction (GEC) rely on supervised learning with manually created GEC datasets.
There is a non-negligible amount of "noise" where errors were inappropriately edited or left uncorrected.
We propose a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models.
arXiv Detail & Related papers (2020-10-07T04:45:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.