Audio Summarization with Audio Features and Probability Distribution Divergence
- URL: http://arxiv.org/abs/2001.07098v2
- Date: Thu, 2 Apr 2020 09:28:02 GMT
- Title: Audio Summarization with Audio Features and Probability Distribution Divergence
- Authors: Carlos-Emiliano González-Gallardo, Romain Deveaud, Eric SanJuan, and Juan-Manuel Torres-Moreno
- Abstract summary: We focus on audio summarization based on audio features and probability distribution divergence.
Our method, based on an extractive summarization approach, aims to select the most relevant segments until a time threshold is reached.
- Score: 1.0587107940165885
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The automatic summarization of multimedia sources is an important task that
facilitates the understanding of an individual by condensing the source while
maintaining relevant information. In this paper we focus on audio summarization
based on audio features and probability distribution divergence. Our
method, based on an extractive summarization approach, aims to select the most
relevant segments until a time threshold is reached. It takes into account the
segment's length, position and informativeness value. Informativeness of each
segment is obtained by mapping a set of audio features issued from its
Mel-frequency Cepstral Coefficients and their corresponding Jensen-Shannon
divergence score. Results over a multi-evaluator scheme show that our approach
provides understandable and informative summaries.
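
The two ingredients named in the abstract, a Jensen-Shannon divergence score and extractive selection under a time threshold, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(start, duration, score)` tuple layout, and the greedy highest-score-first strategy are assumptions made for illustration, and the informativeness scores are taken as given rather than derived from Mel-frequency Cepstral Coefficients.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    With base-2 logarithms the value lies in [0, 1].
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def select_segments(segments, max_seconds):
    """Greedily pick the highest-scoring segments that fit the time budget.

    segments: list of (start_seconds, duration_seconds, score) tuples
              (hypothetical layout, chosen for this sketch).
    Returns the chosen segments in chronological order.
    """
    chosen, total = [], 0.0
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if total + seg[1] <= max_seconds:
            chosen.append(seg)
            total += seg[1]
    return sorted(chosen, key=lambda s: s[0])


if __name__ == "__main__":
    print(js_divergence([0.5, 0.5], [0.5, 0.5]))
    print(js_divergence([1.0, 0.0], [0.0, 1.0]))
    demo = [(0, 10, 0.2), (10, 20, 0.9), (30, 15, 0.5)]
    print(select_segments(demo, max_seconds=30))
```

The greedy loop skips any segment that would exceed the budget and keeps trying lower-scoring ones, so the summary can fill the threshold more tightly than stopping at the first overflow would. The abstract also mentions segment length and position as factors; here they would have to be folded into the score beforehand.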
Related papers
- Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning [56.873534081386]
A new topic HIREST is presented, including video retrieval, moment retrieval, moment segmentation, and step-captioning.
We propose a query-centric audio-visual cognition network to construct a reliable multi-modal representation for three tasks.
This can cognize user-preferred content and thus attain a query-centric audio-visual representation for three tasks.
arXiv Detail & Related papers (2024-12-18T06:43:06Z)
- Multimodal Variational Auto-encoder based Audio-Visual Segmentation [46.67599800471001]
ECMVAE factorizes the representations of each modality with a modality-shared representation and a modality-specific representation.
Our approach leads to a new state-of-the-art for audio-visual segmentation, with a 3.84 mIOU performance leap.
arXiv Detail & Related papers (2023-10-12T13:09:40Z)
- LLM Based Multi-Document Summarization Exploiting Main-Event Biased Monotone Submodular Content Extraction [42.171703872560286]
Multi-document summarization is a challenging task due to its inherent subjective bias.
We aim to enhance the objectivity of news summarization by focusing on the main event of a group of related news documents.
arXiv Detail & Related papers (2023-10-05T09:38:09Z)
- Audio-Visual Speaker Verification via Joint Cross-Attention [4.229744884478575]
Cross-modal joint attention is used to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification.
We have shown that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification.
arXiv Detail & Related papers (2023-09-28T16:25:29Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking [18.225204270240734]
We propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities.
MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively.
arXiv Detail & Related papers (2021-12-14T14:14:17Z)
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches just exploit the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Multimodal Attention Fusion for Target Speaker Extraction [108.73502348754842]
We propose a novel attention mechanism for multi-modal fusion and its training methods.
Our proposals improve signal to distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data.
arXiv Detail & Related papers (2021-02-02T05:59:35Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite being automatically constructed, achieve similar downstream performances to existing video datasets with similar scales.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.