Related papers: A Toolchain for Comprehensive Audio/Video Analysis Using Deep Learning Based Multimodal Approach (A use case of riot or violent context detection)

URL: http://arxiv.org/abs/2407.03110v1
Date: Thu, 2 May 2024 07:34:31 GMT
Title: A Toolchain for Comprehensive Audio/Video Analysis Using Deep Learning Based Multimodal Approach (A use case of riot or violent context detection)
Authors: Lam Pham, Phat Lam, Tin Nguyen, Hieu Tang, Alexander Schindler
Abstract summary: We present a toolchain for comprehensive audio/video analysis that leverages a deep learning based multimodal approach. By combining individual tasks and analyzing both audio & visual data extracted from input video, the toolchain offers various audio/video-based applications.
Score: 40.20142677441689
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In this paper, we present a toolchain for comprehensive audio/video analysis that leverages a deep learning based multimodal approach. To this end, the specific tasks of Speech to Text (S2T), Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), Visual Object Detection (VOD), Image Captioning (IC), and Video Captioning (VC) are conducted and integrated into the toolchain. By combining individual tasks and analyzing both audio & visual data extracted from input video, the toolchain offers various audio/video-based applications: two general applications, audio/video clustering and comprehensive audio/video summary, and a specific application, riot or violent context detection. Furthermore, the toolchain presents a flexible and adaptable architecture that can effectively integrate new models for further audio/video-based applications.

Related papers

Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval [58.640807985155554]
Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to a given query. Most existing VMR methods solely focus on the visual and textual modalities while neglecting the complementary but important audio modality. We propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR.
arXiv Detail & Related papers (2025-08-06T09:58:43Z)
Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models [13.63552417613795]
We propose a novel zero-shot AVS framework that eliminates task-specific training by leveraging multiple pretrained models. Our approach integrates audio, vision, and text representations to bridge modality gaps, enabling precise sound source segmentation without AVS-specific annotations.
arXiv Detail & Related papers (2025-06-06T21:06:35Z)
Audio-centric Video Understanding Benchmark without Text Shortcut [49.01648001666229]
Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs). This paper proposes an audio-centric video understanding benchmark (AVUT) to evaluate the video comprehension capabilities of multimodal LLMs.
arXiv Detail & Related papers (2025-03-25T16:28:24Z)
Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup [2.80888070977859]
We propose audio-visual SSL for video action recognition, which uses both visual and audio together. In experiments on UCF-51, Kinetics-400, and VGGSound datasets, our model shows the superior performance of the proposed framework.
arXiv Detail & Related papers (2025-03-04T05:13:56Z)
CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection [2.110168344647122]
Voice Activity Detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech. We introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. Our approach outperforms several audio-visual methods despite its simplicity, and without requiring pre-training on extensive audio-visual datasets.
arXiv Detail & Related papers (2024-10-18T14:43:34Z)
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models [27.54879344983513]
Video-SALMONN can understand not only visual frame sequences, audio events and music, but speech as well. Video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that are unprecedented by other av-LLMs.
arXiv Detail & Related papers (2024-06-22T01:36:11Z)
UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization [83.89550658314741]
Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). We present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time.
arXiv Detail & Related papers (2024-04-04T03:28:57Z)
Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning. We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset [53.46019570679092]
We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation. VALOR jointly models relationships of vision, audio and language in an end-to-end manner. It achieves new state-of-the-art performances on a series of public cross-modality benchmarks.
arXiv Detail & Related papers (2023-04-17T15:08:15Z)
Estimating Visual Information From Audio Through Manifold Learning [14.113590443352495]
We propose a new framework for extracting visual information about a scene only using audio signals. Our framework is based on Manifold Learning and consists of two steps. We show that our method is able to produce meaningful images from audio using a publicly available audio/visual dataset.
arXiv Detail & Related papers (2022-08-03T20:47:11Z)
AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches just exploit the visual information while neglecting the audio information. We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially-infinitely many output speakers. We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.