Multimodal Chaptering for Long-Form TV Newscast Video
- URL: http://arxiv.org/abs/2406.17590v1
- Date: Wed, 20 Mar 2024 08:39:41 GMT
- Title: Multimodal Chaptering for Long-Form TV Newscast Video
- Authors: Khalil Guetari, Yannis Tevissen, Frédéric Petitpont
- Abstract summary: Our method integrates both audio and visual cues through a two-stage process involving frozen neural networks and a trained LSTM network.
Our proposed model has been evaluated on a diverse dataset comprising over 500 TV newscast videos, averaging 41 minutes each, gathered from TF1, a French TV channel.
Experimental results demonstrate that this fusion strategy achieves state-of-the-art performance, yielding a precision of 82% at an IoU of 90%.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel approach for automatic chaptering of TV newscast videos, addressing the challenge of structuring and organizing large collections of unsegmented broadcast content. Our method integrates both audio and visual cues through a two-stage process involving frozen neural networks and a trained LSTM network. The first stage extracts essential features from the separate modalities, while the LSTM fuses these features to generate accurate segment boundaries. Our proposed model has been evaluated on a diverse dataset comprising over 500 TV newscast videos, averaging 41 minutes each, gathered from TF1, a French TV channel, with varying lengths and topics. Experimental results demonstrate that this fusion strategy achieves state-of-the-art performance, yielding a precision of 82% at an IoU of 90%. Consequently, this approach significantly enhances analysis, indexing, and storage capabilities for TV newscast archives, paving the way towards efficient management and utilization of vast audiovisual resources.
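As a reading aid, the sketch below illustrates the two-stage design described in the abstract, using PyTorch. The encoder dimensions, the per-step boundary-labeling scheme, and the precision-at-IoU helper are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumed shapes and names): frozen per-modality encoders are
# presumed to produce one feature vector per time step; a trained bidirectional
# LSTM fuses the concatenated features and predicts chapter-boundary logits.
import torch
import torch.nn as nn


class ChapterBoundaryLSTM(nn.Module):
    def __init__(self, audio_dim=768, visual_dim=512, hidden_dim=256):
        super().__init__()
        # Stage 2: trained LSTM over concatenated audio + visual features.
        self.lstm = nn.LSTM(
            input_size=audio_dim + visual_dim,
            hidden_size=hidden_dim,
            num_layers=1,
            batch_first=True,
            bidirectional=True,
        )
        # Per-time-step boundary logit (1 = chapter boundary at this step).
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, T, audio_dim)  from a frozen audio encoder
        # visual_feats: (batch, T, visual_dim) from a frozen visual encoder
        fused = torch.cat([audio_feats, visual_feats], dim=-1)
        out, _ = self.lstm(fused)
        return self.head(out).squeeze(-1)  # (batch, T) boundary logits


def precision_at_iou(pred_segments, gt_segments, iou_threshold=0.9):
    """Fraction of predicted segments that match some ground-truth segment
    with temporal IoU >= iou_threshold -- one common reading of
    'precision at an IoU of 90%'; the paper's exact protocol may differ.
    Segments are (start, end) tuples in seconds."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    if not pred_segments:
        return 0.0
    hits = sum(
        any(iou(p, g) >= iou_threshold for g in gt_segments)
        for p in pred_segments
    )
    return hits / len(pred_segments)


# Hypothetical usage: logits = ChapterBoundaryLSTM()(torch.randn(1, 2500, 768),
#                                                    torch.randn(1, 2500, 512))
```

In such a setup, training would typically minimize a per-step binary cross-entropy against annotated chapter boundaries, and thresholded (or peak-picked) logits would be converted into segment boundaries at inference time.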
Related papers
- The Solution for Temporal Action Localisation Task of Perception Test Challenge 2024 [27.30100635072298]
TAL focuses on identifying and classifying actions within specific time intervals throughout a video sequence.
We employ a data augmentation technique by expanding the training dataset using overlapping labels from the Something-SomethingV2 dataset.
For feature extraction, we utilize state-of-the-art models, including UMT and VideoMAEv2 for video features, and BEATs and CAV-MAE for audio features.
arXiv Detail & Related papers (2024-10-08T01:07:21Z)
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- Multi-Sentence Grounding for Long-term Instructional Video [63.27905419718045]
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset.
We construct a high-quality video-text dataset, named HowToStep, with supervision from multiple descriptive steps.
arXiv Detail & Related papers (2023-12-21T17:28:09Z)
- RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms to the complex time-frequency bins produced by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
arXiv Detail & Related papers (2023-09-29T12:38:00Z)
- Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS Instance Segmentation [10.789826145990016]
This paper presents a deep learning framework for medical video segmentation.
Our framework explicitly extracts features from neighbouring frames across the temporal dimension.
It combines them with a temporal feature blender, which then tokenises the high-level temporal feature to form a strong global feature encoded via a Swin Transformer.
arXiv Detail & Related papers (2023-02-22T12:09:39Z)
- M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval [34.343617836027725]
We propose a multi-level multi-modal hybrid fusion network to explore comprehensive interactions between text queries and each modality content in videos.
Our framework provides two kinds of training strategies, including an ensemble manner and an end-to-end manner.
arXiv Detail & Related papers (2022-08-16T10:51:37Z)
- Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel Transformer [29.03463312813923]
Video denoising aims to recover high-quality frames from the noisy video.
Most existing approaches adopt convolutional neural networks (CNNs) to separate the noise from the original visual content.
We propose a Dual-stage Spatial-Channel Transformer (DSCT) for coarse-to-fine video denoising.
arXiv Detail & Related papers (2022-04-30T09:01:21Z)
- A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification [64.59834310846516]
We propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC).
Our final system achieves the best accuracy, 94.2%, among all single AVSC systems submitted to DCASE 2021 Task 1b.
arXiv Detail & Related papers (2022-03-07T07:29:55Z)
- Joint Multimedia Event Extraction from Video and Article [51.159034070824056]
We propose the first approach to jointly extract events from video and text articles.
First, we propose the first self-supervised multimodal event coreference model.
Second, we introduce the first multimodal transformer which extracts structured event information jointly from both videos and text documents.
arXiv Detail & Related papers (2021-09-27T03:22:12Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
- Cross-media Structured Common Space for Multimedia Event Extraction [82.36301617438268]
We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents.
We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information into a common embedding space.
By utilizing images, we extract 21.4% more event mentions than traditional text-only methods.
arXiv Detail & Related papers (2020-05-05T20:21:53Z)