Related papers: Comparative Analysis of Image, Video, and Audio Classifiers for Automated News Video Segmentation

Comparative Analysis of Image, Video, and Audio Classifiers for Automated News Video Segmentation

URL: http://arxiv.org/abs/2503.21848v1
Date: Thu, 27 Mar 2025 16:42:50 GMT
Title: Comparative Analysis of Image, Video, and Audio Classifiers for Automated News Video Segmentation
Authors: Jonathan Attard, Dylan Seychell,
Abstract summary: This paper presents a comparative analysis of image, video, and audio classifiers for automated news video segmentation.<n>Image-based classifiers achieve superior performance (84.34% accuracy) compared to more complex temporal models.<n> Binary classification models achieved high accuracy for transitions (94.23%) and advertisements (92.74%)
Score: 0.09208007322096533
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: News videos require efficient content organisation and retrieval systems, but their unstructured nature poses significant challenges for automated processing. This paper presents a comprehensive comparative analysis of image, video, and audio classifiers for automated news video segmentation. This work presents the development and evaluation of multiple deep learning approaches, including ResNet, ViViT, AST, and multimodal architectures, to classify five distinct segment types: advertisements, stories, studio scenes, transitions, and visualisations. Using a custom-annotated dataset of 41 news videos comprising 1,832 scene clips, our experiments demonstrate that image-based classifiers achieve superior performance (84.34\% accuracy) compared to more complex temporal models. Notably, the ResNet architecture outperformed state-of-the-art video classifiers while requiring significantly fewer computational resources. Binary classification models achieved high accuracy for transitions (94.23\%) and advertisements (92.74\%). These findings advance the understanding of effective architectures for news video segmentation and provide practical insights for implementing automated content organisation systems in media applications. These include media archiving, personalised content delivery, and intelligent video search.

Related papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding [45.83476222676765]
We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions.<n>The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process.<n>We show that VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks.
arXiv Detail & Related papers (2025-07-17T17:59:59Z)
Generative Ghost: Investigating Ranking Bias Hidden in AI-Generated Videos [106.5804660736763]
Video information retrieval remains a fundamental approach for accessing video content. We build on the observation that retrieval models often favor AI-generated content in ad-hoc and image retrieval tasks. We investigate whether similar biases emerge in the context of challenging video retrieval.
arXiv Detail & Related papers (2025-02-11T07:43:47Z)
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content. We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning [71.94122309290537]
We propose an efficient, online approach to generate dense captions for videos. Our model uses a novel autoregressive factorized decoding architecture. Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute.
arXiv Detail & Related papers (2024-11-22T02:46:44Z)
Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach [56.610806615527885]
A key challenge in text-video retrieval (TVR) is the information asymmetry between video and text. This paper introduces a data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content. We propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks. Our core design is a progressive training approach that unifies the masked video modeling, cross contrastive learning, and prediction token, scaling up to 6B video size.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human annotated dataset for evaluating the ability for visual-language models to generate descriptions for real-world video clips. The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of total 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
TNT: Text-Conditioned Network with Transductive Inference for Few-Shot Video Classification [26.12591949900602]
We formulate a text-based task conditioner to adapt video features to the few-shot learning task. Our model obtains state-of-the-art performance on four challenging benchmarks in few-shot video action classification.
arXiv Detail & Related papers (2021-06-21T15:08:08Z)
Highlight Timestamp Detection Model for Comedy Videos via Multimodal Sentiment Analysis [1.6181085766811525]
We propose a multimodal structure to obtain state-of-the-art performance in this field. We select several benchmarks for multimodal video understanding and apply the most suitable model to find the best performance.
arXiv Detail & Related papers (2021-05-28T08:39:19Z)
A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query. To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-trimmed frame level. We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z)
Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features. The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.