Video-Based MPAA Rating Prediction: An Attention-Driven Hybrid Architecture Using Contrastive Learning
- URL: http://arxiv.org/abs/2509.06826v1
- Date: Mon, 08 Sep 2025 16:01:02 GMT
- Title: Video-Based MPAA Rating Prediction: An Attention-Driven Hybrid Architecture Using Contrastive Learning
- Authors: Dipta Neogi, Nourash Azmine Chowdhury, Muhammad Rafsan Kabir, Mohammad Ashrafuzzaman Khan
- Abstract summary: We employ contrastive learning for improved discrimination and adaptability, exploring three frameworks: Instance Discrimination, Contextual Contrastive Learning, and Multi-View Contrastive Learning. Our hybrid architecture integrates an LRCN (CNN+LSTM) backbone with a Bahdanau attention mechanism, achieving state-of-the-art performance in the Contextual Contrastive Learning framework, with 88% accuracy and an F1 score of 0.8815. We evaluate the model's performance across various contrastive loss functions, including NT-Xent, NT-logistic, and Margin Triplet, demonstrating the robustness of our proposed architecture.
- Score: 0.1749935196721634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid growth of visual content consumption across platforms necessitates automated video classification for age-suitability standards like the MPAA rating system (G, PG, PG-13, R). Traditional methods struggle with large labeled data requirements, poor generalization, and inefficient feature learning. To address these challenges, we employ contrastive learning for improved discrimination and adaptability, exploring three frameworks: Instance Discrimination, Contextual Contrastive Learning, and Multi-View Contrastive Learning. Our hybrid architecture integrates an LRCN (CNN+LSTM) backbone with a Bahdanau attention mechanism, achieving state-of-the-art performance in the Contextual Contrastive Learning framework, with 88% accuracy and an F1 score of 0.8815. By combining CNNs for spatial features, LSTMs for temporal modeling, and attention mechanisms for dynamic frame prioritization, the model excels in fine-grained borderline distinctions, such as differentiating PG-13 and R-rated content. We evaluate the model's performance across various contrastive loss functions, including NT-Xent, NT-logistic, and Margin Triplet, demonstrating the robustness of our proposed architecture. To ensure practical application, the model is deployed as a web application for real-time MPAA rating classification, offering an efficient solution for automated content compliance across streaming platforms.
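The architecture the abstract describes (a per-frame CNN for spatial features, an LSTM over the frame sequence, and Bahdanau-style additive attention that weights frames before classification) can be illustrated with a short PyTorch sketch. This is a minimal reading of the described design, not the authors' code: the ResNet-18 backbone, hidden sizes, attention dimension, and 224x224 inputs are all assumptions.

```python
# Minimal LRCN + Bahdanau-attention sketch for 4-way MPAA classification.
# Backbone choice, layer sizes, and input resolution are illustrative
# assumptions, not the paper's reported configuration.
import torch
import torch.nn as nn
import torchvision.models as models

class BahdanauAttention(nn.Module):
    """Additive attention over LSTM hidden states (one weight per frame)."""
    def __init__(self, hidden_dim: int, attn_dim: int = 128):
        super().__init__()
        self.W = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, h):                       # h: (B, T, H)
        scores = self.v(torch.tanh(self.W(h)))  # (B, T, 1)
        alpha = torch.softmax(scores, dim=1)    # per-frame attention weights
        context = (alpha * h).sum(dim=1)        # (B, H) weighted summary
        return context, alpha

class LRCNWithAttention(nn.Module):
    def __init__(self, hidden_dim: int = 256, num_classes: int = 4):
        super().__init__()
        backbone = models.resnet18(weights=None)  # assumed CNN backbone
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.lstm = nn.LSTM(512, hidden_dim, batch_first=True)
        self.attn = BahdanauAttention(hidden_dim)
        self.head = nn.Linear(hidden_dim, num_classes)  # G/PG/PG-13/R

    def forward(self, frames):                  # frames: (B, T, 3, 224, 224)
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (B*T, 512)
        h, _ = self.lstm(feats.view(B, T, -1))             # (B, T, H)
        context, _ = self.attn(h)               # attention pools the frames
        return self.head(context)               # logits over MPAA classes
```

The attention weights give the "dynamic frame prioritization" the abstract mentions: frames whose hidden states score highly dominate the pooled representation, which is plausibly what helps on borderline PG-13 vs. R cases.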
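Of the contrastive losses the paper compares, NT-Xent is the most standard. A SimCLR-style sketch follows, where each clip embedding is pulled toward its augmented view and pushed away from the other 2N-2 items in the batch; the temperature value is an assumption.

```python
# NT-Xent (normalized temperature-scaled cross entropy) loss sketch.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature: float = 0.5):
    """z1, z2: (N, D) embeddings of two views of the same N clips."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit norm
    sim = z @ z.t() / temperature                       # scaled cosine sims
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))               # drop self-similarity
    # the positive for row i is its counterpart in the other view
    targets = torch.cat([torch.arange(n, 2 * n),
                         torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```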
Related papers
- CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video [9.172799792564009]
We propose CAMP-VQA, a novel NR-VQA framework that exploits the semantic understanding capabilities of large models. Our approach introduces a quality-aware video metadata mechanism that integrates key fragments extracted from inter-frame variations. Our model consistently outperforms existing NR-VQA methods, achieving improved accuracy without the need for costly manual fine-grained annotations.
arXiv Detail & Related papers (2025-11-10T16:37:47Z) - FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning [65.42201665046505]
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail. We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract
arXiv Detail & Related papers (2025-09-28T17:59:43Z) - DyCAF-Net: Dynamic Class-Aware Fusion Network [0.0]
We introduce Dynamic Class-Aware Fusion Network (DyCAF-Net). DyCAF-Net achieves significant improvements in precision, mAP@50, and mAP@50-95 across 13 diverse benchmarks. Its adaptability to scale variance, semantic overlaps, and class imbalance positions it as a robust solution for real-world detection tasks.
arXiv Detail & Related papers (2025-08-05T16:06:26Z) - A Framework Combining 3D CNN and Transformer for Video-Based Behavior Recognition [0.0]
We propose a hybrid framework combining 3D CNN and Transformer architectures. The 3D CNN module extracts low-level spatio-temporal features, while the Transformer module captures long-range temporal dependencies. The proposed model outperforms traditional 3D CNNs and standalone Transformers, achieving higher recognition accuracy with manageable complexity.
arXiv Detail & Related papers (2025-08-02T07:33:29Z) - Benefits of Feature Extraction and Temporal Sequence Analysis for Video Frame Prediction: An Evaluation of Hybrid Deep Learning Models [3.7049613588433497]
Video frame prediction has critical applications in weather forecasting and autonomous systems. This paper evaluates several hybrid deep learning approaches that combine the feature extraction capabilities of autoencoders with temporal sequence modelling.
arXiv Detail & Related papers (2025-07-28T10:07:00Z) - Embedding-based Retrieval in Multimodal Content Moderation [20.899256623912933]
We introduce an Embedding-Based Retrieval (EBR) method designed to complement traditional classification approaches. EBR increases action rates by 10.32% and reduces operational costs by over 80%, while also enhancing interpretability and flexibility.
arXiv Detail & Related papers (2025-06-30T19:11:25Z) - ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z) - Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on the Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z) - CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z) - Switchable Representation Learning Framework with Self-compatibility [50.48336074436792]
We propose a Switchable representation learning Framework with Self-Compatibility (SFSC).
SFSC generates a series of compatible sub-models with different capacities through one training process.
SFSC achieves state-of-the-art performance on the evaluated datasets.
arXiv Detail & Related papers (2022-06-16T16:46:32Z) - Imposing Consistency for Optical Flow Estimation [73.53204596544472]
Imposing consistency through proxy tasks has been shown to enhance data-driven learning.
This paper introduces novel and effective consistency strategies for optical flow estimation.
arXiv Detail & Related papers (2022-04-14T22:58:30Z) - Colar: Effective and Efficient Online Action Detection by Consulting Exemplars [102.28515426925621]
We develop an effective exemplar-consultation mechanism that first measures the similarity between a frame and exemplary frames, and then aggregates exemplary features based on the similarity weights.
Due to the complementarity from the category-level modeling, our method employs a lightweight architecture but achieves new high performance on three benchmarks.
arXiv Detail & Related papers (2022-03-02T12:13:08Z) - HRVGAN: High Resolution Video Generation using Spatio-Temporal GAN [0.0]
We propose a novel deep generative network architecture designed specifically for high-resolution video synthesis. Our approach integrates key concepts from Wasserstein Generative Adversarial Networks (WGANs). Our training objective combines a pixel-wise mean squared error loss with an adversarial loss to balance frame-level accuracy and video realism.
arXiv Detail & Related papers (2020-08-17T20:45:59Z)