Video-Based MPAA Rating Prediction: An Attention-Driven Hybrid Architecture Using Contrastive Learning
- URL: http://arxiv.org/abs/2509.06826v1
- Date: Mon, 08 Sep 2025 16:01:02 GMT
- Title: Video-Based MPAA Rating Prediction: An Attention-Driven Hybrid Architecture Using Contrastive Learning
- Authors: Dipta Neogi, Nourash Azmine Chowdhury, Muhammad Rafsan Kabir, Mohammad Ashrafuzzaman Khan
- Abstract summary: We employ contrastive learning for improved discrimination and adaptability, exploring three frameworks: Instance Discrimination, Contextual Contrastive Learning, and Multi-View Contrastive Learning. Our hybrid architecture integrates an LRCN (CNN+LSTM) backbone with a Bahdanau attention mechanism, achieving state-of-the-art performance in the Contextual Contrastive Learning framework, with 88% accuracy and an F1 score of 0.8815. We evaluate the model's performance across various contrastive loss functions, including NT-Xent, NT-logistic, and Margin Triplet, demonstrating the robustness of our proposed architecture.
- Score: 0.1749935196721634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid growth of visual content consumption across platforms necessitates automated video classification for age-suitability standards like the MPAA rating system (G, PG, PG-13, R). Traditional methods struggle with large labeled data requirements, poor generalization, and inefficient feature learning. To address these challenges, we employ contrastive learning for improved discrimination and adaptability, exploring three frameworks: Instance Discrimination, Contextual Contrastive Learning, and Multi-View Contrastive Learning. Our hybrid architecture integrates an LRCN (CNN+LSTM) backbone with a Bahdanau attention mechanism, achieving state-of-the-art performance in the Contextual Contrastive Learning framework, with 88% accuracy and an F1 score of 0.8815. By combining CNNs for spatial features, LSTMs for temporal modeling, and attention mechanisms for dynamic frame prioritization, the model excels in fine-grained borderline distinctions, such as differentiating PG-13 and R-rated content. We evaluate the model's performance across various contrastive loss functions, including NT-Xent, NT-logistic, and Margin Triplet, demonstrating the robustness of our proposed architecture. To ensure practical application, the model is deployed as a web application for real-time MPAA rating classification, offering an efficient solution for automated content compliance across streaming platforms.
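The architecture the abstract describes (a per-frame CNN for spatial features, an LSTM over the frame sequence, and Bahdanau-style additive attention that weights frames before classification) can be illustrated with a short PyTorch sketch. This is a minimal reading of the described design, not the authors' code: the ResNet-18 backbone, hidden sizes, attention dimension, and 224x224 inputs are all assumptions.

```python
# Minimal LRCN + Bahdanau-attention sketch for 4-way MPAA classification.
# Backbone choice, layer sizes, and input resolution are illustrative
# assumptions, not the paper's reported configuration.
import torch
import torch.nn as nn
import torchvision.models as models

class BahdanauAttention(nn.Module):
    """Additive attention over LSTM hidden states (one weight per frame)."""
    def __init__(self, hidden_dim: int, attn_dim: int = 128):
        super().__init__()
        self.W = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, h):                       # h: (B, T, H)
        scores = self.v(torch.tanh(self.W(h)))  # (B, T, 1)
        alpha = torch.softmax(scores, dim=1)    # per-frame attention weights
        context = (alpha * h).sum(dim=1)        # (B, H) weighted summary
        return context, alpha

class LRCNWithAttention(nn.Module):
    def __init__(self, hidden_dim: int = 256, num_classes: int = 4):
        super().__init__()
        backbone = models.resnet18(weights=None)  # assumed CNN backbone
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.lstm = nn.LSTM(512, hidden_dim, batch_first=True)
        self.attn = BahdanauAttention(hidden_dim)
        self.head = nn.Linear(hidden_dim, num_classes)  # G/PG/PG-13/R

    def forward(self, frames):                  # frames: (B, T, 3, 224, 224)
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (B*T, 512)
        h, _ = self.lstm(feats.view(B, T, -1))             # (B, T, H)
        context, _ = self.attn(h)               # attention pools the frames
        return self.head(context)               # logits over MPAA classes
```

The attention weights give the "dynamic frame prioritization" the abstract mentions: frames whose hidden states score highly dominate the pooled representation, which is plausibly what helps on borderline PG-13 vs. R cases.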
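Of the contrastive losses the paper compares, NT-Xent is the most standard. A SimCLR-style sketch follows, where each clip embedding is pulled toward its augmented view and pushed away from the other 2N-2 items in the batch; the temperature value is an assumption.

```python
# NT-Xent (normalized temperature-scaled cross entropy) loss sketch.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature: float = 0.5):
    """z1, z2: (N, D) embeddings of two views of the same N clips."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit norm
    sim = z @ z.t() / temperature                       # scaled cosine sims
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))               # drop self-similarity
    # the positive for row i is its counterpart in the other view
    targets = torch.cat([torch.arange(n, 2 * n),
                         torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```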
Related papers
- CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video [9.172799792564009]
We propose CAMP-VQA, a novel NR-VQA framework that exploits the semantic understanding capabilities of large models. Our approach introduces a quality-aware video metadata mechanism that integrates key fragments extracted from inter-frame variations. Our model consistently outperforms existing NR-VQA methods, achieving improved accuracy without the need for costly manual fine-grained annotations.
arXiv Detail & Related papers (2025-11-10T16:37:47Z) - FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning [65.42201665046505]
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail. We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract
arXiv Detail & Related papers (2025-09-28T17:59:43Z) - DyCAF-Net: Dynamic Class-Aware Fusion Network [0.0]
We introduce Dynamic Class-Aware Fusion Network (DyCAF-Net). DyCAF-Net achieves significant improvements in precision, mAP@50, and mAP@50-95 across 13 diverse benchmarks. Its adaptability to scale variance, semantic overlaps, and class imbalance positions it as a robust solution for real-world detection tasks.
arXiv Detail & Related papers (2025-08-05T16:06:26Z) - A Framework Combining 3D CNN and Transformer for Video-Based Behavior Recognition [0.0]
We propose a hybrid framework combining 3D CNN and Transformer architectures. The 3D CNN module extracts low-level spatio-temporal features, while the Transformer module captures long-range temporal dependencies. The proposed model outperforms traditional 3D CNNs and standalone Transformers, achieving higher recognition accuracy with manageable complexity.
arXiv Detail & Related papers (2025-08-02T07:33:29Z) - Benefits of Feature Extraction and Temporal Sequence Analysis for Video Frame Prediction: An Evaluation of Hybrid Deep Learning Models [3.7049613588433497]
Video frame prediction has critical applications in weather forecasting and autonomous systems. This paper evaluates several hybrid deep learning approaches that combine the feature extraction capabilities of autoencoders with temporal sequence modelling.
arXiv Detail & Related papers (2025-07-28T10:07:00Z) - Embedding-based Retrieval in Multimodal Content Moderation [20.899256623912933]
We introduce an Embedding-Based Retrieval (EBR) method designed to complement traditional classification approaches. EBR increases action rates by 10.32% and reduces operational costs by over 80%, while also enhancing interpretability and flexibility.
arXiv Detail & Related papers (2025-06-30T19:11:25Z) - ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z) - Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on the Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z) - CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z) - Switchable Representation Learning Framework with Self-compatibility [50.48336074436792]
We propose a Switchable representation learning Framework with Self-Compatibility (SFSC).
SFSC generates a series of compatible sub-models with different capacities through one training process.
SFSC achieves state-of-the-art performance on the evaluated datasets.
arXiv Detail & Related papers (2022-06-16T16:46:32Z) - Imposing Consistency for Optical Flow Estimation [73.53204596544472]
Imposing consistency through proxy tasks has been shown to enhance data-driven learning.
This paper introduces novel and effective consistency strategies for optical flow estimation.
arXiv Detail & Related papers (2022-04-14T22:58:30Z) - Colar: Effective and Efficient Online Action Detection by Consulting Exemplars [102.28515426925621]
We develop an effective exemplar-consultation mechanism that first measures the similarity between a frame and exemplary frames, and then aggregates exemplary features based on the similarity weights.
Due to the complementarity from the category-level modeling, our method employs a lightweight architecture but achieves new high performance on three benchmarks.
arXiv Detail & Related papers (2022-03-02T12:13:08Z) - HRVGAN: High Resolution Video Generation using Spatio-Temporal GAN [0.0]
We propose a novel deep generative network architecture designed specifically for high-resolution video synthesis. Our approach integrates key concepts from Wasserstein Generative Adversarial Networks (WGANs). Our training objective combines a pixel-wise mean squared error loss with an adversarial loss to balance frame-level accuracy and video realism.
arXiv Detail & Related papers (2020-08-17T20:45:59Z)