Video Infringement Detection via Feature Disentanglement and Mutual
Information Maximization
- URL: http://arxiv.org/abs/2309.06877v1
- Date: Wed, 13 Sep 2023 10:53:12 GMT
- Title: Video Infringement Detection via Feature Disentanglement and Mutual
Information Maximization
- Authors: Zhenguang Liu, Xinyang Yu, Ruili Wang, Shuai Ye, Zhe Ma, Jianfeng
Dong, Sifeng He, Feng Qian, Xiaobo Zhang, Roger Zimmermann, Lei Yang
- Abstract summary: We propose to disentangle an original high-dimensional feature into multiple sub-features.
On top of the disentangled sub-features, we learn an auxiliary feature to enhance the sub-features.
Our method achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and also sets the new state-of-the-art on the VCSL benchmark dataset.
- Score: 51.206398602941405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The self-media era provides us with a tremendous number of high-quality videos. Unfortunately,
frequent video copyright infringements are now seriously damaging the interests
and enthusiasm of video creators. Identifying infringing videos is therefore a
compelling task. Current state-of-the-art methods tend to simply feed
high-dimensional mixed video features into deep neural networks and count on
the networks to extract useful representations. Despite its simplicity, this
paradigm heavily relies on the original entangled features and lacks
constraints guaranteeing that useful task-relevant semantics are extracted from
the features.
In this paper, we seek to tackle the above challenges from two aspects: (1)
We propose to disentangle an original high-dimensional feature into multiple
exclusive, lower-dimensional sub-features. We expect the sub-features to encode
non-overlapping semantics of the original feature and to remove redundant
information.
(2) On top of the disentangled sub-features, we further learn an auxiliary
feature to enhance the sub-features. We theoretically analyze the mutual
information between the label and the disentangled features, arriving at a loss
that maximizes the extraction of task-relevant information from the original
feature.
Extensive experiments on two large-scale benchmark datasets (i.e., SVD and
VCSL) demonstrate that our method achieves 90.1% TOP-100 mAP on the large-scale
SVD dataset and also sets the new state-of-the-art on the VCSL benchmark
dataset. Our code and model have been released at
https://github.com/yyyooooo/DMI/, hoping to contribute to the community.
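
The abstract describes the approach only at a high level, so the following is a minimal, hypothetical PyTorch sketch of the two ideas: splitting a high-dimensional feature into exclusive sub-features, and maximizing a mutual-information-style objective between paired features. The class names, dimensions, decorrelation penalty, and the InfoNCE surrogate for the mutual-information term are illustrative assumptions, not the authors' released DMI implementation (see the GitHub link above for their code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureDisentangler(nn.Module):
    """Hypothetical sketch: split one high-dimensional video feature into K
    lower-dimensional sub-features and learn an auxiliary feature on top of
    them. Dimensions and module layout are assumptions, not the DMI release."""

    def __init__(self, in_dim=2048, num_sub=4, sub_dim=256, aux_dim=256):
        super().__init__()
        # One projection head per sub-feature.
        self.sub_heads = nn.ModuleList(
            [nn.Linear(in_dim, sub_dim) for _ in range(num_sub)]
        )
        # Auxiliary feature learned from the concatenated sub-features.
        self.aux_head = nn.Linear(num_sub * sub_dim, aux_dim)

    def forward(self, x):
        subs = [F.normalize(head(x), dim=-1) for head in self.sub_heads]
        aux = F.normalize(self.aux_head(torch.cat(subs, dim=-1)), dim=-1)
        return subs, aux


def decorrelation_loss(subs):
    """Push sub-features toward non-overlapping semantics by penalizing
    pairwise cosine similarity (an assumed proxy for the exclusivity
    constraint described in the abstract)."""
    loss = 0.0
    for i in range(len(subs)):
        for j in range(i + 1, len(subs)):
            loss = loss + (subs[i] * subs[j]).sum(dim=-1).abs().mean()
    return loss


def info_nce_loss(query, key, temperature=0.07):
    """InfoNCE lower bound on the mutual information between paired features,
    used here as a tractable surrogate for the MI-maximization objective."""
    logits = query @ key.t() / temperature            # (B, B) similarities
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)


# Usage sketch on random tensors standing in for per-video features:
# feat_a could be query videos, feat_b their matching (copied) versions.
model = FeatureDisentangler()
feat_a, feat_b = torch.randn(8, 2048), torch.randn(8, 2048)

subs_a, aux_a = model(feat_a)
subs_b, aux_b = model(feat_b)

loss = info_nce_loss(aux_a, aux_b) + 0.1 * decorrelation_loss(subs_a)
loss.backward()
```

In practice the InfoNCE term would be computed between a query video and its labeled copies, and the relative weighting of the two terms here is an arbitrary choice for illustration.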
Related papers
- EVC-MF: End-to-end Video Captioning Network with Multi-scale Features [13.85795110061781]
We propose an end-to-end encoder-decoder-based network (EVC-MF) for video captioning.
It efficiently utilizes multi-scale visual and textual features to generate video descriptions.
The results demonstrate that EVC-MF yields competitive performance compared with the state-of-the-art methods.
arXiv Detail & Related papers (2024-10-22T02:16:02Z) - SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Spatio-temporal Prompting Network for Robust Video Feature Extraction [74.54597668310707]
Frame quality deterioration is one of the main challenges in the field of video understanding.
Recent approaches exploit transformer-based integration modules to obtain spatio-temporal information.
We present a neat and unified framework called Spatio-Temporal Prompting Network (STPN).
It can efficiently extract video features by adjusting the input features in the network backbone.
arXiv Detail & Related papers (2024-02-04T17:52:04Z) - CapST: An Enhanced and Lightweight Model Attribution Approach for
Synthetic Videos [9.209808258321559]
This paper investigates the model attribution problem of Deepfake videos from a recently proposed dataset, Deepfakes from Different Models (DFDM).
The dataset comprises 6,450 Deepfake videos generated by five distinct models with variations in encoder, decoder, intermediate layer, input resolution, and compression ratio.
Experimental results on the deepfake benchmark dataset (DFDM) demonstrate the efficacy of our proposed method, achieving up to a 4% improvement in accurately categorizing deepfake videos.
arXiv Detail & Related papers (2023-11-07T08:05:09Z) - Few-Shot Video Object Detection [70.43402912344327]
We introduce Few-Shot Video Object Detection (FSVOD) with three important contributions.
FSVOD-500 comprises 500 classes with class-balanced videos in each category for few-shot learning.
Our TPN and TMN+ are jointly and end-to-end trained.
arXiv Detail & Related papers (2021-04-30T07:38:04Z) - CompFeat: Comprehensive Feature Aggregation for Video Instance
Segmentation [67.17625278621134]
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video.
Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects.
We propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information.
arXiv Detail & Related papers (2020-12-07T00:31:42Z) - Hybrid-Attention Guided Network with Multiple Resolution Features for
Person Re-Identification [30.285126447140254]
We present a novel person re-ID model that fuses high- and low-level embeddings to reduce the information loss incurred when learning high-level features.
We also introduce spatial and channel attention mechanisms in our model, which aim to mine more discriminative features related to the target.
arXiv Detail & Related papers (2020-09-16T08:12:42Z) - Not 3D Re-ID: a Simple Single Stream 2D Convolution for Robust Video
Re-identification [14.785070524184649]
Video-based Re-ID is an expansion of earlier image-based re-identification methods.
We show superior performance from a simple single stream 2D convolution network leveraging the ResNet50-IBN architecture.
Our approach uses best video Re-ID practice and transfer learning between datasets to outperform existing state-of-the-art approaches.
arXiv Detail & Related papers (2020-08-14T12:19:32Z) - Attribute-aware Identity-hard Triplet Loss for Video-based Person
Re-identification [51.110453988705395]
Video-based person re-identification (Re-ID) is an important computer vision task.
We introduce a new metric learning method called Attribute-aware Identity-hard Triplet Loss (AITL).
To achieve a complete model of video-based person Re-ID, a multi-task framework with Attribute-driven Spatio-Temporal Attention (ASTA) mechanism is also proposed.
arXiv Detail & Related papers (2020-06-13T09:15:38Z) - An Attention-Based Deep Learning Model for Multiple Pedestrian
Attributes Recognition [4.6898263272139795]
This paper provides a novel solution to the problem of automatic characterization of pedestrians in surveillance footage.
We propose a multi-task deep model that uses an element-wise multiplication layer to extract more comprehensive feature representations.
Our experiments were performed on two well-known datasets (RAP and PETA) and point to the superiority of the proposed method with respect to the state-of-the-art.
arXiv Detail & Related papers (2020-04-02T16:21:14Z)