Top 1 Solution of QQ Browser 2021 AI Algorithm Competition Track 1: Multimodal Video Similarity
- URL: http://arxiv.org/abs/2111.01677v1
- Date: Sat, 30 Oct 2021 15:38:04 GMT
- Title: Top 1 Solution of QQ Browser 2021 AI Algorithm Competition Track 1: Multimodal Video Similarity
- Authors: Zhuoran Ma, Majing Lou, Xuan Ouyang
- Abstract summary: We describe the solution to the QQ Browser 2021 AI Algorithm Competition (AIAC) Track 1.
In the pretrain phase, we train the model on three tasks: (1) Video Tag Classification (VTC), (2) Mask Language Modeling (MLM), and (3) Mask Frame Modeling (MFM).
In the finetune phase, we train the model on video similarity based on rank-normalized human labels.
Our full pipeline, after ensembling several models, scores 0.852 on the leaderboard, with which we achieved 1st place in the competition.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we describe the solution to the QQ Browser 2021 AI
Algorithm Competition (AIAC) Track 1. We use a multi-modal transformer model
for video embedding extraction. In the pretrain phase, we train the model on
three tasks: (1) Video Tag Classification (VTC), (2) Mask Language Modeling
(MLM), and (3) Mask Frame Modeling (MFM). In the finetune phase, we train the
model on video similarity based on rank-normalized human labels. Our full
pipeline, after ensembling several models, scores 0.852 on the leaderboard,
with which we achieved 1st place in the competition. The source code has been
released on GitHub.
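As a rough illustration of the pipeline the abstract describes, the pretrain phase can be read as a weighted sum of three losses over a multimodal transformer's outputs, and the finetune phase as regression against rank-normalized human similarity labels. The sketch below assumes a generic PyTorch setup; every tensor key, function name, and loss weight is hypothetical and not taken from the released code.

```python
import torch
import torch.nn.functional as F

def pretrain_loss(out, batch, w_vtc=1.0, w_mlm=1.0, w_mfm=1.0):
    """Combine the three pretrain tasks (VTC, MLM, MFM). `out` holds the
    model's head outputs and `batch` the targets; all keys are hypothetical."""
    # (1) Video Tag Classification: multi-label BCE over the tag vocabulary.
    vtc = F.binary_cross_entropy_with_logits(out["tag_logits"], batch["tags"])

    # (2) Mask Language Modeling: cross-entropy at masked text positions
    #     (ignore_index=-100 skips positions that were not masked).
    mlm = F.cross_entropy(out["mlm_logits"].flatten(0, 1),
                          batch["mlm_labels"].flatten(), ignore_index=-100)

    # (3) Mask Frame Modeling: regress the features of masked frames; a
    #     contrastive (NCE-style) objective is another common choice here.
    m = batch["frame_mask"].unsqueeze(-1).float()
    mfm = F.mse_loss(out["frame_preds"] * m, batch["frame_feats"] * m)

    return w_vtc * vtc + w_mlm * mlm + w_mfm * mfm

def rank_normalize(scores: torch.Tensor) -> torch.Tensor:
    """Replace raw human similarity labels with their rank scaled to [0, 1],
    so the finetune target reflects the labels' ordering, not their scale."""
    ranks = scores.argsort().argsort().float()
    return ranks / max(len(scores) - 1, 1)

def finetune_loss(emb_a, emb_b, human_scores):
    """Finetune phase sketch: cosine similarity between the two videos'
    embeddings, regressed against the rank-normalized labels."""
    sim = F.cosine_similarity(emb_a, emb_b)  # (batch,) given (batch, dim)
    return F.mse_loss(sim, rank_normalize(human_scores))
```

Rank normalization is one plausible reading of "rank normalized human labels": it makes the regression target invariant to each annotator's scale. The released code is the authoritative reference for the exact recipe.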
Related papers
- Predicting Brain Responses To Natural Movies With Multimodal LLMs
We present MedARC's team solution to the Algonauts 2025 challenge. Our pipeline leveraged rich multimodal representations from various state-of-the-art pretrained models across video (V-JEPA2), speech (Whisper), text (Llama 3.2), vision-text (InternVL3), and vision-text-audio (Qwen2.5-Omni). Our final submission achieved a mean Pearson's correlation of 0.2085 on the test split of withheld out-of-distribution movies, placing our team in fourth place in the competition.
arXiv Detail & Related papers (2025-07-26T13:57:08Z)
- OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions
We develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. We also propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). Our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-06-29T18:43:00Z)
- 1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation
We integrate the strengths of leading RVOS models to build an effective paradigm.
To improve the consistency and quality of masks, we propose a Two-Stage Multi-Model Fusion strategy.
Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st on Track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023).
arXiv Detail & Related papers (2024-01-01T04:24:48Z)
- Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors
We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level.
We introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects.
We generate synthetic abnormal events to augment the training videos, and task the masked AE model to jointly reconstruct the original frames.
arXiv Detail & Related papers (2023-06-21T06:18:05Z)
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z)
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks, including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z)
- Class-attention Video Transformer for Engagement Intensity Prediction
CavT is a method to uniformly perform end-to-end learning on variable-length long videos and fixed-length short videos.
CavT achieves state-of-the-art MSE (0.0495) on the EmotiW-EP dataset, and state-of-the-art MSE (0.0377) on the DAiSEE dataset.
arXiv Detail & Related papers (2022-08-12T01:21:30Z)
- ViViT: A Video Vision Transformer
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers (see the tokenization sketch after this list).
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where the local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain a deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
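On the ViViT entry above: its "spatio-temporal tokens" are typically obtained by embedding non-overlapping 3D patches (tubelets). The sketch below illustrates that general idea under that assumption; it is not the authors' implementation, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Illustrative spatio-temporal tokenizer in the spirit of ViViT: a 3D
    convolution whose kernel and stride equal the tubelet size turns a video
    into a sequence of embedded, non-overlapping 3D patches."""

    def __init__(self, dim=768, tubelet=(2, 16, 16), in_ch=3):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                # video: (B, C, T, H, W)
        x = self.proj(video)                 # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, num_tokens, dim)

# e.g. a 32-frame 224x224 clip -> (32/2) * (224/16)**2 = 3136 tokens
tokens = TubeletEmbedding()(torch.randn(1, 3, 32, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 768])
```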