Team PKU-WICT-MIPL PIC Makeup Temporal Video Grounding Challenge 2022 Technical Report
- URL: http://arxiv.org/abs/2207.02687v1
- Date: Wed, 6 Jul 2022 13:50:34 GMT
- Title: Team PKU-WICT-MIPL PIC Makeup Temporal Video Grounding Challenge 2022 Technical Report
- Authors: Minghang Zheng, Dejie Yang, Zhongjie Ye, Ting Lei, Yuxin Peng and Yang Liu
- Abstract summary: We propose a phrase relationship mining framework to exploit the temporal localization relationship between the fine-grained phrases and the whole sentence.
Besides, we propose to constrain the localization results of different step sentence queries so that they do not overlap with each other.
Our final submission ranked 2nd on the leaderboard, with only a 0.55% gap from the first.
- Score: 42.49264486550348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this technical report, we briefly introduce the solutions of our team 'PKU-WICT-MIPL' for the PIC Makeup Temporal Video Grounding (MTVG) Challenge in ACM-MM 2022. Given an untrimmed makeup video and a step query, MTVG aims to localize the temporal moment of the target makeup step in the video. To tackle this task, we propose a phrase relationship mining framework to exploit the temporal localization relationship between the fine-grained phrases and the whole sentence. Besides, we propose to constrain the localization results of different step sentence queries so that they do not overlap with each other, using a dynamic programming algorithm. The experimental results demonstrate the effectiveness of our method. Our final submission ranked 2nd on the leaderboard, with only a 0.55% gap from the first.
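As an illustration of the non-overlap constraint, here is a minimal dynamic-programming sketch. It assumes each step query yields a list of (start, end, score) candidate segments and that steps occur in temporal order; the function name and these assumptions are ours, not details from the report.

```python
from typing import List, Tuple

def assign_non_overlapping(
    candidates: List[List[Tuple[float, float, float]]],
) -> List[Tuple[float, float]]:
    """Pick one (start, end, score) candidate per step query so that the
    chosen segments are ordered and non-overlapping, maximizing total score."""
    NEG = float("-inf")
    n = len(candidates)
    best = [[NEG] * len(c) for c in candidates]  # best[i][j]: best total score
    back = [[-1] * len(c) for c in candidates]   # chosen candidate of query i-1
    for j, (_, _, score) in enumerate(candidates[0]):
        best[0][j] = score
    for i in range(1, n):
        for j, (start_j, _, score_j) in enumerate(candidates[i]):
            for k, (_, end_k, _) in enumerate(candidates[i - 1]):
                # enforce step order and the non-overlap constraint
                if end_k <= start_j and best[i - 1][k] + score_j > best[i][j]:
                    best[i][j] = best[i - 1][k] + score_j
                    back[i][j] = k
    # backtrack the best assignment from the last query
    j = max(range(len(candidates[-1])), key=lambda c: best[-1][c])
    chosen = []
    for i in range(n - 1, -1, -1):
        start, end, _ = candidates[i][j]
        chosen.append((start, end))
        j = back[i][j]
    return chosen[::-1]
```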
Related papers
- 1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation [81.50620771207329]
We investigate the effectiveness of static-dominant data and frame sampling on referring video object segmentation (RVOS).
Our solution achieves a J&F score of 0.5447 in the competition phase and ranks 1st in the MeViS track of the PVUW Challenge.
arXiv Detail & Related papers (2024-06-11T08:05:26Z)
- The Runner-up Solution for YouTube-VIS Long Video Challenge 2022 [72.13080661144761]
We adopt the previously proposed online video instance segmentation method IDOL for this challenge.
We use pseudo labels to further help contrastive learning, so as to obtain more temporally consistent instance embeddings.
The proposed method obtains 40.2 AP on the YouTube-VIS 2022 long video dataset and was ranked second in this challenge.
arXiv Detail & Related papers (2022-11-18T01:40:59Z)
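A minimal sketch of the pseudo-label contrastive idea summarized above, in a SupCon style: embeddings from different frames that share a pseudo instance ID are pulled together, all other pairs pushed apart. Names and shapes are illustrative assumptions, not IDOL's actual code.

```python
import torch
import torch.nn.functional as F

def pseudo_label_contrastive_loss(
    embeddings: torch.Tensor,  # (N, D) instance embeddings across frames
    pseudo_ids: torch.Tensor,  # (N,) pseudo instance IDs, e.g. from tracking
    temperature: float = 0.07,
) -> torch.Tensor:
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                      # pairwise similarities
    same = pseudo_ids.unsqueeze(0) == pseudo_ids.unsqueeze(1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = same & ~eye                                  # positives: same pseudo ID
    # log-softmax over all other samples, averaged over positive pairs
    denom = torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    log_prob = sim - denom
    pos_counts = pos.sum(1).clamp(min=1)
    loss = -(log_prob * pos).sum(1) / pos_counts
    return loss[pos.sum(1) > 0].mean()                 # anchors with >=1 positive
```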
- Exploiting Feature Diversity for Make-up Temporal Video Grounding [15.358540603177547]
This report presents the 3rd-place solution for MTVG, a new task introduced in the 4th Person in Context (PIC) Challenge at ACM MM 2022.
MTVG aims at localizing the temporal boundary of the step in an untrimmed video based on a textual description.
arXiv Detail & Related papers (2022-08-12T09:03:25Z)
- ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022 [61.81899056005645]
Given a video clip and a text query, the goal of this challenge is to locate a temporal moment of the video clip where the answer to the query can be obtained.
We propose a multi-scale cross-modal transformer and a video frame-level contrastive loss to fully uncover the correlation between language queries and video clips.
The experimental results demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2022-07-01T12:48:35Z)
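The entry above mentions both a multi-scale cross-modal transformer and a video frame-level contrastive loss; here is a minimal sketch of the latter only, assuming frames inside the annotated moment act as positives for the query embedding. All names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def frame_level_contrastive_loss(
    frame_feats: torch.Tensor,  # (T, D) per-frame video features
    query_feat: torch.Tensor,   # (D,) sentence query embedding
    inside: torch.Tensor,       # (T,) bool mask, True inside the GT moment
    temperature: float = 0.1,
) -> torch.Tensor:
    v = F.normalize(frame_feats, dim=1)
    q = F.normalize(query_feat, dim=0)
    logits = v @ q / temperature                 # (T,) frame-query similarities
    # InfoNCE-style: each positive frame competes against all frames
    log_prob = logits - torch.logsumexp(logits, dim=0)
    return -log_prob[inside].mean()              # assumes >=1 positive frame
```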
- Technical Report for CVPR 2022 LOVEU AQTC Challenge [3.614550981030065]
This report presents the 2nd-place model for AQTC, a task newly introduced in the CVPR 2022 LOng-form VidEo Understanding (LOVEU) challenge.
This challenge poses difficulties with multi-step answers, multi-modal inputs, and diverse and changing button representations in videos.
We propose a new context-grounding module with an attention mechanism for more effective feature mapping.
arXiv Detail & Related papers (2022-06-29T12:07:43Z)
- Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022 [72.12974259966592]
We present our approach for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022.
We first parse sentences into semantic roles corresponding to verbs and nouns, then utilize self-attention to exploit semantic-role-contextualized video features.
arXiv Detail & Related papers (2022-06-29T03:24:43Z)
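A minimal sketch of the role-contextualization step described above, assuming the verb- and noun-role embeddings have already been pooled from an upstream semantic-role parse; the module name and shapes are hypothetical, not the authors' code.

```python
import torch
import torch.nn as nn

class RoleContextualizer(nn.Module):
    """Self-attend role tokens together with clip features so the video
    features become contextualized by the semantic roles."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(
        self,
        video: torch.Tensor,  # (B, T, D) clip features
        verb: torch.Tensor,   # (B, D) pooled verb-role embedding
        noun: torch.Tensor,   # (B, D) pooled noun-role embedding
    ) -> torch.Tensor:
        # Prepend one verb token and one noun token to the video tokens.
        tokens = torch.cat([verb.unsqueeze(1), noun.unsqueeze(1), video], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)
        return out[:, 2:]  # role-contextualized video tokens only
```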
- Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video [53.69956349097428]
Given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence.
We propose a two-stage model to tackle this problem in a coarse-to-fine manner.
arXiv Detail & Related papers (2020-01-25T13:07:43Z)
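For intuition, a generic coarse-to-fine grounding procedure of the kind summarized above might look like the following sketch: stage one scores coarse sliding windows against the query, stage two re-scores finer sub-windows inside the best coarse window. Window sizes, names, and the similarity scoring are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_ground(
    frame_feats: torch.Tensor,  # (T, D) per-frame features
    query_feat: torch.Tensor,   # (D,) sentence embedding
    coarse: int = 32,           # coarse window length (frames)
    fine: int = 8,              # fine window length (frames)
):
    v = F.normalize(frame_feats, dim=1)
    q = F.normalize(query_feat, dim=0)
    scores = v @ q                       # (T,) frame-query similarity scores

    def best_window(lo: int, hi: int, width: int) -> int:
        starts = range(lo, max(lo + 1, hi - width + 1))
        return max(starts, key=lambda s: scores[s:s + width].mean().item())

    s1 = best_window(0, len(scores), coarse)                    # coarse stage
    s2 = best_window(s1, min(len(scores), s1 + coarse), fine)   # fine stage
    return s2, s2 + fine
```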
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.