Exploiting Feature Diversity for Make-up Temporal Video Grounding
- URL: http://arxiv.org/abs/2208.06179v1
- Date: Fri, 12 Aug 2022 09:03:25 GMT
- Title: Exploiting Feature Diversity for Make-up Temporal Video Grounding
- Authors: Xiujun Shu, Wei Wen, Taian Guo, Sunan He, Chen Wu, Ruizhi Qiao
- Abstract summary: This report presents the 3rd winning solution for MTVG, a new task introduced in the 4th Person in Context (PIC) Challenge at ACM MM 2022.
MTVG aims at localizing the temporal boundary of a step in an untrimmed video based on a textual description.
- Score: 15.358540603177547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This technical report presents the 3rd winning solution for MTVG, a new task introduced in the 4th Person in Context (PIC) Challenge at ACM MM 2022. MTVG aims at localizing the temporal boundary of a step in an untrimmed video based on a textual description. The biggest challenge of this task is the fine-grained video-text semantics of make-up steps. However, current methods mainly extract video features using action-based pre-trained models. As actions are more coarse-grained than make-up steps, action-based features are not sufficient to provide fine-grained cues. To address this issue, we propose to achieve fine-grained representation by exploiting feature diversity. Specifically, we propose a series of methods spanning feature extraction, network optimization, and model ensembling. As a result, we achieved 3rd place in the MTVG competition.
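To make the ensemble step concrete, below is a minimal Python sketch of late-fusing temporal grounding predictions from models trained on diverse feature backbones. The report does not specify the fusion mechanism, so the segment clustering, the score averaging, and all function names here are illustrative assumptions rather than the authors' exact method.

    # Minimal late-fusion sketch for temporal grounding predictions.
    # Assumption: each base model (trained on a different feature backbone,
    # e.g. an action-based one and a more fine-grained one) returns candidate
    # segments (start_sec, end_sec, confidence) for the same text query.
    # The clustering/averaging scheme below is illustrative, not the exact
    # ensemble described in the report.

    def temporal_iou(a, b):
        """Temporal IoU between two (start, end) segments in seconds."""
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    def ensemble_segments(predictions_per_model, iou_thresh=0.75):
        """Cluster near-duplicate segments across models, average their
        scores, and return fused segments sorted best-first."""
        clusters = []  # each item: [(start, end), [scores...]]
        for model_preds in predictions_per_model:
            for start, end, score in model_preds:
                for cluster in clusters:
                    if temporal_iou(cluster[0], (start, end)) >= iou_thresh:
                        cluster[1].append(score)
                        break
                else:
                    clusters.append([(start, end), [score]])
        fused = [(seg[0], seg[1], sum(s) / len(s)) for seg, s in clusters]
        return sorted(fused, key=lambda x: x[2], reverse=True)

    # Hypothetical outputs from two models built on different video features.
    model_a = [(12.0, 25.5, 0.81), (40.0, 52.0, 0.35)]
    model_b = [(11.5, 26.0, 0.77), (60.0, 70.0, 0.20)]
    print(ensemble_segments([model_a, model_b])[0])  # best fused (start, end, score)

In practice the per-model predictions would come from grounding networks fed with different pre-trained features, which is the kind of feature diversity the abstract refers to.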
Related papers
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z) - Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework [33.46782517803435]
Make-Your-Anchor is a system requiring only a one-minute video clip of an individual for training.
We finetune a proposed structure-guided diffusion model on input video to render 3D mesh conditions into human appearances.
A novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos.
arXiv Detail & Related papers (2024-03-25T07:54:18Z) - SEINE: Short-to-Long Video Diffusion Model for Generative Transition and
Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z) - Dual-Path Temporal Map Optimization for Make-up Temporal Video Grounding [34.603577827106875]
Make-up temporal video grounding aims to localize the target video segment which is semantically related to a sentence describing a make-up activity, given a long video.
Existing general approaches cannot locate the target activity effectively.
We propose an effective proposal-based framework named Dual-Path Temporal Map Optimization Network (DPTMO) to capture fine-grained multimodal semantic details of make-up activities.
arXiv Detail & Related papers (2023-09-12T12:43:50Z) - Technical Report for Ego4D Long Term Action Anticipation Challenge 2023 [0.0]
We describe the technical details of our approach for the Ego4D Long-Term Action Anticipation Challenge 2023.
The aim of this task is to predict a sequence of future actions that will take place at an arbitrary time or later, given an input video.
Our method outperformed the baseline and was the second-place solution on the public leaderboard.
arXiv Detail & Related papers (2023-07-04T04:12:49Z) - Team PKU-WICT-MIPL PIC Makeup Temporal Video Grounding Challenge 2022 Technical Report [42.49264486550348]
We propose a phrase relationship mining framework to exploit the temporal localization relationship between fine-grained phrases and the whole sentence.
Besides, we constrain the localization results of different step sentence queries so that they do not overlap with each other (a simple greedy sketch of such a constraint follows this entry).
Our final submission ranked 2nd on the leaderboard, with only a 0.55% gap from the first place.
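As a concrete illustration of such a non-overlap constraint, the greedy post-processing sketch below assigns at most one segment per step query and rejects candidates that collide with already accepted steps. This is an assumed simplification for illustration only, not the PKU-WICT-MIPL team's actual formulation.

    # Illustrative greedy assignment: one segment per step query, no overlaps.
    # Assumed simplification, not the team's actual method.
    def non_overlapping_assignment(step_candidates):
        """step_candidates: dict of step_id -> list of (start, end, score)
        tuples sorted by descending score. Returns step_id -> chosen segment
        (or None when every candidate conflicts with accepted segments)."""
        accepted, occupied = {}, []
        # Handle the steps with the most confident top candidate first.
        order = sorted(step_candidates, key=lambda s: -step_candidates[s][0][2])
        for step in order:
            chosen = None
            for start, end, _ in step_candidates[step]:
                if all(end <= s or start >= e for s, e in occupied):
                    chosen = (start, end)
                    occupied.append(chosen)
                    break
            accepted[step] = chosen
        return accepted

    steps = {
        "apply foundation": [(5.0, 20.0, 0.9), (25.0, 40.0, 0.4)],
        "blend eyeshadow": [(15.0, 35.0, 0.8), (22.0, 38.0, 0.6)],
    }
    print(non_overlapping_assignment(steps))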
arXiv Detail & Related papers (2022-07-06T13:50:34Z) - Video2StyleGAN: Encoding Video in Latent Space for Manipulation [63.03250800510085]
We propose a novel network to encode face videos into the latent space of StyleGAN for semantic face video manipulation.
Our approach can significantly outperform existing single image methods, while achieving real-time (66 fps) speed.
arXiv Detail & Related papers (2022-06-27T06:48:15Z) - Compositional Video Synthesis with Action Graphs [112.94651460161992]
Videos of actions are complex signals containing rich compositional structure in space and time.
We propose to represent the actions in a graph structure called Action Graph and present the new "Action Graph To Video" synthesis task.
Our generative model for this task (AG2Vid) disentangles motion and appearance features, and by incorporating a scheduling mechanism for actions facilitates a timely and coordinated video generation.
arXiv Detail & Related papers (2020-06-27T09:39:04Z) - YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in
Domain-Specific Videos [60.62475495522428]
The goal of the YouMakeup VQA Challenge 2020 is to provide a common benchmark for fine-grained action understanding in domain-specific videos.
We propose two novel question-answering tasks to evaluate models' fine-grained action understanding abilities.
arXiv Detail & Related papers (2020-04-12T09:25:36Z)