Dual-Path Temporal Map Optimization for Make-up Temporal Video Grounding
- URL: http://arxiv.org/abs/2309.06176v1
- Date: Tue, 12 Sep 2023 12:43:50 GMT
- Title: Dual-Path Temporal Map Optimization for Make-up Temporal Video Grounding
- Authors: Jiaxiu Li, Kun Li, Jia Li, Guoliang Chen, Dan Guo, Meng Wang
- Abstract summary: Make-up temporal video grounding aims to localize the target video segment which is semantically related to a sentence describing a make-up activity, given a long video.
Existing general approaches cannot locate the target activity effectively.
We propose an effective proposal-based framework named Dual-Path Temporal Map Optimization Network (DPTMO) to capture fine-grained multimodal semantic details of make-up activities.
- Score: 34.603577827106875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Make-up temporal video grounding (MTVG) aims to localize the target video
segment which is semantically related to a sentence describing a make-up
activity, given a long video. Compared with the general video grounding task,
MTVG focuses on meticulous actions and changes on the face. The make-up
instruction step, usually involving detailed differences in products and facial
areas, is more fine-grained than general activities (e.g., cooking and
furniture assembly). Thus, existing general approaches cannot locate the target
activity effectively. More specifically, existing proposal generation modules
do not yet provide rich enough semantic cues for such fine-grained make-up
semantic comprehension. To tackle this issue, we propose
an effective proposal-based framework named Dual-Path Temporal Map Optimization
Network (DPTMO) to capture fine-grained multimodal semantic details of make-up
activities. DPTMO extracts both query-agnostic and query-guided features to
construct two proposal sets and uses specific evaluation methods for the two
sets. Different from the commonly used single structure in previous methods,
our dual-path structure can mine more semantic information in make-up videos
and distinguish fine-grained actions well. These two candidate sets represent
the cross-modal makeup video-text similarity and multi-modal fusion
relationship, complementing each other. Each set corresponds to its respective
optimization perspective, and their joint prediction enhances the accuracy of
video timestamp prediction. Comprehensive experiments on the YouMakeup dataset
demonstrate that our proposed dual-path structure excels in fine-grained semantic
comprehension.
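To make the dual-path idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes 2D-TAN-style temporal maps in which entry (i, j) of an N x N grid represents the proposal spanning clips i through j, and the names build_2d_map and DualPathScorer, the feature dimensions, and the way the two score maps are combined are illustrative assumptions.

```python
# Illustrative dual-path scorer over a 2D temporal proposal map (hypothetical
# reconstruction; names, dimensions, and fusion choices are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_2d_map(clip_feats: torch.Tensor) -> torch.Tensor:
    """clip_feats: (B, N, D) -> (B, N, N, D) map of mean-pooled span features.

    Entry (i, j) is the mean of clips i..j; entries with j < i are invalid
    proposals and would be masked out in practice.
    """
    B, N, D = clip_feats.shape
    cum = torch.cumsum(clip_feats, dim=1)
    zero = torch.zeros(B, 1, D, device=clip_feats.device, dtype=clip_feats.dtype)
    cum = torch.cat([zero, cum], dim=1)                           # cum[:, k] = sum of clips 0..k-1
    span = cum[:, 1:].unsqueeze(1) - cum[:, :-1].unsqueeze(2)     # (B, N, N, D): sum over clips i..j
    i = torch.arange(N, device=clip_feats.device).view(N, 1)
    j = torch.arange(N, device=clip_feats.device).view(1, N)
    length = (j - i + 1).clamp(min=1).view(1, N, N, 1).to(clip_feats.dtype)
    return span / length


class DualPathScorer(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Linear(dim, dim)
        self.txt_proj = nn.Linear(dim, dim)
        self.fuse_head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, clip_feats: torch.Tensor, query_feat: torch.Tensor):
        # clip_feats: (B, N, D) clip features; query_feat: (B, D) sentence embedding
        prop = build_2d_map(clip_feats)                           # (B, N, N, D)

        # Path 1 (query-agnostic proposals): cosine similarity between every
        # proposal feature and the sentence -> cross-modal matching map.
        v = F.normalize(self.vis_proj(prop), dim=-1)
        q = F.normalize(self.txt_proj(query_feat), dim=-1)[:, None, None, :]
        sim_map = (v * q).sum(dim=-1)                             # (B, N, N)

        # Path 2 (query-guided proposals): fuse the query into every proposal
        # feature and score the fused multimodal representation.
        q_exp = query_feat[:, None, None, :].expand_as(prop)
        fused_map = self.fuse_head(torch.cat([prop, q_exp], dim=-1)).squeeze(-1)

        # Joint prediction: combine the two complementary score maps; the
        # highest-scoring (i, j) cell gives the predicted start/end clips.
        return torch.sigmoid(sim_map) * torch.sigmoid(fused_map)


# Example: score proposals for 2 videos with 32 clips of dimension 256.
scorer = DualPathScorer(dim=256)
scores = scorer(torch.randn(2, 32, 256), torch.randn(2, 256))     # (2, 32, 32)
```

One possible reading of "specific evaluation methods for the two sets" is to supervise the similarity map with a cross-modal matching objective and the fused map with an IoU-style regression target; the actual losses and architecture are those defined in the paper.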
Related papers
- Storyboard guided Alignment for Fine-grained Video Action Recognition [32.02631248389487]
Fine-grained video action recognition can be conceptualized as a video-text matching problem.
We propose a multi-granularity framework based on two observations: (i) videos with different global semantics may share similar atomic actions or appearances, and (ii) atomic actions within a video can be momentary, slow, or even not directly related to the global video semantics.
arXiv Detail & Related papers (2024-10-18T07:40:41Z) - Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding [32.117677036812836]
This paper introduces BREASE: BRidging Episodes And SEmantics for Long-Form Video Understanding.
Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels.
Second, we propose a Semantics reTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context.
arXiv Detail & Related papers (2024-08-30T17:52:55Z) - Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video
Grounding [59.599378814835205]
Temporal Video Grounding (TVG) aims to localize the temporal boundary of a specific segment in an untrimmed video based on a given language query.
We introduce a novel AMDA method to adaptively adjust the model's scene-related knowledge by incorporating insights from the target data.
arXiv Detail & Related papers (2023-12-21T07:49:27Z) - Learning Grounded Vision-Language Representation for Versatile
Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extended to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - Dual Prototype Attention for Unsupervised Video Object Segmentation [28.725754274542304]
Unsupervised video object segmentation (VOS) aims to detect and segment the most salient object in videos.
This paper proposes two novel prototype-based attention mechanisms: inter-modality attention (IMA) and inter-frame attention (IFA).
arXiv Detail & Related papers (2022-11-22T06:19:17Z) - Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed as DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z) - Modeling Motion with Multi-Modal Features for Text-Based Video
Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z) - Unsupervised Temporal Video Grounding with Deep Semantic Clustering [58.95918952149763]
Temporal video grounding aims to localize a target segment in a video according to a given sentence query.
In this paper, we explore whether a video grounding model can be learned without any paired annotations.
Considering there is no paired supervision, we propose a novel Deep Semantic Clustering Network (DSCNet) to leverage all semantic information from the whole query set.
arXiv Detail & Related papers (2022-01-14T05:16:33Z) - A Simple Yet Effective Method for Video Temporal Grounding with
Cross-Modality Attention [31.218804432716702]
The task of language-guided video temporal grounding is to localize the particular video clip corresponding to a query sentence in an untrimmed video.
We propose a simple two-branch Cross-Modality Attention (CMA) module with an intuitive structure design (an illustrative sketch of a generic two-branch cross-modal attention block appears after this list).
In addition, we introduce a new task-specific regression loss function, which improves the temporal grounding accuracy by alleviating the impact of annotation bias.
arXiv Detail & Related papers (2020-09-23T16:03:00Z) - Fine-grained Iterative Attention Network for TemporalLanguage
Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)