EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video
Grounding with Multimodal Large Language Model
- URL: http://arxiv.org/abs/2312.02483v2
- Date: Wed, 6 Mar 2024 08:23:39 GMT
- Title: EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video
Grounding with Multimodal Large Language Model
- Authors: Guozhang Li, Xinpeng Ding, De Cheng, Jie Li, Nannan Wang and Xinbo Gao
- Abstract summary: We propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries.
Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multimodal large language models (MLLMs) to annotate each frame within initial pseudo boundaries.
- Score: 63.93372634950661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Early weakly supervised video grounding (WSVG) methods often struggle with
incomplete boundary detection due to the absence of temporal boundary
annotations. To bridge the gap between video-level and boundary-level
annotation, explicit-supervision methods, i.e., generating pseudo-temporal
boundaries for training, have achieved great success. However, data
augmentations in these methods might disrupt critical temporal information,
yielding poor pseudo boundaries. In this paper, we propose a new perspective
that maintains the integrity of the original temporal content while introducing
more valuable information for expanding the incomplete boundaries. To this end,
we propose EtC (Expand then Clarify): we first use the additional information to
expand the initial incomplete pseudo boundaries, and subsequently refine these
expanded ones into precise boundaries. Motivated by video continuity,
i.e., visual similarity across adjacent frames, we use powerful multimodal
large language models (MLLMs) to annotate each frame within initial pseudo
boundaries, yielding more comprehensive descriptions for expanded boundaries.
To further suppress the noise in the expanded boundaries, we combine mutual
learning with a tailored proposal-level contrastive objective, learning to
balance the incomplete yet clean (initial) boundaries against the comprehensive
yet noisy (expanded) ones to obtain more precise boundaries.
Experiments demonstrate the superiority of our method on two challenging WSVG
datasets.
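
To make the two-stage idea concrete, below is a minimal, illustrative sketch of the "Expand" step under assumed interfaces; it is not the authors' released implementation. The helpers `mllm_caption` (frame-to-text via any off-the-shelf MLLM), `text_sim` (sentence-embedding cosine similarity), and the threshold `tau` are hypothetical names introduced here for illustration.

```python
# Illustrative sketch of the "Expand" step (not the paper's released code).
# Assumed hypothetical helpers: mllm_caption(frame) -> str describes one frame
# with an off-the-shelf MLLM; text_sim(a, b) -> float is a cosine similarity
# between sentence embeddings from a frozen text encoder.

from typing import Callable, List, Tuple


def expand_boundary(
    frames: List,                           # decoded video frames
    init_start: int,
    init_end: int,                          # initial (incomplete) pseudo boundary
    mllm_caption: Callable[[object], str],
    text_sim: Callable[[str, str], float],
    tau: float = 0.7,                       # similarity threshold (assumed value)
) -> Tuple[int, int]:
    """Grow the pseudo boundary outwards while adjacent frames remain
    semantically similar to the frame descriptions inside the initial boundary,
    exploiting video continuity (visual similarity across adjacent frames)."""
    inside_captions = [mllm_caption(frames[t]) for t in range(init_start, init_end + 1)]

    def similar_to_inside(t: int) -> bool:
        caption = mllm_caption(frames[t])
        return max(text_sim(caption, c) for c in inside_captions) >= tau

    start, end = init_start, init_end
    # expand leftwards while the previous frame still matches the in-boundary content
    while start > 0 and similar_to_inside(start - 1):
        start -= 1
    # expand rightwards symmetrically
    while end < len(frames) - 1 and similar_to_inside(end + 1):
        end += 1
    return start, end
```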
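The "Clarify" step can likewise be sketched as a proposal-level contrastive objective in which a learnable weight trades off the clean initial proposal against the noisy expanded one. The exact loss, mutual-learning branches, and hyperparameters in the paper may differ; `alpha`, `temperature`, and `neg_feats` below are assumptions made for illustration.

```python
# Illustrative sketch of a proposal-level contrastive "Clarify" objective
# (not the paper's exact loss). alpha is an assumed learnable scalar that
# balances the clean initial proposal against the noisy expanded proposal;
# neg_feats are proposal features pooled from unrelated videos.

import torch
import torch.nn.functional as F


def proposal_feature(frame_feats: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """Mean-pool frame features inside a proposal. frame_feats: (T, D)."""
    return frame_feats[start:end + 1].mean(dim=0)


def clarify_contrastive_loss(
    frame_feats: torch.Tensor,      # (T, D) visual features of one video
    query_feat: torch.Tensor,       # (D,) sentence feature of the language query
    init_bd: tuple,                 # (start, end) of the initial boundary
    exp_bd: tuple,                  # (start, end) of the expanded boundary
    neg_feats: torch.Tensor,        # (N, D) negative proposal features
    alpha: torch.Tensor,            # learnable scalar in [0, 1] (assumption)
    temperature: float = 0.07,      # assumed value
) -> torch.Tensor:
    pos_init = proposal_feature(frame_feats, *init_bd)
    pos_exp = proposal_feature(frame_feats, *exp_bd)
    # learnable balance between incomplete-but-clean and complete-but-noisy views
    pos = alpha * pos_init + (1.0 - alpha) * pos_exp

    q = F.normalize(query_feat, dim=-1)
    pos = F.normalize(pos, dim=-1)
    negs = F.normalize(neg_feats, dim=-1)

    # InfoNCE-style logits: the blended proposal is the single positive (index 0)
    logits = torch.cat([(q * pos).sum(-1, keepdim=True), negs @ q]) / temperature
    labels = torch.zeros(1, dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.unsqueeze(0), labels)
```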
Related papers
- Temporal Action Localization with Enhanced Instant Discriminability [66.76095239972094]
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video.
We propose a one-stage framework named TriDet to resolve imprecise predictions of action boundaries by existing methods.
Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets.
arXiv Detail & Related papers (2023-09-11T16:17:50Z) - Temporal Perceiver: A General Architecture for Arbitrary Boundary
Detection [48.33132632418303]
Generic Boundary Detection (GBD) aims at locating general boundaries that divide videos into semantically coherent and taxonomy-free units.
Previous research handles these different levels of generic boundaries separately, with specifically designed, complicated deep networks ranging from simple CNNs to LSTMs.
We present Temporal Perceiver, a general architecture with Transformers, offering a unified solution to the detection of arbitrary generic boundaries.
arXiv Detail & Related papers (2022-03-01T09:31:30Z) - Boundary Guided Context Aggregation for Semantic Segmentation [23.709865471981313]
We exploit boundaries as significant guidance for context aggregation to promote the overall semantic understanding of an image.
We conduct extensive experiments on the Cityscapes and ADE20K databases, and comparable results are achieved with the state-of-the-art methods.
arXiv Detail & Related papers (2021-10-27T17:04:38Z) - Internal Video Inpainting by Implicit Long-range Propagation [39.89676105875726]
We propose a novel framework for video inpainting by adopting an internal learning strategy.
We show that this can be achieved implicitly by fitting a convolutional neural network to the known region.
We extend the proposed method to another challenging task: learning to remove an object from a 4K video given a single object mask in only one frame.
arXiv Detail & Related papers (2021-08-04T08:56:28Z) - Boundary-sensitive Pre-training for Temporal Localization in Videos [124.40788524169668]
We investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task.
With the synthesized boundaries, BSP can be simply conducted via classifying the boundary types.
Extensive experiments show that the proposed BSP is superior and complementary to the existing action-classification-based pre-training counterpart.
arXiv Detail & Related papers (2020-11-21T17:46:24Z) - Reinforcement Learning for Weakly Supervised Temporal Grounding of
Natural Language in Untrimmed Videos [134.78406021194985]
We focus on the weakly supervised setting of this task, which only has access to coarse video-level language descriptions without temporal boundary annotations.
We propose a Boundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning to guide the process of progressively refining the temporal boundary.
arXiv Detail & Related papers (2020-09-18T03:32:47Z) - Flow-edge Guided Video Completion [66.49077223104533]
Previous flow completion methods are often unable to retain the sharpness of motion boundaries.
Our method first extracts and completes motion edges, and then uses them to guide piecewise-smooth flow completion with sharp edges.
arXiv Detail & Related papers (2020-09-03T17:59:42Z) - Video Region Annotation with Sparse Bounding Boxes [29.323784279321337]
We learn to automatically generate region boundaries for all frames of a video from sparsely annotated bounding boxes of target regions.
We achieve this with a Volumetric Graph Convolutional Network (VGCN), which learns to iteratively find keypoints on the region boundaries.
arXiv Detail & Related papers (2020-08-17T01:27:20Z)