Learning Grounded Vision-Language Representation for Versatile
Understanding in Untrimmed Videos
- URL: http://arxiv.org/abs/2303.06378v2
- Date: Wed, 17 May 2023 09:47:49 GMT
- Title: Learning Grounded Vision-Language Representation for Versatile
Understanding in Untrimmed Videos
- Authors: Teng Wang, Jinrui Zhang, Feng Zheng, Wenhao Jiang, Ran Cheng, Ping Luo
- Abstract summary: We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
- Score: 57.830865926459914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Joint video-language learning has received increasing attention in recent
years. However, existing works mainly focus on single or multiple trimmed video
clips (events), which makes human-annotated event boundaries necessary during
inference. To remove this dependency, we propose a grounded vision-language
learning framework for untrimmed videos, which automatically detects
informative events and effectively excavates the alignments between
multi-sentence descriptions and corresponding event segments. Instead of
coarse-level video-language alignments, we present two dual pretext tasks to
encourage fine-grained segment-level alignments, i.e., text-to-event grounding
(TEG) and event-to-text generation (ETG). TEG learns to adaptively ground the
possible event proposals given a set of sentences by estimating the cross-modal
distance in a joint semantic space. Meanwhile, ETG aims to reconstruct
(generate) the matched texts given event proposals, encouraging the event
representation to retain meaningful semantic information. To encourage accurate
label assignment between the event set and the text set, we propose a novel
semantic-aware cost to mitigate the sub-optimal matching results caused by
ambiguous boundary annotations. Our framework is easily extensible to tasks
covering visually-grounded language understanding and generation. We achieve
state-of-the-art dense video captioning performance on ActivityNet Captions,
YouCook2 and YouMakeup, and competitive performance on several other language
generation and understanding tasks. Our method also achieved 1st place in both
the MTVG and MDVC tasks of the PIC 4th Challenge. Our code is publicly
available at https://github.com/zjr2000/GVL.
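The segment-level alignment described in the abstract can be pictured with a short sketch. The snippet below is a minimal, hypothetical illustration (assuming PyTorch and SciPy) of how sentence and event-proposal embeddings in a joint space could drive text-to-event grounding via cross-modal distances, and how a semantic-aware cost could be mixed with a boundary cost before one-to-one (Hungarian) assignment between the text set and the event set. Function names, weights, and shapes are illustrative assumptions, not the authors' API; see the linked repository for the actual implementation.

```python
# Hypothetical sketch of segment-level alignment: cross-modal distances in a
# joint space (TEG signal) plus a semantic-aware matching cost that softens
# the effect of ambiguous boundary annotations. Not the authors' code.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def semantic_aware_matching(text_emb, event_emb, text_spans, event_spans,
                            w_sem=1.0, w_loc=1.0):
    """Assign each sentence to one event proposal.

    text_emb:    (S, D) sentence embeddings in the joint space
    event_emb:   (E, D) event-proposal embeddings in the joint space
    text_spans:  (S, 2) rough (start, end) annotations, normalized to [0, 1]
    event_spans: (E, 2) predicted proposal boundaries, normalized to [0, 1]
    """
    # Cross-modal distance in the joint semantic space:
    # cosine distance = 1 - cosine similarity.
    t = F.normalize(text_emb, dim=-1)
    e = F.normalize(event_emb, dim=-1)
    semantic_cost = 1.0 - t @ e.T                         # (S, E)

    # Boundary (localization) cost, e.g. L1 distance between spans.
    loc_cost = torch.cdist(text_spans, event_spans, p=1)  # (S, E)

    # Semantic-aware total cost: the semantic term keeps the assignment
    # from being dominated by noisy boundary annotations.
    cost = w_sem * semantic_cost + w_loc * loc_cost

    # One-to-one (Hungarian) assignment between sentences and proposals.
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    return list(zip(row.tolist(), col.tolist()))
```

In this sketch, the semantic term plays the role the abstract attributes to the semantic-aware cost: it prevents the label assignment from being decided purely by boundary overlap when the annotated boundaries are ambiguous.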
Related papers
- Boosting Weakly-Supervised Temporal Action Localization with Text
Information [94.48602948837664]
We propose a Text-Segment Mining (TSM) mechanism, which constructs a text description based on the action class label, and regards the text as the query to mine all class-related segments.
We also introduce a generative objective named Video-text Language Completion (VLC), which focuses on all semantic-related segments from videos to complete the text sentence.
Surprisingly, we also find that our proposed method can be seamlessly applied to existing methods and improves their performance by a clear margin.
arXiv Detail & Related papers (2023-05-01T00:07:09Z) - Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video.
Previous works ignore the sparsity dilemma in video annotations, so the dataset fails to provide contextual information between potential events and query sentences.
We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions, then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS).
To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and ...
arXiv Detail & Related papers (2023-01-15T02:04:02Z) - HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves state-of-the-art results (a minimal sketch of such a two-level objective appears after this list).
arXiv Detail & Related papers (2023-01-05T21:53:19Z) - Fine-grained Semantic Alignment Network for Weakly Supervised Temporal
Language Grounding [148.46348699343991]
Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description.
Most of the existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework.
We propose a novel candidate-free framework: Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG.
arXiv Detail & Related papers (2022-10-21T13:10:27Z) - Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z) - Self-supervised Learning for Semi-supervised Temporal Language Grounding [84.11582376377471]
Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
Previous works tackle this task either in a fully-supervised setting that requires a large amount of manual annotation, or in a weakly-supervised setting that cannot achieve satisfactory performance.
To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework.
arXiv Detail & Related papers (2021-09-23T16:29:16Z) - HANet: Hierarchical Alignment Networks for Video-Text Retrieval [15.91922397215452]
Video-text retrieval is an important yet challenging task in vision-language understanding.
Most current works simply measure the video-text similarity based on video-level and text-level embeddings.
We propose a Hierarchical Alignment Network (HANet) to align different level representations for video-text matching.
arXiv Detail & Related papers (2021-07-26T09:28:50Z) - Towards Diverse Paragraph Captioning for Untrimmed Videos [40.205433926432434]
Existing approaches mainly solve the problem in two steps: event detection and then event captioning.
We propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos.
arXiv Detail & Related papers (2021-05-30T09:28:43Z)
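For the HierVL entry above, the following is a minimal, hypothetical sketch (PyTorch assumed) of a two-level contrastive objective that aligns clips with their narrations and mean-pooled video summaries with video-level text. The symmetric InfoNCE form and the mean-pooling choice are assumptions made for illustration, not the paper's implementation.

```python
# Hypothetical two-level (clip-level + video-level) contrastive objective,
# in the spirit of hierarchical video-language alignment. Not the paper's code.
import torch
import torch.nn.functional as F


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over matched rows of a and b, both (N, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


def hierarchical_contrastive_loss(clip_emb, sent_emb, video_ids, video_text_emb):
    """clip_emb / sent_emb: (N, D) matched clip/narration pairs;
    video_ids: (N,) long tensor grouping clips into videos;
    video_text_emb: (V, D) one summary-text embedding per video."""
    # Clip-level alignment between clips and their narrations.
    clip_loss = info_nce(clip_emb, sent_emb)

    # Video-level alignment: mean-pool clip features per video and align
    # them with the video-level text embedding.
    num_videos, dim = video_text_emb.size(0), clip_emb.size(1)
    video_emb = torch.zeros(num_videos, dim, device=clip_emb.device)
    counts = torch.zeros(num_videos, device=clip_emb.device)
    video_emb.index_add_(0, video_ids, clip_emb)
    counts.index_add_(0, video_ids, torch.ones_like(video_ids, dtype=clip_emb.dtype))
    video_emb = video_emb / counts.clamp(min=1).unsqueeze(-1)
    video_loss = info_nce(video_emb, video_text_emb)

    return clip_loss + video_loss
```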