TempCLR: Temporal Alignment Representation with Contrastive Learning
- URL: http://arxiv.org/abs/2212.13738v2
- Date: Thu, 30 Mar 2023 01:42:53 GMT
- Title: TempCLR: Temporal Alignment Representation with Contrastive Learning
- Authors: Yuncong Yang, Jiawei Ma, Shiyuan Huang, Long Chen, Xudong Lin,
Guangxing Han, Shih-Fu Chang
- Abstract summary: We propose a contrastive learning framework TempCLR to compare the full video and the paragraph explicitly.
In addition to pre-training on video-paragraph pairs, our approach also generalizes to matching between video instances.
- Score: 35.12182087403215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video representation learning has been successful in video-text pre-training
for zero-shot transfer, where each sentence is trained to be close to the
paired video clips in a common feature space. For long videos, given a
paragraph of description where the sentences describe different segments of the
video, by matching all sentence-clip pairs, the paragraph and the full video
are aligned implicitly. However, such unit-level comparison may ignore global
temporal context, which inevitably limits the generalization ability. In this
paper, we propose a contrastive learning framework TempCLR to compare the full
video and the paragraph explicitly. As the video/paragraph is formulated as a
sequence of clips/sentences, under the constraint of their temporal order, we
use dynamic time warping to compute the minimum cumulative cost over
sentence-clip pairs as the sequence-level distance. To explore the temporal
dynamics, we break the consistency of temporal succession by shuffling video
clips w.r.t. temporal granularity. Then, we obtain the representations for
clips/sentences, which perceive the temporal information and thus facilitate
the sequence alignment. In addition to pre-training on video-paragraph pairs, our
approach also generalizes to matching between video instances. We
evaluate our approach on video retrieval, action step localization, and
few-shot action recognition, and achieve consistent performance gain over all
three tasks. Detailed ablation studies are provided to justify the approach
design.
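The abstract's two core ingredients, dynamic time warping as a sequence-level video-paragraph distance and clip shuffling as a source of sequence-level negatives, can be illustrated with a short sketch. The Python snippet below is a minimal illustration, not the authors' released TempCLR code: the cosine cost, the generic DTW recurrence, the InfoNCE-style softmax over shuffled candidates, and all names and hyperparameters (dtw_distance, shuffled_negatives, sequence_contrastive_loss, granularity, num_neg, tau) are assumptions made for exposition.

```python
import numpy as np


def dtw_distance(video, paragraph):
    """Minimum cumulative sentence-clip cost under a monotonic temporal order."""
    n, m = len(video), len(paragraph)
    cost = 1.0 - video @ paragraph.T  # 1 - cosine similarity (inputs L2-normalized)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance along the clips
                acc[i, j - 1],      # advance along the sentences
                acc[i - 1, j - 1],  # match and advance both
            )
    return acc[n, m]


def shuffled_negatives(video, granularity, num_neg, rng):
    """Break temporal succession by shuffling chunks of `granularity` clips."""
    chunks = [video[i:i + granularity] for i in range(0, len(video), granularity)]
    negatives = []
    for _ in range(num_neg):
        order = rng.permutation(len(chunks))
        negatives.append(np.concatenate([chunks[k] for k in order]))
    return negatives


def sequence_contrastive_loss(video, paragraph, granularity=2, num_neg=4, tau=0.1, seed=0):
    """InfoNCE-style loss: the unshuffled video should have the smallest DTW
    distance to its paragraph among all shuffled candidates (illustrative only)."""
    rng = np.random.default_rng(seed)
    candidates = [video] + shuffled_negatives(video, granularity, num_neg, rng)
    scores = np.array([-dtw_distance(c, paragraph) / tau for c in candidates])
    shifted = scores - scores.max()  # numerical stability
    return -(shifted[0] - np.log(np.exp(shifted).sum()))


# Toy usage with random L2-normalized embeddings (8 clips, 5 sentences, dim 64).
rng = np.random.default_rng(0)
clips = rng.normal(size=(8, 64))
sents = rng.normal(size=(5, 64))
clips /= np.linalg.norm(clips, axis=1, keepdims=True)
sents /= np.linalg.norm(sents, axis=1, keepdims=True)
print(sequence_contrastive_loss(clips, sents))
```

Because DTW only permits monotonic matches, a clip sequence whose temporal order has been broken tends to accumulate a larger cumulative cost against the paragraph, which is what makes shuffled copies useful as sequence-level negatives.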
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations: a sentence typically matches only a fraction of the prominent foreground video content and offers limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding [22.59291334338824]
Correlation-Guided DEtection TRansformer provides clues for query-associated video clips.
CG-DETR achieves state-of-the-art results on various benchmarks for temporal grounding.
arXiv Detail & Related papers (2023-11-15T10:22:35Z)
- Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos [39.42509966219001]
This paper studies weakly supervised sequential video understanding where the accurate time-level text-video alignment is not provided.
We use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video.
Experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin.
arXiv Detail & Related papers (2023-03-22T08:13:25Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA; an illustrative sketch of such a two-level objective appears after this list.
arXiv Detail & Related papers (2023-01-05T21:53:19Z)
- Contrastive Language-Action Pre-training for Temporal Localization [64.34349213254312]
Long-form video understanding requires approaches that are able to temporally localize activities or language.
The scarcity of temporal annotations for such tasks can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations.
We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips and language in the form of captions.
arXiv Detail & Related papers (2022-04-26T13:17:50Z)
- Controllable Augmentations for Video Representation Learning [34.79719112810065]
We propose a framework that jointly utilizes local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations.
Our framework is superior on three video benchmarks in action recognition and video retrieval, capturing more accurate temporal dynamics.
arXiv Detail & Related papers (2022-03-30T19:34:32Z)
- Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation [79.50868197788773]
We develop a more accurate weakly-supervised solution by introducing Cross-Sentence Relations Mining.
We explore two cross-sentence relational constraints: (1) temporal ordering and (2) semantic consistency among sentences in a paragraph description of video activities.
Experiments on two publicly available activity localisation datasets show the advantages of our approach over the state-of-the-art weakly supervised methods.
arXiv Detail & Related papers (2021-07-23T20:04:01Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
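The clip-level plus video-level alignment described for HierVL above can be illustrated in the same spirit. The snippet below is not the HierVL implementation; the symmetric InfoNCE form, mean-pooling for video/paragraph embeddings, the mixing weight alpha, and all function names are assumptions chosen only to show what a two-level contrastive objective can look like.

```python
import numpy as np


def _logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))


def _normalize(v):
    return v / (np.linalg.norm(v) + 1e-8)


def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE over row-aligned, L2-normalized embeddings a and b."""
    logits = a @ b.T / tau
    diag = np.arange(len(a))
    log_p_ab = logits - _logsumexp(logits, axis=1)
    log_p_ba = logits.T - _logsumexp(logits.T, axis=1)
    return -(log_p_ab[diag, diag].mean() + log_p_ba[diag, diag].mean()) / 2


def hierarchical_loss(clip_emb, sent_emb, groups, alpha=0.5):
    """Clip-level InfoNCE over aligned clip/sentence pairs plus video-level
    InfoNCE over mean-pooled video/paragraph embeddings.

    `groups` lists, per video, the row indices of its clips/sentences."""
    clip_level = info_nce(clip_emb, sent_emb)
    video_emb = np.stack([_normalize(clip_emb[idx].mean(axis=0)) for idx in groups])
    para_emb = np.stack([_normalize(sent_emb[idx].mean(axis=0)) for idx in groups])
    video_level = info_nce(video_emb, para_emb)
    return alpha * clip_level + (1 - alpha) * video_level


# Toy usage: 6 aligned clip/sentence pairs belonging to 2 videos of 3 clips each.
rng = np.random.default_rng(0)
clips = rng.normal(size=(6, 32))
sents = rng.normal(size=(6, 32))
clips /= np.linalg.norm(clips, axis=1, keepdims=True)
sents /= np.linalg.norm(sents, axis=1, keepdims=True)
print(hierarchical_loss(clips, sents, [np.arange(0, 3), np.arange(3, 6)]))
```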
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information (including all content) and is not responsible for any consequences.