Temporal Alignment Networks for Long-term Video
- URL: http://arxiv.org/abs/2204.02968v1
- Date: Wed, 6 Apr 2022 17:59:46 GMT
- Title: Temporal Alignment Networks for Long-term Video
- Authors: Tengda Han, Weidi Xie, Andrew Zisserman
- Abstract summary: We propose a temporal alignment network that ingests long-term video sequences and associated text sentences.
We train such networks from large-scale datasets, such as HowTo100M, where the associated text sentences have significant noise.
Our proposed model, trained on HowTo100M, outperforms strong baselines (CLIP, MIL-NCE) on this alignment dataset.
- Score: 103.69904379356413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of this paper is a temporal alignment network that ingests
long-term video sequences and associated text sentences, in order to: (1) determine
if a sentence is alignable with the video; and (2) if it is alignable, then
determine its alignment. The challenge is to train such networks from
large-scale datasets, such as HowTo100M, where the associated text sentences
have significant noise, and are only weakly aligned when relevant. Apart from
proposing the alignment network, we also make four contributions: (i) we
describe a novel co-training method that enables denoising and training on raw
instructional videos without using manual annotation, despite the considerable
noise; (ii) to benchmark the alignment performance, we manually curate a
10-hour subset of HowTo100M, totalling 80 videos, with sparse temporal
descriptions. Our proposed model, trained on HowTo100M, outperforms strong
baselines (CLIP, MIL-NCE) on this alignment dataset by a significant margin;
(iii) we apply the trained model in the zero-shot setting to multiple
downstream video understanding tasks and achieve state-of-the-art results,
including text-video retrieval on YouCook2, and weakly supervised video action
segmentation on Breakfast-Action; (iv) we use the automatically aligned
HowTo100M annotations for end-to-end finetuning of the backbone model, and
obtain improved performance on downstream action recognition tasks.
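
To make the alignment task concrete, the sketch below shows one plausible inference-time formulation: given per-sentence text embeddings and per-clip video embeddings in a shared space, a sentence is declared alignable if its best similarity to any clip exceeds a threshold, and is then aligned to the highest-scoring clip. The embedding shapes, the threshold, and the argmax assignment are illustrative assumptions, not the paper's actual trained architecture.

```python
import numpy as np

def align_sentences(text_emb: np.ndarray,   # (S, D): one embedding per sentence
                    video_emb: np.ndarray,  # (T, D): one embedding per video clip
                    threshold: float = 0.5):
    """Toy alignment: decide alignability and pick the best clip per sentence.

    A simplified sketch of the task described in the abstract, not the
    paper's model (which is trained end-to-end on HowTo100M).
    """
    # L2-normalise so dot products are cosine similarities
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    video_emb = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)

    sim = text_emb @ video_emb.T          # (S, T) sentence-to-clip similarities
    best_clip = sim.argmax(axis=1)        # most similar clip for each sentence
    best_score = sim.max(axis=1)
    alignable = best_score > threshold    # (1) is the sentence alignable at all?

    # (2) for alignable sentences, report the aligned clip index; -1 otherwise
    alignment = np.where(alignable, best_clip, -1)
    return alignable, alignment

# Example with random features: 3 sentences against 10 one-second clips
rng = np.random.default_rng(0)
ok, idx = align_sentences(rng.normal(size=(3, 256)), rng.normal(size=(10, 256)))
print(ok, idx)
```

In practice the threshold and the similarity model would be learned from the (noisy) HowTo100M narration timestamps rather than fixed by hand, which is what the co-training procedure in contribution (i) addresses.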
Related papers
- Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning [102.54669633984278]
We propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks.
We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization.
arXiv Detail & Related papers (2024-02-18T03:04:38Z) - Multi-Sentence Grounding for Long-term Instructional Video [63.27905419718045]
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset.
We construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep.
arXiv Detail & Related papers (2023-12-21T17:28:09Z) - HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and the video level (a minimal sketch of such a two-level objective follows this list).
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves state-of-the-art results.
arXiv Detail & Related papers (2023-01-05T21:53:19Z) - TempCLR: Temporal Alignment Representation with Contrastive Learning [35.12182087403215]
We propose a contrastive learning framework TempCLR to compare the full video and the paragraph explicitly.
In addition to pre-training on the video and paragraph, our approach also generalizes to matching between video instances.
arXiv Detail & Related papers (2022-12-28T08:10:31Z) - TL;DW? Summarizing Instructional Videos with Task Relevance &
Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z) - Self-supervised Video Representation Learning by Context and Motion
Decoupling [45.510042484456854]
A challenge in self-supervised video representation learning is how to effectively capture motion information besides context bias.
We develop a method that explicitly decouples motion supervision from context bias through a carefully designed pretext task.
Experiments show that our approach improves the quality of the learned video representation over previous works.
arXiv Detail & Related papers (2021-04-02T02:47:34Z) - Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [80.7397409377659]
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image and video text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
arXiv Detail & Related papers (2021-04-01T17:48:27Z)