Temporal Alignment Networks for Long-term Video
- URL: http://arxiv.org/abs/2204.02968v1
- Date: Wed, 6 Apr 2022 17:59:46 GMT
- Title: Temporal Alignment Networks for Long-term Video
- Authors: Tengda Han, Weidi Xie, Andrew Zisserman
- Abstract summary: We propose a temporal alignment network that ingests long-term video sequences and associated text sentences.
We train such networks from large-scale datasets, such as HowTo100M, where the associated text sentences have significant noise.
Our proposed model, trained on HowTo100M, outperforms strong baselines (CLIP, MIL-NCE) on this alignment dataset.
- Score: 103.69904379356413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of this paper is a temporal alignment network that ingests
long-term video sequences and associated text sentences, in order to: (1) determine
if a sentence is alignable with the video; and (2) if it is alignable, then
determine its alignment. The challenge is to train such networks from
large-scale datasets, such as HowTo100M, where the associated text sentences
have significant noise, and are only weakly aligned when relevant. Apart from
proposing the alignment network, we also make four contributions: (i) we
describe a novel co-training method that enables denoising and training on raw
instructional videos without using manual annotation, despite the considerable
noise; (ii) to benchmark the alignment performance, we manually curate a
10-hour subset of HowTo100M, totalling 80 videos, with sparse temporal
descriptions. Our proposed model, trained on HowTo100M, outperforms strong
baselines (CLIP, MIL-NCE) on this alignment dataset by a significant margin;
(iii) we apply the trained model in the zero-shot setting to multiple
downstream video understanding tasks and achieve state-of-the-art results,
including text-video retrieval on YouCook2, and weakly supervised video action
segmentation on Breakfast-Action; (iv) we use the automatically aligned
HowTo100M annotations for end-to-end finetuning of the backbone model, and
obtain improved performance on downstream action recognition tasks.
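
To make the alignment task concrete, the sketch below shows one plausible inference-time formulation: given per-sentence text embeddings and per-clip video embeddings in a shared space, a sentence is declared alignable if its best similarity to any clip exceeds a threshold, and is then aligned to the highest-scoring clip. The embedding shapes, the threshold, and the argmax assignment are illustrative assumptions, not the paper's actual trained architecture.

```python
import numpy as np

def align_sentences(text_emb: np.ndarray,   # (S, D): one embedding per sentence
                    video_emb: np.ndarray,  # (T, D): one embedding per video clip
                    threshold: float = 0.5):
    """Toy alignment: decide alignability and pick the best clip per sentence.

    A simplified sketch of the task described in the abstract, not the
    paper's model (which is trained end-to-end on HowTo100M).
    """
    # L2-normalise so dot products are cosine similarities
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    video_emb = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)

    sim = text_emb @ video_emb.T          # (S, T) sentence-to-clip similarities
    best_clip = sim.argmax(axis=1)        # most similar clip for each sentence
    best_score = sim.max(axis=1)
    alignable = best_score > threshold    # (1) is the sentence alignable at all?

    # (2) for alignable sentences, report the aligned clip index; -1 otherwise
    alignment = np.where(alignable, best_clip, -1)
    return alignable, alignment

# Example with random features: 3 sentences against 10 one-second clips
rng = np.random.default_rng(0)
ok, idx = align_sentences(rng.normal(size=(3, 256)), rng.normal(size=(10, 256)))
print(ok, idx)
```

In practice the threshold and the similarity model would be learned from the (noisy) HowTo100M narration timestamps rather than fixed by hand, which is what the co-training procedure in contribution (i) addresses.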
Related papers
- Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning [102.54669633984278]
We propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks.
We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization.
arXiv Detail & Related papers (2024-02-18T03:04:38Z) - Multi-Sentence Grounding for Long-term Instructional Video [63.27905419718045]
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset.
We construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep.
arXiv Detail & Related papers (2023-12-21T17:28:09Z) - HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and the video level (a minimal sketch of such a two-level objective follows this list).
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves state-of-the-art results.
arXiv Detail & Related papers (2023-01-05T21:53:19Z) - TempCLR: Temporal Alignment Representation with Contrastive Learning [35.12182087403215]
We propose a contrastive learning framework TempCLR to compare the full video and the paragraph explicitly.
In addition to pre-training on the video and paragraph, our approach also generalizes to matching between video instances.
arXiv Detail & Related papers (2022-12-28T08:10:31Z) - TL;DW? Summarizing Instructional Videos with Task Relevance &
Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z) - Self-supervised Video Representation Learning by Context and Motion
Decoupling [45.510042484456854]
A challenge in self-supervised video representation learning is how to effectively capture motion information besides context bias.
We develop a method that explicitly decouples motion supervision from context bias through a carefully designed pretext task.
Experiments show that our approach improves the quality of the learned video representation over previous works.
arXiv Detail & Related papers (2021-04-02T02:47:34Z) - Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [80.7397409377659]
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image and video text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
arXiv Detail & Related papers (2021-04-01T17:48:27Z)