HierVL: Learning Hierarchical Video-Language Embeddings
- URL: http://arxiv.org/abs/2301.02311v2
- Date: Thu, 8 Jun 2023 14:29:35 GMT
- Title: HierVL: Learning Hierarchical Video-Language Embeddings
- Authors: Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
- Abstract summary: HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling.
- Score: 108.77600799637172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-language embeddings are a promising avenue for injecting semantics into
visual representations, but existing methods capture only short-term
associations between seconds-long video clips and their accompanying text. We
propose HierVL, a novel hierarchical video-language embedding that
simultaneously accounts for both long-term and short-term associations. As
training data, we take videos accompanied by timestamped text descriptions of
human actions, together with a high-level text summary of the activity
throughout the long video (as are available in Ego4D). We introduce a
hierarchical contrastive training objective that encourages text-visual
alignment at both the clip level and video level. While the clip-level
constraints use the step-by-step descriptions to capture what is happening in
that instant, the video-level constraints use the summary text to capture why
it is happening, i.e., the broader context for the activity and the intent of
the actor. Our hierarchical scheme yields a clip representation that
outperforms its single-level counterpart as well as a long-term video
representation that achieves SotA results on tasks requiring long-term video
modeling. HierVL successfully transfers to multiple challenging downstream
tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and
fine-tuned settings.
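The hierarchical objective described above pairs short clips with their timestamped narrations and whole videos with their summary text. Below is a minimal sketch of such a two-level contrastive loss, assuming symmetric InfoNCE at both levels combined by a simple weighted sum; the function and tensor names are illustrative, not HierVL's actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(visual: torch.Tensor, text: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (visual, text) embeddings."""
    visual = F.normalize(visual, dim=-1)
    text = F.normalize(text, dim=-1)
    logits = visual @ text.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(visual.size(0), device=visual.device)  # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def hierarchical_loss(clip_emb, narration_emb, video_emb, summary_emb, alpha: float = 0.5):
    # Clip level: seconds-long clips vs. step-by-step narrations ("what" is happening).
    loss_clip = info_nce(clip_emb, narration_emb)
    # Video level: aggregated long-video features vs. the summary text ("why" it is happening).
    loss_video = info_nce(video_emb, summary_emb)
    return alpha * loss_clip + (1 - alpha) * loss_video
```

Here video_emb would come from aggregating clip features over the long video (e.g., a pooling or self-attention head); the paper's exact aggregation, temperature, and loss weighting may differ.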
Related papers
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- VidLA: Video-Language Alignment at Scale [48.665918882615195]
We propose VidLA, an approach for video-language alignment at scale.
Our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks.
arXiv Detail & Related papers (2024-03-21T22:36:24Z)
- Video ReCap: Recursive Captioning of Hour-Long Videos [42.878517455453824]
Video ReCap can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels.
We utilize a curriculum learning scheme to learn the hierarchical structure of videos, starting from clip-level captions to segment-level descriptions.
Our model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks.
arXiv Detail & Related papers (2024-02-20T18:58:54Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that takes longer subtitle text into account, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel tasks: moment segmentation and step captioning.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z)
- TempCLR: Temporal Alignment Representation with Contrastive Learning [35.12182087403215]
We propose TempCLR, a contrastive learning framework that explicitly compares the full video and the paragraph.
In addition to pre-training on video-paragraph pairs, our approach also generalizes to matching between video instances.
arXiv Detail & Related papers (2022-12-28T08:10:31Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail Enhancement [44.228748086927375]
We introduce OVC-Net, an object-oriented video captioning network built on a temporal graph and detail enhancement.
To demonstrate its effectiveness, we conduct experiments on the new dataset and compare against state-of-the-art video captioning methods.
arXiv Detail & Related papers (2020-03-08T04:34:58Z)