COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
- URL: http://arxiv.org/abs/2011.00597v1
- Date: Sun, 1 Nov 2020 18:54:09 GMT
- Title: COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
- Authors: Simon Ging (1), Mohammadreza Zolfaghari (1), Hamed Pirsiavash (2),
Thomas Brox (1) ((1) University of Freiburg, (2) University of Maryland
Baltimore County)
- Abstract summary: Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics.
We propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many real-world video-text tasks involve different levels of granularity,
such as frames and words, clips and sentences, or videos and paragraphs, each
with distinct semantics. In this paper, we propose a Cooperative hierarchical
Transformer (COOT) to leverage this hierarchy information and model the
interactions between different levels of granularity and different modalities.
The method consists of three major components: an attention-aware feature
aggregation layer, which leverages the local temporal context (intra-level,
e.g., within a clip), a contextual transformer to learn the interactions
between low-level and high-level semantics (inter-level, e.g., clip-video,
sentence-paragraph), and a cross-modal cycle-consistency loss to connect video
and text. The resulting method compares favorably to the state of the art on
several benchmarks while having few parameters. All code is available
open-source at https://github.com/gingsi/coot-videotext
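To make the three components above more concrete, here is a minimal PyTorch-style sketch of two of them: the attention-aware feature aggregation (a learned query pooling frame or word features into a clip or sentence embedding) and a cross-modal cycle-consistency loss tying clip and sentence embeddings together. This is an illustration, not the authors' released code (see the repository linked above); the module names, dimensions, and the soft nearest-neighbour formulation of the cycle loss are assumptions, and the contextual transformer for inter-level interactions is omitted.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPool(nn.Module):
    """Attention-aware aggregation over a local temporal context: a learned
    query attends over the features of one level (e.g. the frames of a clip)
    and returns a single pooled embedding for the next level up."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                              # x: (batch, seq, dim)
        q = self.query.expand(x.size(0), -1, -1)       # one query per item
        pooled, _ = self.attn(q, x, x)                 # (batch, 1, dim)
        return pooled.squeeze(1)                       # (batch, dim)


def cycle_consistency_loss(clip_emb, sent_emb, temperature=0.1):
    """Cross-modal cycle: each clip picks a soft nearest sentence, which in
    turn should point back to the clip it came from (a simplified,
    one-directional variant of the loss named in the abstract)."""
    v = F.normalize(clip_emb, dim=-1)                  # (n, dim)
    t = F.normalize(sent_emb, dim=-1)                  # (n, dim)
    attn = F.softmax(v @ t.t() / temperature, dim=-1)  # clip -> sentences
    soft_sent = attn @ t                               # soft nearest sentence
    logits = soft_sent @ v.t() / temperature           # sentence -> clips
    target = torch.arange(v.size(0), device=v.device)  # should cycle back
    return F.cross_entropy(logits, target)


# Toy usage: pool frame features into clip embeddings and word features into
# sentence embeddings, then tie the two modalities with the cycle loss.
pool_video, pool_text = AttentionPool(64), AttentionPool(64)
frame_feats = torch.randn(8, 16, 64)  # 8 clips x 16 frames x 64-d features
word_feats = torch.randn(8, 20, 64)   # 8 sentences x 20 words x 64-d features
loss = cycle_consistency_loss(pool_video(frame_feats), pool_text(word_feats))
loss.backward()
```
In the full method, the pooled clip and sentence embeddings would additionally pass through the contextual transformer to produce video- and paragraph-level embeddings before the losses are applied.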
Related papers
- GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts [48.28000728061778]
We propose a method that integrates an open vocabulary scene encoder into the architecture, establishing a robust connection between text and scene.
Our methodology achieves up to a 30% reduction in the goal object distance metric compared to the prior state-of-the-art baseline model.
arXiv Detail & Related papers (2024-04-08T18:24:12Z)
- Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation.
For the first time, MUTR provides a unified framework: it adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval [66.2075707179047]
We propose RoME, a novel mixture-of-experts transformer that disentangles the text and the video into three levels.
We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels.
Our method outperforms the state-of-the-art methods on the YouCook2 and MSR-VTT datasets.
arXiv Detail & Related papers (2022-06-26T11:12:49Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- HANet: Hierarchical Alignment Networks for Video-Text Retrieval [15.91922397215452]
Video-text retrieval is an important yet challenging task in vision-language understanding.
Most current works simply measure the video-text similarity based on video-level and text-level embeddings.
We propose a Hierarchical Alignment Network (HANet) to align different level representations for video-text matching.
arXiv Detail & Related papers (2021-07-26T09:28:50Z)
- Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning [72.52804406378023]
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web.
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning model, which decomposes video-text matching into global-to-local levels.
arXiv Detail & Related papers (2020-03-01T03:44:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.