Video-Text as Game Players: Hierarchical Banzhaf Interaction for
Cross-Modal Representation Learning
- URL: http://arxiv.org/abs/2303.14369v1
- Date: Sat, 25 Mar 2023 05:47:52 GMT
- Title: Video-Text as Game Players: Hierarchical Banzhaf Interaction for
Cross-Modal Representation Learning
- Authors: Peng Jin, Jinfa Huang, Pengfei Xiong, Shangxuan Tian, Chang Liu,
Xiangyang Ji, Li Yuan, Jie Chen
- Abstract summary: We creatively model video-text as game players with multivariate cooperative game theory.
We propose Hierarchical Banzhaf Interaction (HBI) to value possible correspondence between video frames and text words.
By stacking token merge modules, we achieve cooperative games at different semantic levels.
- Score: 41.1802201408379
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive learning-based video-language representation learning approaches,
e.g., CLIP, which pursue semantic interaction upon pre-defined video-text pairs, have
achieved outstanding performance. To move beyond this coarse-grained global interaction,
we must tackle the more challenging problem of fine-grained cross-modal interaction. In
this paper, we model video-text as game players with multivariate cooperative game theory
to handle the uncertainty during fine-grained semantic interaction, which comes with
diverse granularity, flexible combination, and vague intensity. Concretely, we propose
Hierarchical Banzhaf Interaction (HBI) to value the possible correspondences between
video frames and text words for sensitive and explainable cross-modal contrast. To
efficiently realize the cooperative game among multiple video frames and multiple text
words, the proposed method clusters the original video frames (text words) and computes
the Banzhaf Interaction between the merged tokens. By stacking token merge modules, we
achieve cooperative games at different semantic levels. Extensive experiments on commonly
used text-video retrieval and video-question answering benchmarks show superior
performance and justify the efficacy of our HBI. More encouragingly, it can also serve as
a visualization tool that promotes the understanding of cross-modal interaction, which
may have a far-reaching impact on the community.
Project page is available at https://jpthu17.github.io/HBI/.
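For intuition, here is a minimal sketch of the quantity HBI builds on: the Banzhaf Interaction index between a (merged) video-frame token and a (merged) text-word token, treating tokens as players in a cooperative game. The cosine-similarity value function, the toy token names, and the query embedding below are illustrative assumptions, not the authors' implementation; in the paper the representations and the token merge modules are learned.

```python
# Illustrative sketch (not the authors' implementation): the Banzhaf
# Interaction index between players i and j averages the "synergy"
#   v(S + {i,j}) - v(S + {i}) - v(S + {j}) + v(S)
# over all coalitions S drawn from the remaining players. The value
# function here (cosine similarity between a mean-pooled coalition
# embedding and a query embedding) is an assumed stand-in.
from itertools import combinations

import numpy as np


def value(coalition, embeddings, query):
    """Assumed payoff: cosine similarity of the mean-pooled coalition embedding and a query."""
    if not coalition:
        return 0.0
    pooled = np.mean([embeddings[k] for k in coalition], axis=0)
    return float(pooled @ query / (np.linalg.norm(pooled) * np.linalg.norm(query) + 1e-8))


def banzhaf_interaction(i, j, players, embeddings, query):
    """Banzhaf Interaction: average synergy of i and j over all coalitions of the other players."""
    rest = [p for p in players if p not in (i, j)]
    total, count = 0.0, 0
    for r in range(len(rest) + 1):
        for S in combinations(rest, r):
            S = list(S)
            total += (
                value(S + [i, j], embeddings, query)
                - value(S + [i], embeddings, query)
                - value(S + [j], embeddings, query)
                + value(S, embeddings, query)
            )
            count += 1
    return total / count


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = ["frame_0", "frame_1", "word_0", "word_1"]  # toy merged tokens (hypothetical)
    emb = {t: rng.normal(size=8) for t in tokens}
    query = rng.normal(size=8)  # e.g., a sentence-level embedding (hypothetical)
    print(banzhaf_interaction("frame_0", "word_0", tokens, emb, query))
```

The exact index enumerates all 2^(n-2) coalitions of the remaining tokens, which is intractable for long videos and sentences; this is why the abstract describes clustering frames and words into a small number of merged tokens, and stacking such merges, before computing the interaction.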
Related papers
- Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain.
We introduce a new approach that models video-text as game players using multivariate cooperative game theory.
We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to SOTA while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- HunYuan_tvr for Text-Video Retrivial [23.650824732136158]
HunYuan_tvr explores hierarchical cross-modal interactions by simultaneously exploring video-sentence, clip-phrase, and frame-word relationships.
HunYuan_tvr obtains new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 57.8%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet respectively.
arXiv Detail & Related papers (2022-04-07T11:59:36Z)
- Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos [92.18898962396042]
We propose a prompt-based framework, Bridge-Prompt, to model the semantics across adjacent actions.
We reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics.
Br-Prompt achieves state-of-the-art on multiple benchmarks.
arXiv Detail & Related papers (2022-03-26T15:52:27Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning [0.0]
Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics.
We propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities.
arXiv Detail & Related papers (2020-11-01T18:54:09Z)