VLG: General Video Recognition with Web Textual Knowledge
- URL: http://arxiv.org/abs/2212.01638v1
- Date: Sat, 3 Dec 2022 15:46:49 GMT
- Title: VLG: General Video Recognition with Web Textual Knowledge
- Authors: Jintao Lin, Zhaoyang Liu, Wenhai Wang, Wayne Wu, Limin Wang
- Abstract summary: We focus on the general video recognition (GVR) problem of solving different recognition tasks within a unified framework.
By leveraging semantic knowledge from noisy text descriptions crawled from the Internet, we present a unified visual-linguistic framework (VLG).
Our VLG is first pre-trained on video and language datasets to learn a shared feature space, and then employs a flexible bi-modal attention head to integrate high-level semantic concepts under different settings.
- Score: 47.3660792813967
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video recognition in an open and dynamic world is quite challenging, as we
need to handle different settings such as closed-set, long-tail, few-shot, and
open-set. By leveraging semantic knowledge from noisy text descriptions crawled
from the Internet, we focus on the general video recognition (GVR) problem of
solving different recognition tasks within a unified framework. The core
contribution of this paper is twofold. First, we build a comprehensive video
recognition benchmark, Kinetics-GVR, with four sub-task datasets covering the
settings above. To facilitate research on GVR, we propose to
utilize external textual knowledge from the Internet and provide multi-source
text descriptions for all action classes. Second, inspired by the flexibility
of language representation, we present a unified visual-linguistic framework
(VLG) to solve the problem of GVR by an effective two-stage training paradigm.
Our VLG is first pre-trained on video and language datasets to learn a shared
feature space, and then employs a flexible bi-modal attention head to integrate
high-level semantic concepts under the different settings. Extensive results
show that our VLG obtains state-of-the-art performance under all four settings.
This superior performance demonstrates the effectiveness and generalization
ability of our proposed framework. We hope our work takes a step towards
general video recognition and can serve as a baseline for future research. The
code and models will be available at
https://github.com/MCG-NJU/VLG.
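To make the two-stage paradigm described above more concrete, here is a minimal sketch in PyTorch, assuming a CLIP-style setup: a contrastive loss aligns video and text features in a shared space, and a small cross-attention head then fuses class-level text embeddings with video features. Module names, shapes, and the exact fusion design are illustrative assumptions, not the official VLG implementation from the repository above.

```python
# Hypothetical sketch of the two-stage idea: (1) align video and text features in
# a shared space with a contrastive loss, (2) fuse class-level text embeddings with
# video features through a small cross-attention head. Names, shapes, and
# hyper-parameters are illustrative only and are not taken from the VLG codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F


def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(video_emb.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


class BiModalAttentionHead(nn.Module):
    """Toy bi-modal head: video tokens attend over class-level text embeddings."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, class_text_emb):
        # video_tokens: (B, T, D) frame-level features in the shared space
        # class_text_emb: (C, D) one embedding per action class (e.g. from web text)
        text = class_text_emb.unsqueeze(0).expand(video_tokens.size(0), -1, -1)
        fused, _ = self.cross_attn(query=video_tokens, key=text, value=text)
        video_repr = self.norm(video_tokens + fused).mean(dim=1)  # (B, D)
        # Classify by similarity to the class text embeddings (zero-shot friendly).
        return F.normalize(video_repr, dim=-1) @ F.normalize(class_text_emb, dim=-1).t()
```

Under this reading, stage one would minimize `contrastive_alignment_loss` over paired clips and text, and stage two would apply (or fine-tune) the attention head with class-level embeddings built from the crawled descriptions; the actual VLG training recipe may differ.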
Related papers
- VideoDistill: Language-aware Vision Distillation for Video Question Answering [24.675876324457747]
We propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both vision perception and answer generation process.
VideoDistill generates answers only from question-related visual embeddings.
We conduct experimental evaluations on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-04-01T07:44:24Z)
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate that this simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to around 7% in recall@K=1.
arXiv Detail & Related papers (2024-03-25T17:59:03Z)
- UniVTG: Towards Unified Video-Language Temporal Grounding [52.56732639951834]
Video Temporal Grounding (VTG) aims to ground target clips from videos according to custom language queries.
We propose UniVTG, which unifies the diverse VTG labels and tasks along three directions.
Thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels.
arXiv Detail & Related papers (2023-07-31T14:34:49Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses Text-to-Video expertise to capture temporal saliency in a parameter-free manner (see the sketch after this list).
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets spanning tasks such as video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
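The parameter-free Temporal Concept Spotting idea mentioned in the BIKE entry above can be pictured as reweighting frames by their similarity to a text embedding. The following is a minimal sketch under that assumption; the actual BIKE mechanism may differ, and the function name, encoders, and shapes here are placeholders.

```python
# Hypothetical sketch of parameter-free temporal saliency, loosely inspired by the
# Temporal Concept Spotting idea summarized in the BIKE entry above. The real BIKE
# design may differ; frame/text features and shapes here are placeholders.
import torch
import torch.nn.functional as F


def temporal_saliency_pooling(frame_feats, text_feat, temperature=0.01):
    """Weight frames by their similarity to a text embedding, with no learned parameters.

    frame_feats: (T, D) per-frame features in a shared video-text space.
    text_feat:   (D,)   embedding of the class name or description.
    Returns a single (D,) video representation.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sims = frame_feats @ text_feat                  # (T,) frame-to-text similarity
    weights = F.softmax(sims / temperature, dim=0)  # sharper weights on salient frames
    return weights @ frame_feats                    # similarity-weighted temporal pooling
```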