Percept, Chat, and then Adapt: Multimodal Knowledge Transfer of
Foundation Models for Open-World Video Recognition
- URL: http://arxiv.org/abs/2402.18951v1
- Date: Thu, 29 Feb 2024 08:29:03 GMT
- Title: Percept, Chat, and then Adapt: Multimodal Knowledge Transfer of
Foundation Models for Open-World Video Recognition
- Authors: Boyu Chen, Siran Chen, Kunchang Li, Qinglin Xu, Yu Qiao, Yali Wang
- Abstract summary: We propose a generic knowledge transfer pipeline to boost open-world video recognition.
We name it PCA, based on three stages of Percept, Chat, and Adapt.
Our approach achieves state-of-the-art performance on all three datasets.
- Score: 36.56176821492121
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-world video recognition is challenging since traditional networks
do not generalize well to complex environmental variations. Alternatively,
foundation models with rich knowledge have recently shown their generalization
power. However, how to apply such knowledge has not been fully explored for
open-world video recognition. To this end, we propose a generic knowledge
transfer pipeline, which progressively exploits and integrates external
multimodal knowledge from foundation models to boost open-world video
recognition. We name it PCA, based on three stages of Percept, Chat, and Adapt.
First, we perform the Percept process to reduce the video domain gap and obtain
external visual knowledge. Second, we generate rich linguistic semantics as
external textual knowledge in the Chat stage. Finally, we blend external multimodal
knowledge in the Adapt stage by inserting multimodal knowledge adaptation modules
into networks. We conduct extensive experiments on three challenging open-world
video benchmarks, i.e., TinyVIRAT, ARID, and QV-Pipe. Our approach achieves
state-of-the-art performance on all three datasets.
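
The abstract describes the three-stage flow only at a high level. Below is a minimal sketch of how such a Percept-Chat-Adapt pipeline could be wired together in PyTorch; the class names (PCAPipeline, KnowledgeAdapter), the frozen foundation-model stand-ins, and all dimensions are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of a Percept -> Chat -> Adapt pipeline.
# Names and dimensions are illustrative, not taken from the paper.
import torch
import torch.nn as nn


class KnowledgeAdapter(nn.Module):
    """Bottleneck module that blends external multimodal knowledge into a backbone feature."""

    def __init__(self, dim: int, knowledge_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim + knowledge_dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, feat: torch.Tensor, knowledge: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([feat, knowledge], dim=-1)
        return feat + self.up(self.act(self.down(fused)))  # residual injection


class PCAPipeline(nn.Module):
    def __init__(self, backbone: nn.Module, visual_fm: nn.Module, text_fm: nn.Module,
                 dim: int = 512, knowledge_dim: int = 768):
        super().__init__()
        self.backbone = backbone            # task network being adapted
        self.visual_fm = visual_fm.eval()   # frozen visual foundation model (Percept)
        self.text_fm = text_fm.eval()       # frozen language model stand-in (Chat)
        self.adapter = KnowledgeAdapter(dim, 2 * knowledge_dim)

    @torch.no_grad()
    def percept(self, video: torch.Tensor) -> torch.Tensor:
        # Stage 1: reduce the video domain gap and extract external visual knowledge.
        return self.visual_fm(video)

    @torch.no_grad()
    def chat(self, video: torch.Tensor) -> torch.Tensor:
        # Stage 2: generate linguistic semantics (e.g. caption embeddings) as textual knowledge.
        return self.text_fm(video)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(video)                                        # task features
        knowledge = torch.cat([self.percept(video), self.chat(video)], dim=-1)
        return self.adapter(feat, knowledge)                               # Stage 3: Adapt
```

Per the abstract, the adaptation modules are inserted into the network itself; this sketch collapses that to a single residual adapter applied to the backbone output for brevity.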
Related papers
- SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge [60.76719375410635]
We propose a new benchmark (SOK-Bench) consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos.
The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving.
We generate associated question-answer pairs and reasoning processes, followed by manual review for quality assurance.
arXiv Detail & Related papers (2024-05-15T21:55:31Z)
- Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting [89.95541601837719]
We take a first exploration to a new paradigm of open visual knowledge extraction.
OpenVik consists of an open relational region detector that detects regions potentially containing relational knowledge, and a visual knowledge generator that produces format-free knowledge by prompting a large multimodality model with the detected region of interest.
arXiv Detail & Related papers (2023-10-28T20:09:29Z)
- Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds [37.22688246779871]
Large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world.
However, LLMs tend to overlook the visual richness of open worlds, rendering the entire interactive process akin to "a blindfolded text-based game".
We propose Steve-Eye, an end-to-end trained large multimodal model designed to address this limitation.
arXiv Detail & Related papers (2023-10-20T03:22:05Z)
- Multimodal Short Video Rumor Detection System Based on Contrastive Learning [3.4192832062683842]
Short video platforms in China have gradually evolved into fertile grounds for the proliferation of fake news.
Distinguishing short video rumors poses a significant challenge due to the substantial amount of information and shared features.
Our research group proposes a methodology encompassing multimodal feature fusion and the integration of external knowledge.
arXiv Detail & Related papers (2023-04-17T16:07:00Z)
- VLG: General Video Recognition with Web Textual Knowledge [47.3660792813967]
We focus on the general video recognition (GVR) problem of solving different recognition tasks within a unified framework.
By leveraging semantic knowledge from noisy text descriptions crawled from the Internet, we present a unified visual-linguistic framework (VLG)
Our VLG is first pre-trained on video and language datasets to learn a shared feature space, and then uses a flexible bi-modal attention head to integrate high-level semantic concepts under different settings.
arXiv Detail & Related papers (2022-12-03T15:46:49Z)
- Distilling Audio-Visual Knowledge by Compositional Contrastive Learning [51.20935362463473]
We learn a compositional embedding that closes the cross-modal semantic gap.
We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
arXiv Detail & Related papers (2021-04-22T09:31:20Z)
- Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into Pretrained Transformers [54.417299589288184]
We investigate models for complementing the distributional knowledge of BERT with conceptual knowledge from ConceptNet and its corresponding Open Mind Common Sense (OMCS) corpus.
Our adapter-based models substantially outperform BERT on inference tasks that require the type of conceptual knowledge explicitly present in ConceptNet and OMCS.
arXiv Detail & Related papers (2020-05-24T15:49:57Z)
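
As background for the adapter-based entry above: adapter-based knowledge injection usually means training small bottleneck modules placed inside an otherwise frozen pretrained transformer. The sketch below is a generic Houlsby-style bottleneck adapter for illustration only; it is not taken from that paper, and the hidden and bottleneck sizes are assumed.

```python
# Generic bottleneck adapter (illustrative; not the paper's exact architecture).
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Small residual module trained on an external knowledge source
    while the pretrained transformer weights stay frozen."""

    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the pretrained representation intact by default.
        return hidden_states + self.up(self.act(self.down(hidden_states)))


class LayerWithAdapter(nn.Module):
    """Wraps an existing transformer sublayer so the adapter runs after it."""

    def __init__(self, layer: nn.Module, hidden_size: int = 768):
        super().__init__()
        self.layer = layer
        self.adapter = BottleneckAdapter(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.layer(x))
```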
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.