WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling
Vision-Language Models Through Open-Vocabulary Knowledge
- URL: http://arxiv.org/abs/2312.09507v3
- Date: Wed, 10 Jan 2024 21:40:46 GMT
- Title: WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling
Vision-Language Models Through Open-Vocabulary Knowledge
- Authors: Huy Le, Tung Kieu, Anh Nguyen, Ngan Le
- Abstract summary: $\texttt{WAVER}$ is a cross-domain knowledge distillation framework via vision-language models.
$\texttt{WAVER}$ capitalizes on the open-vocabulary properties that lie in pre-trained vision-language models.
It can achieve state-of-the-art performance in the text-video retrieval task while handling writing-style variations.
- Score: 12.034917651508524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-video retrieval, a prominent sub-field within the domain of multimodal
information retrieval, has witnessed remarkable growth in recent years.
However, existing methods assume video scenes are consistent with unbiased
descriptions. This assumption fails to align with real-world scenarios since
descriptions can be influenced by annotator biases, diverse writing styles, and
varying textual perspectives. To overcome the aforementioned problems, we
introduce $\texttt{WAVER}$, a cross-domain knowledge distillation framework via
vision-language models through open-vocabulary knowledge designed to tackle the
challenge of handling different writing styles in video descriptions.
$\texttt{WAVER}$ capitalizes on the open-vocabulary properties that lie in
pre-trained vision-language models and employs an implicit knowledge
distillation approach to transfer text-based knowledge from a teacher model to
a vision-based student. Empirical studies conducted across four standard
benchmark datasets, encompassing various settings, provide compelling evidence
that $\texttt{WAVER}$ can achieve state-of-the-art performance in the text-video
retrieval task while handling writing-style variations. The code is available
at: https://github.com/Fsoft-AIC/WAVER
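The abstract describes an implicit knowledge distillation objective that transfers text-based knowledge from a teacher vision-language model to a vision-based student. The snippet below is a minimal, illustrative sketch of such a distillation loss for text-video retrieval, assuming a PyTorch setup; the function name, temperature values, and loss weighting are assumptions for illustration, not WAVER's actual implementation.

```python
# Illustrative sketch only: distillation of a frozen VLM teacher's
# text-video similarity distribution into a student retrieval model.
# Function name, temperatures, and weighting are assumed, not from the paper.
import torch
import torch.nn.functional as F

def retrieval_distillation_loss(student_sims, teacher_sims, labels,
                                tau_kd=2.0, alpha=0.5):
    """student_sims: [B, B] similarities from the vision-based student
    teacher_sims:    [B, B] similarities from the text-based teacher
    labels:          [B]    index of the matching video for each caption
                            (assumes caption i pairs with video i, so the
                             same labels apply to the transposed direction)
    """
    # Standard symmetric InfoNCE-style retrieval loss (text->video, video->text).
    ce = 0.5 * (F.cross_entropy(student_sims, labels)
                + F.cross_entropy(student_sims.t(), labels))

    # Implicit distillation: match softened similarity distributions.
    p_teacher = F.softmax(teacher_sims / tau_kd, dim=-1)
    log_p_student = F.log_softmax(student_sims / tau_kd, dim=-1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau_kd ** 2

    return ce + alpha * kd

# Toy usage with random features standing in for encoder outputs.
B, D = 8, 512
text_feat = F.normalize(torch.randn(B, D), dim=-1)   # student text encoder output
video_feat = F.normalize(torch.randn(B, D), dim=-1)  # student video encoder output
teacher_sims = torch.randn(B, B)                      # frozen VLM teacher scores
labels = torch.arange(B)
loss = retrieval_distillation_loss(text_feat @ video_feat.t() / 0.07,
                                   teacher_sims, labels)
```

The KL term softens both similarity matrices with a shared temperature, so the student inherits the teacher's ranking over candidate videos rather than only the hard positive pair.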
Related papers
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net)
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
- LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [58.75140338866403]
DVDet is a Descriptor-Enhanced Open Vocabulary Detector.
It transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training.
Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.
arXiv Detail & Related papers (2024-02-07T07:26:49Z)
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated and unpaired data, which uses only text queries together with uncurated web videos during training.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- External Knowledge Augmented Text Visual Question Answering [0.6445605125467573]
We propose a framework to extract, filter, and encode knowledge atop a standard multimodal transformer for vision language understanding tasks.
We generate results comparable to the state-of-the-art on two publicly available datasets.
arXiv Detail & Related papers (2021-08-22T13:21:58Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this being integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)