C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual
Text-Video Retrieval
- URL: http://arxiv.org/abs/2210.03625v2
- Date: Tue, 9 May 2023 19:58:59 GMT
- Title: C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual
Text-Video Retrieval
- Authors: Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas,
Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde
Kuehne, James Glass
- Abstract summary: We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval.
Inspired by the fact that text-video retrieval in English outperforms retrieval in other languages, we train a student model using input text in different languages to match the cross-modal predictions of teacher models using English input text.
We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset into 8 other languages.
- Score: 39.41224716332499
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual text-video retrieval methods have improved significantly in
recent years, but the performance for other languages lags behind that for English. We
propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve
multilingual text-video retrieval. Inspired by the fact that text-video
retrieval in English outperforms retrieval in other languages, we train a
student model using input
text in different languages to match the cross-modal predictions from teacher
models using input text in English. We propose a cross entropy based objective
which forces the distribution over the student's text-video similarity scores
to be similar to those of the teacher models. We introduce a new multilingual
video dataset, Multi-YouCook2, by translating the English captions in the
YouCook2 video dataset to 8 other languages. Our method improves multilingual
text-video retrieval performance on Multi-YouCook2 and several other datasets
such as Multi-MSRVTT and VATEX. We also conduct an analysis of the
effectiveness of different multilingual text models as teachers. The code,
models, and dataset are available at https://github.com/roudimit/c2kd.
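The distillation objective described above lends itself to a short sketch. The following PyTorch snippet is a minimal, hypothetical illustration (not the authors' released code; see the repository above for that): in-batch text-video similarity matrices from the teacher (English text) and the student (non-English text) are turned into distributions with a softmax, and the student is trained with cross-entropy against the teacher's distribution. The temperature `tau`, the batch shapes, and showing only the text-to-video direction are all assumptions.

```python
import torch
import torch.nn.functional as F

def c2kd_loss(student_sim, teacher_sim, tau=0.07):
    """Cross-entropy between teacher and student distributions over
    in-batch text-video similarity scores (tau is an assumed temperature)."""
    teacher_probs = F.softmax(teacher_sim.detach() / tau, dim=-1)  # no gradient to teacher
    student_log_probs = F.log_softmax(student_sim / tau, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Toy usage: random unit vectors stand in for real text/video embeddings.
B, D = 8, 256
video = F.normalize(torch.randn(B, D), dim=-1)
text_en = F.normalize(torch.randn(B, D), dim=-1)   # teacher input: English captions
text_xx = F.normalize(torch.randn(B, D), dim=-1)   # student input: other languages

loss = c2kd_loss(text_xx @ video.T, text_en @ video.T)
```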
Related papers
- MuMUR: Multilingual Multimodal Universal Retrieval [19.242056928318913]
We propose MuMUR, a framework that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.
We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual visual-text pairs.
We then use this data to learn a joint vision-text representation where English and non-English text queries are represented in a common embedding space.
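As a rough sketch of the pipeline this entry describes, the snippet below expands English visual-text pairs into pseudo ground-truth multilingual pairs and trains a shared embedding space with a symmetric contrastive loss. It is hypothetical, not the MuMUR implementation: `translate` is a placeholder for an off-the-shelf MT system, and the temperature is assumed.

```python
import torch
import torch.nn.functional as F

def translate(caption: str, lang: str) -> str:
    """Placeholder for a machine translation model (assumption, not MuMUR's)."""
    raise NotImplementedError

def pseudo_pairs(dataset, langs):
    """Expand (visual_id, English caption) pairs into pseudo ground-truth
    multilingual visual-text pairs via machine translation."""
    for visual_id, caption in dataset:
        yield visual_id, caption                      # keep the original English pair
        for lang in langs:
            yield visual_id, translate(caption, lang)

def joint_embedding_loss(text_emb, visual_emb, tau=0.05):
    """Symmetric in-batch contrastive loss: English and non-English queries
    are embedded into one common space with the visual features."""
    logits = F.normalize(text_emb, dim=-1) @ F.normalize(visual_emb, dim=-1).T / tau
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```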
arXiv Detail & Related papers (2022-08-24T13:55:15Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text.
There is not a clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Breaking Down Multilingual Machine Translation [74.24795388967907]
We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs).
Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al.
arXiv Detail & Related papers (2021-10-15T14:57:12Z)
- Towards Zero-shot Cross-lingual Image Retrieval and Tagging [1.4425878137951236]
We present a zero-shot approach for learning multi-modal representations using cross-lingual pre-training on the text side.
We introduce a new 1K multi-lingual MSCOCO2014 caption test dataset (XTD10) in 7 languages that we collected using a crowdsourcing platform.
We also demonstrate how a cross-lingual model can be used for downstream tasks like multi-lingual image tagging in a zero-shot manner.
arXiv Detail & Related papers (2021-09-15T23:39:15Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
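The two pre-training tasks are only named in this summary; as a loose, hypothetical sketch of what the data side of Visual Translation Language Modeling could look like (assumed mask id and masking rate, not UC2's actual implementation), one can mask tokens across a parallel caption pair and predict them with the image regions as extra context:

```python
import random
import torch

MASK_ID = 103   # assumed [MASK] token id in a hypothetical vocabulary
IGNORE = -100   # label value ignored by the masked-language-modeling loss

def mask_tokens(token_ids, prob=0.15):
    """Randomly mask tokens, keeping the originals as prediction targets."""
    masked, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < prob:
            masked[i], labels[i] = MASK_ID, tok
    return masked, labels

def vtlm_example(en_ids, xx_ids, region_feats):
    """Concatenate a caption and its translation, mask both, and keep the
    image region features as unmasked visual context for prediction."""
    masked, labels = mask_tokens(en_ids + xx_ids)
    return torch.tensor(masked), torch.tensor(labels), region_feats
```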
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models [144.85290716246533]
We study zero-shot cross-lingual transfer of vision-language models.
We propose a Transformer-based model that learns contextualized multilingual multimodal embeddings.
arXiv Detail & Related papers (2021-03-16T04:37:40Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
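A minimal sketch of the contrastive pre-training idea in the InfoXLM entry above, assuming in-batch negatives over parallel sentence pairs; the temperature and the use of pooled sentence embeddings are illustrative choices, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def cross_lingual_contrastive_loss(src_emb, tgt_emb, tau=0.1):
    """InfoNCE-style loss: row i of each matrix encodes a sentence and its
    translation (positives); other rows in the batch act as negatives."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / tau
    targets = torch.arange(logits.size(0))
    return F.cross_entropy(logits, targets)
```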