MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in
Indonesian
- URL: http://arxiv.org/abs/2306.11341v1
- Date: Tue, 20 Jun 2023 07:19:36 GMT
- Title: MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in
Indonesian
- Authors: Willy Fitra Hendria
- Abstract summary: We construct the first public Indonesian video-text dataset by translating the English sentences from the MSVD dataset into Indonesian.
We then train neural network models originally developed for English video-text datasets on three tasks: text-to-video retrieval, video-to-text retrieval, and video captioning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal learning on video and text data has been receiving growing
attention from many researchers in various research tasks, including
text-to-video retrieval, video-to-text retrieval, and video captioning.
Although many algorithms have been proposed for these challenging tasks, most
of them are developed on English-language datasets. Despite Indonesian being
one of the most widely spoken languages in the world, research progress on
multimodal video-text tasks with Indonesian sentences remains under-explored,
likely due to the absence of a public benchmark dataset. To address this issue,
we construct the first public Indonesian video-text dataset by translating the
English sentences from the MSVD dataset into Indonesian. Using our dataset, we
then train neural network models originally developed for English video-text
datasets on three tasks: text-to-video retrieval, video-to-text retrieval, and
video captioning. Recent neural network-based approaches to video-text tasks
often rely on a feature extractor pretrained primarily on English
vision-language data. Since pretraining resources with Indonesian sentences are
relatively limited, the applicability of those approaches to our dataset is
uncertain. To overcome the lack of pretraining resources, we apply
cross-lingual transfer learning: we use feature extractors pretrained on
English data and then fine-tune the models on our Indonesian dataset. Our
experimental results show that this approach improves performance on all
metrics for all three tasks. Finally, we discuss potential future work using
our dataset, to inspire further research on Indonesian multimodal video-text
tasks. We believe that our dataset and experimental results can provide
valuable contributions to the community. Our dataset is available on GitHub.
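
The abstract does not state which translation system was used to produce the Indonesian captions, so the dataset-construction step can only be sketched. The snippet below is a minimal illustration, assuming a publicly available MarianMT English-to-Indonesian model (Helsinki-NLP/opus-mt-en-id) on Hugging Face and a hypothetical file msvd_captions_en.txt with one English MSVD caption per line; it is not the authors' actual pipeline.

```python
# Minimal sketch (an assumption, not the authors' pipeline): translate MSVD
# English captions into Indonesian with a public MarianMT model.
# Requires: pip install torch transformers sentencepiece
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-en-id"   # public English -> Indonesian MT model
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)

def translate_batch(sentences, max_length=64):
    """Translate a batch of English sentences into Indonesian."""
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch, max_length=max_length)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Hypothetical input file: one English MSVD caption per line.
with open("msvd_captions_en.txt", encoding="utf-8") as f:
    captions_en = [line.strip() for line in f if line.strip()]

captions_id = []
for i in range(0, len(captions_en), 32):    # translate in small batches
    captions_id.extend(translate_batch(captions_en[i:i + 32]))

with open("msvd_captions_id.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(captions_id))
```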
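
The cross-lingual transfer recipe described in the abstract (reuse feature extractors pretrained on English data, then fine-tune on video and Indonesian-caption pairs) can be illustrated with a generic dual-encoder retrieval sketch. The projection heads, feature dimensions, and toy batch below are assumptions made purely for illustration; the actual experiments use the existing English-developed models referenced in the abstract.

```python
# Generic sketch of cross-lingual transfer for text-to-video retrieval:
# features come from encoders pretrained on English vision-language data, and
# lightweight heads are fine-tuned on (video, Indonesian caption) pairs with a
# symmetric contrastive (InfoNCE) loss. All shapes and batches are toy values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Projects precomputed features into a shared video-text embedding space."""
    def __init__(self, in_dim, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.proj(x)

def contrastive_loss(video_emb, text_emb, temperature=0.05):
    """Symmetric InfoNCE over a batch of matched video/text embeddings."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +          # video -> text direction
            F.cross_entropy(logits.t(), targets)) / 2   # text -> video direction

# Features assumed to be extracted offline by an English-pretrained backbone (768-d).
video_feats = torch.randn(8, 768)    # 8 videos
text_feats = torch.randn(8, 768)     # their matching Indonesian captions

video_head = ProjectionHead(in_dim=768)
text_head = ProjectionHead(in_dim=768)
optimizer = torch.optim.AdamW(
    list(video_head.parameters()) + list(text_head.parameters()), lr=1e-4)

for step in range(3):                # a few illustrative fine-tuning steps
    loss = contrastive_loss(video_head(video_feats), text_head(text_feats))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss = {loss.item():.4f}")
```

Nothing in this objective is specific to English; only the text inputs change, which is the core of the transfer setup the abstract describes.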
Related papers
- ViLCo-Bench: VIdeo Language COntinual learning Benchmark [8.660555226687098]
We present ViLCo-Bench, designed to evaluate continual learning models across a range of video-text tasks.
The dataset comprises ten-minute-long videos and corresponding language queries collected from publicly available datasets.
We introduce a novel memory-efficient framework that incorporates self-supervised learning and mimics long-term and short-term memory effects.
arXiv Detail & Related papers (2024-06-19T00:38:19Z)
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval [39.41224716332499]
We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval.
Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages.
We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset to 8 other languages.
arXiv Detail & Related papers (2022-10-07T15:30:24Z)
- MuMUR: Multilingual Multimodal Universal Retrieval [19.242056928318913]
We propose a framework, MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.
We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual visual-text pairs.
We then use this data to learn a joint vision-text representation where English and non-English text queries are represented in a common embedding space.
arXiv Detail & Related papers (2022-08-24T13:55:15Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding [41.691861010118394]
We introduce the first-ever vast resource for training, evaluating, and benchmarking Indonesian natural language understanding tasks.
IndoNLU includes twelve tasks, ranging from single-sentence classification to sentence-pair sequence labeling, with different levels of complexity.
The datasets for the tasks lie in different domains and styles to ensure task diversity.
We also provide a set of Indonesian pre-trained models (IndoBERT) trained from a large and clean Indonesian dataset Indo4B.
arXiv Detail & Related papers (2020-09-11T12:21:41Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)