MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian
- URL: http://arxiv.org/abs/2306.11341v2
- Date: Sat, 12 Jul 2025 04:28:35 GMT
- Title: MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian
- Authors: Willy Fitra Hendria
- Abstract summary: We introduce the first public Indonesian video-text dataset by translating the English captions in the MSVD dataset into Indonesian. We evaluate neural network models which were developed for the English video-text dataset on three tasks. We apply a cross-lingual transfer learning approach by leveraging English-pretrained extractors and fine-tuning models on our Indonesian dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal learning on video and text has seen significant progress, particularly in tasks like text-to-video retrieval, video-to-text retrieval, and video captioning. However, most existing methods and datasets focus exclusively on English. Despite Indonesian being one of the most widely spoken languages, multimodal research in Indonesian remains under-explored, largely due to the lack of benchmark datasets. To address this gap, we introduce the first public Indonesian video-text dataset by translating the English captions in the MSVD dataset into Indonesian. Using this dataset, we evaluate neural network models that were developed for the English video-text dataset on three tasks: text-to-video retrieval, video-to-text retrieval, and video captioning. Most existing models rely on feature extractors pretrained on English vision-language datasets, raising concerns about their applicability to Indonesian, given the scarcity of large-scale pretraining resources in the language. We apply a cross-lingual transfer learning approach by leveraging English-pretrained extractors and fine-tuning models on our Indonesian dataset. Experimental results demonstrate that this strategy improves performance across all tasks and metrics. We release our dataset publicly to support future research and hope it will inspire further progress in Indonesian multimodal learning.
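The cross-lingual transfer recipe described in the abstract (keep English-pretrained feature extractors, fine-tune on the translated Indonesian captions) can be pictured with a short sketch. The encoder dimensions, projection heads, and contrastive objective below are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of cross-lingual transfer for text-video retrieval:
# features come from English-pretrained extractors, and lightweight
# projection heads are fine-tuned on Indonesian caption/video pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalModel(nn.Module):
    def __init__(self, video_dim=1024, text_dim=768, embed_dim=256):
        super().__init__()
        # Projection heads on top of the pretrained features.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return self.logit_scale.exp() * (v @ t.T)  # batch similarity matrix

def contrastive_loss(sim):
    # Matched video-text pairs lie on the diagonal of the batch.
    targets = torch.arange(sim.size(0), device=sim.device)
    return (F.cross_entropy(sim, targets) + F.cross_entropy(sim.T, targets)) / 2

model = RetrievalModel()
video_feats = torch.randn(8, 1024)  # e.g., pooled features from a pretrained video backbone
text_feats = torch.randn(8, 768)    # e.g., encoder output for Indonesian captions
loss = contrastive_loss(model(video_feats, text_feats))
loss.backward()  # one fine-tuning step on the Indonesian pairs
```

A symmetric contrastive loss of this kind is the usual objective for the two retrieval tasks; the captioning task would instead fine-tune a text decoder on the translated captions.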
Related papers
- Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models [28.716852515539497]
This study created an extended dataset in multiple languages without relying on machine translation. It examined whether Instruction-Tuning in resource-rich English improves performance in other languages.
arXiv Detail & Related papers (2024-09-03T03:42:56Z)
- ViLCo-Bench: VIdeo Language COntinual learning Benchmark [8.660555226687098]
We present ViLCo-Bench, designed to evaluate continual learning models across a range of video-text tasks.
The dataset comprises ten-minute-long videos and corresponding language queries collected from publicly available datasets.
We introduce a novel memory-efficient framework that incorporates self-supervised learning and mimics long-term and short-term memory effects.
arXiv Detail & Related papers (2024-06-19T00:38:19Z)
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this multilingual dataset outperforms pre-training on English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- TVPR: Text-to-Video Person Retrieval and a New Benchmark [10.960048626531993]
We propose a novel Text-to-Video Person Retrieval (TVPR) task.
Since there is no dataset or benchmark that describes person videos with natural language, we construct a large-scale cross-modal person video dataset.
We introduce a Multielement Feature Guided Fragments Learning (MFGF) strategy, which leverages the cross-modal text-video representations to provide strong text-visual and text-motion matching information.
arXiv Detail & Related papers (2023-07-14T06:34:00Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval [39.41224716332499]
We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval.
Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions of a teacher model that receives English text.
We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset to 8 other languages.
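A minimal sketch of the distillation objective described above follows, assuming the standard temperature-scaled KL formulation; C2KD's actual objective and teacher-student setup may differ.

```python
# Hedged sketch: a teacher scores video-text similarity from English
# captions; a student fed translated captions is trained to match the
# teacher's similarity distribution over the candidate videos.
import torch
import torch.nn.functional as F

def distillation_loss(student_sim, teacher_sim, tau=2.0):
    # KL divergence between softened similarity distributions.
    p_teacher = F.softmax(teacher_sim / tau, dim=-1)
    log_p_student = F.log_softmax(student_sim / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau**2

teacher_sim = torch.randn(8, 8)                      # English text vs. 8 videos
student_sim = torch.randn(8, 8, requires_grad=True)  # translated text vs. 8 videos
loss = distillation_loss(student_sim, teacher_sim)
loss.backward()
```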
arXiv Detail & Related papers (2022-10-07T15:30:24Z)
- MuMUR: Multilingual Multimodal Universal Retrieval [19.242056928318913]
We propose a framework, MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.
We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual visual-text pairs.
We then use this data to learn a joint vision-text representation where English and non-English text queries are represented in a common embedding space.
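The pseudo ground-truth construction step can be sketched with an off-the-shelf translation model; the Helsinki-NLP/opus-mt-en-id checkpoint below is an assumption chosen for the English-to-Indonesian direction, not necessarily the MT system MuMUR uses.

```python
# Hedged sketch: machine-translate English captions so each video
# inherits pseudo ground-truth captions in another language.
from transformers import MarianMTModel, MarianTokenizer

checkpoint = "Helsinki-NLP/opus-mt-en-id"  # assumed en->id MarianMT checkpoint
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

english_captions = ["a man is playing a guitar", "a cat jumps onto a table"]
batch = tokenizer(english_captions, return_tensors="pt", padding=True)
generated = model.generate(**batch)
indonesian_captions = tokenizer.batch_decode(generated, skip_special_tokens=True)
# Each (video, translated caption) pair then serves as pseudo ground truth
# for learning the joint vision-text embedding space.
```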
arXiv Detail & Related papers (2022-08-24T13:55:15Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
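To make VTLM concrete, here is a hedged sketch of the bilingual masking step, assuming conventional masked-LM choices (a 15% mask rate, the -100 ignore index); UC2's exact configuration may differ.

```python
# Illustrative sketch: captions in two languages for the same image are
# concatenated and randomly masked, so a masked word can be recovered
# from the image regions or from its translation.
import torch

MASK_ID, MASK_PROB = 103, 0.15  # generic masked-LM conventions, not UC2's values

def mask_caption_pair(src_ids, tgt_ids):
    tokens = torch.cat([src_ids, tgt_ids])  # e.g., English caption + translation
    labels = tokens.clone()
    mask = torch.rand(tokens.shape) < MASK_PROB
    tokens[mask] = MASK_ID
    labels[~mask] = -100  # only masked positions contribute to the loss
    return tokens, labels

en_ids = torch.tensor([7592, 2003, 1037, 3899])  # toy ids for an English caption
id_ids = torch.tensor([9034, 1188, 2290, 4731])  # toy ids for its translation
tokens, labels = mask_caption_pair(en_ids, id_ids)
# `tokens` plus image-region features would feed the cross-modal encoder;
# `labels` supervise masked-token prediction.
```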
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Improving Indonesian Text Classification Using Multilingual Language Model [0.0]
This paper investigates the effect of combining English and Indonesian data on building Indonesian text classification models.
The experiments showed that adding English data improves performance, especially when the amount of Indonesian data is small.
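A minimal sketch of that data-combination recipe, assuming a generic multilingual checkpoint (bert-base-multilingual-cased here) and a binary label space; the paper's exact model and classification tasks may differ.

```python
# Hedged sketch: append English examples to a small Indonesian training
# set and fine-tune one multilingual encoder on the mixture.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-multilingual-cased"  # assumed multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Indonesian examples augmented with English ones sharing the label space.
texts = ["filmnya sangat bagus", "pelayanannya mengecewakan",
         "the movie was great", "the service was disappointing"]
labels = torch.tensor([1, 0, 1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss
loss.backward()  # one step of mixed-language fine-tuning
```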
arXiv Detail & Related papers (2020-09-12T03:16:25Z)
- IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding [41.691861010118394]
We introduce the first-ever vast resource for training, evaluating, and benchmarking Indonesian natural language understanding tasks.
IndoNLU includes twelve tasks with different levels of complexity, ranging from single-sentence classification to sentence-pair sequence labeling.
The datasets for the tasks lie in different domains and styles to ensure task diversity.
We also provide a set of Indonesian pre-trained models (IndoBERT) trained from a large and clean Indonesian dataset Indo4B.
arXiv Detail & Related papers (2020-09-11T12:21:41Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.