TopicVD: A Topic-Based Dataset of Video-Guided Multimodal Machine Translation for Documentaries
- URL: http://arxiv.org/abs/2505.05714v1
- Date: Fri, 09 May 2025 01:31:02 GMT
- Title: TopicVD: A Topic-Based Dataset of Video-Guided Multimodal Machine Translation for Documentaries
- Authors: Jinze Lv, Jian Chen, Zi Long, Xianghua Fu, Yin Chen
- Abstract summary: We developed TopicVD, a topic-based dataset for multimodal machine translation of documentaries. To better capture the shared semantics between text and video, we propose an MMT model based on a cross-modal bidirectional attention module.
- Score: 3.4883174582955983
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Most existing multimodal machine translation (MMT) datasets are predominantly composed of static images or short video clips, lacking extensive video data across diverse domains and topics. As a result, they fail to meet the demands of real-world MMT tasks, such as documentary translation. In this study, we developed TopicVD, a topic-based dataset for video-supported multimodal machine translation of documentaries, aiming to advance research in this field. We collected video-subtitle pairs from documentaries and categorized them into eight topics, such as economy and nature, to facilitate research on domain adaptation in video-guided MMT. Additionally, we preserved their contextual information to support research on leveraging the global context of documentaries in video-guided MMT. To better capture the shared semantics between text and video, we propose an MMT model based on a cross-modal bidirectional attention module. Extensive experiments on the TopicVD dataset demonstrate that visual information consistently improves the performance of the NMT model in documentary translation. However, the MMT model's performance significantly declines in out-of-domain scenarios, highlighting the need for effective domain adaptation methods. Additionally, experiments demonstrate that global context can effectively improve translation performance. The dataset and our implementation are available at https://github.com/JinzeLv/TopicVD
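The paper does not spell out the internals of its cross-modal bidirectional attention module, but the general idea — letting subtitle tokens attend over video frames and video frames attend over subtitle tokens — can be sketched with plain scaled dot-product attention. The following is a minimal illustrative NumPy sketch, not the authors' implementation; all function names and the residual-fusion choice are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_queries, n_keys)
    return softmax(scores, axis=-1) @ values  # (n_queries, d)

def bidirectional_cross_modal_attention(text_feats, video_feats):
    """Fuse the two modalities in both directions.

    text_feats:  (n_tokens, d) subtitle token embeddings
    video_feats: (n_frames, d) video frame embeddings
    Returns video-aware text features and text-aware video features.
    """
    text_to_video = cross_attention(text_feats, video_feats, video_feats)
    video_to_text = cross_attention(video_feats, text_feats, text_feats)
    # Residual fusion keeps each modality's original signal (an assumption).
    return text_feats + text_to_video, video_feats + video_to_text

# Toy example: 5 subtitle tokens and 8 video frames, 16-dim features.
rng = np.random.default_rng(0)
text = rng.normal(size=(5, 16))
video = rng.normal(size=(8, 16))
fused_text, fused_video = bidirectional_cross_modal_attention(text, video)
print(fused_text.shape, fused_video.shape)  # (5, 16) (8, 16)
```

In a full MMT model the fused text representations would feed the translation decoder; here the sketch only shows how the two attention directions share semantics between the modalities.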
Related papers
- Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure [5.332290080594085]
Vision-Language Models (VLMs) can process visual and textual information in multiple formats. We suggest cost-effective strategies for generating summaries from text-heavy multimodal documents.
arXiv Detail & Related papers (2025-04-14T09:55:01Z) - MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions. We propose a systemic captioning framework, achieving various modality annotations over more than 27.1k hours of trailer videos. Our dataset potentially paves the way for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z) - Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems. Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives. To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z) - 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset [90.95948101052073]
We introduce 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese.
Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets.
Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets.
arXiv Detail & Related papers (2024-04-29T04:01:30Z) - Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the limited volume and quality of multimodal instruction-tuning data.
We present a novel alignment strategy, called Reinforcement Learning from AI Feedback (RLAIF), that employs a multimodal AI system to oversee itself.
Specifically, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z) - Video-Helpful Multimodal Machine Translation [36.9686296461948]
Existing multimodal machine translation (MMT) datasets consist of images and video captions or instructional video subtitles.
We introduce EVA (Extensive training set and Video-helpful evaluation set for Ambiguous subtitles translation), an MMT dataset containing 852k Japanese-English (Ja-En) parallel subtitle pairs and 520k Chinese-English (Zh-En) parallel subtitle pairs.
We propose SAFA, an MMT model based on the Selective Attention model with two novel methods: Frame attention loss and Ambiguity augmentation.
arXiv Detail & Related papers (2023-10-31T05:51:56Z) - Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., the Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z) - Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z) - VISA: An Ambiguous Subtitles Dataset for Visual Scene-Aware Machine Translation [24.99480715551902]
Existing multimodal machine translation (MMT) datasets consist of images and video captions or general subtitles, which rarely contain linguistic ambiguity.
We introduce VISA, a new dataset that consists of 40k Japanese-English parallel sentence pairs and corresponding video clips.
arXiv Detail & Related papers (2022-01-20T08:38:31Z) - Video Captioning with Guidance of Multimodal Latent Topics [123.5255241103578]
We propose a unified caption framework, M&M TGM, which mines multimodal topics in an unsupervised fashion from data.
Compared to pre-defined topics, the mined multimodal topics are more semantically and visually coherent.
The results from extensive experiments conducted on the MSR-VTT and Youtube2Text datasets demonstrate the effectiveness of our proposed model.
arXiv Detail & Related papers (2017-08-31T11:18:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.