Aligning Source Visual and Target Language Domains for Unpaired Video
Captioning
- URL: http://arxiv.org/abs/2211.12148v1
- Date: Tue, 22 Nov 2022 10:26:26 GMT
- Title: Aligning Source Visual and Target Language Domains for Unpaired Video
Captioning
- Authors: Fenglin Liu, Xian Wu, Chenyu You, Shen Ge, Yuexian Zou, Xu Sun
- Abstract summary: Training a supervised video captioning model requires coupled video-caption pairs.
We introduce the unpaired video captioning task, which aims to train models without coupled video-caption pairs in the target language.
- Score: 97.58101383280345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training a supervised video captioning model requires coupled video-caption
pairs. However, for many target languages, sufficient paired data are not
available. To this end, we introduce the unpaired video captioning task, which aims
to train models without coupled video-caption pairs in the target language. To
solve this task, a natural choice is a two-step pipeline system: first, a
video-to-pivot captioning model generates captions in a pivot language; then, a
pivot-to-target translation model translates the pivot captions into the target
language. However, in such a pipeline system, 1) visual information cannot reach
the translation model, producing visually irrelevant target captions; 2) errors in
the generated pivot captions propagate to the translation model, resulting in
disfluent target captions.
To address these problems, we propose the Unpaired Video Captioning with Visual
Injection system (UVC-VI). UVC-VI first introduces the Visual Injection Module
(VIM), which aligns source visual and target language domains to inject the
source visual information into the target language domain. Meanwhile, VIM
directly connects the encoder of the video-to-pivot model and the decoder of
the pivot-to-target model, allowing end-to-end inference by completely skipping
the generation of pivot captions. To enhance the cross-modality injection of
VIM, UVC-VI further introduces a pluggable video encoder, the Multimodal
Collaborative Encoder (MCE). Experiments show that UVC-VI outperforms
pipeline systems and exceeds several supervised systems. Furthermore, equipping
existing supervised systems with our MCE achieves relative margins of 4% and 7%
in CIDEr score over current state-of-the-art models on the benchmark MSVD
and MSR-VTT datasets, respectively.
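
For readers who find a concrete sketch helpful, the contrast between the two-step pipeline and the end-to-end visual-injection path described in the abstract can be summarized in PyTorch-style code. The module and function names below (e.g., VisualInjection, pipeline_inference) are hypothetical stand-ins: the sketch only illustrates where visual information flows, not the paper's actual VIM/MCE architectures or training objectives.

```python
import torch
import torch.nn as nn

class VisualInjection(nn.Module):
    """Hypothetical stand-in for the paper's Visual Injection Module (VIM):
    a learned projection that maps source-domain video features into the
    embedding space expected by the pivot-to-target decoder."""
    def __init__(self, d_visual: int, d_text: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_visual, d_text), nn.LayerNorm(d_text))

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, n_frames, d_visual) -> (batch, n_frames, d_text)
        return self.proj(video_feats)


def pipeline_inference(video, video_encoder, pivot_decoder, translation_model):
    """Two-step baseline: video -> pivot caption -> target caption.
    Visual information is dropped once the pivot caption is produced, and
    pivot-captioning errors propagate into the translation step."""
    visual_feats = video_encoder(video)
    pivot_caption = pivot_decoder.generate(visual_feats)   # pivot-language text
    return translation_model.translate(pivot_caption)      # target-language text


def end_to_end_inference(video, video_encoder, injection, target_decoder):
    """UVC-VI-style path: aligned visual features go straight to the
    pivot-to-target decoder, so no pivot caption is ever generated."""
    visual_feats = video_encoder(video)                      # (B, T, d_visual)
    injected = injection(visual_feats)                       # (B, T, d_text)
    return target_decoder.generate(encoder_states=injected)  # target-language text


if __name__ == "__main__":
    vim = VisualInjection(d_visual=2048, d_text=512)
    dummy_feats = torch.randn(2, 16, 2048)   # 2 clips, 16 frames each
    print(vim(dummy_feats).shape)            # torch.Size([2, 16, 512])
```

How the actual VIM aligns the two domains during training is not reproduced here; the sketch only shows that the end-to-end path bypasses pivot-caption generation entirely.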
Related papers
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing the translation in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of the parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- 3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation [13.622700558266658]
We propose using frozen pre-trained vision-language models (VLMs) as backbones, with a specific emphasis on enhancing cross-modal feature interaction.
First, we use a frozen convolutional CLIP backbone to generate feature-aligned vision and text features, alleviating the domain-gap issue (see the frozen-CLIP sketch after this list).
Second, we add more cross-modal feature fusion to the pipeline to better exploit multi-modal information.
arXiv Detail & Related papers (2024-06-07T11:15:03Z)
- Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach wherein visual prompts are concatenated with the weights of the FFN for visual knowledge injection (see the FFN-memory sketch after this list).
arXiv Detail & Related papers (2024-05-09T08:23:20Z)
- MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge [35.45809761628721]
Large-scale Vision-Language (VL) models have shown tremendous success in aligning representations across visual and text modalities.
We propose an unsupervised approach to tuning such models on video data for the best zero-shot action recognition performance.
Our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks.
arXiv Detail & Related papers (2023-03-15T20:17:41Z)
- OmniVL: One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), video-label (e.g., video action recognition) data together.
arXiv Detail & Related papers (2022-09-15T17:59:59Z)
- Make It Move: Controllable Image-to-Video Generation with Text Descriptions [69.52360725356601]
The TI2V task aims at generating videos from a static image and a text description.
To address these challenges, we propose a Motion Anchor-based video GEnerator (MAGE) with an innovative motion anchor structure.
Experiments verify the effectiveness of MAGE and show the appealing potential of the TI2V task.
arXiv Detail & Related papers (2021-12-06T07:00:36Z)
- Visual-aware Attention Dual-stream Decoder for Video Captioning [12.139806877591212]
The attention mechanism in current video captioning methods learns to assign a weight to each frame, guiding the decoder dynamically.
However, this may not explicitly model the correlation and temporal coherence of the visual features extracted from the sequence of frames.
We propose a new Visual-aware Attention (VA) model, which unifies the changes across temporal sequence frames with the words generated at the previous time step.
The effectiveness of the proposed Visual-aware Attention Dual-stream Decoder (VADD) is demonstrated.
arXiv Detail & Related papers (2021-10-16T14:08:20Z)
- End-to-End Dense Video Captioning with Parallel Decoding [53.34238344647624]
We propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC).
PDVC precisely segments the video into a number of event pieces based on a holistic understanding of the video content.
Experiments on ActivityNet Captions and YouCook2 show that PDVC is capable of producing high-quality captioning results.
arXiv Detail & Related papers (2021-08-17T17:39:15Z)
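
As a side note to the MeViS-track entry above, which extracts feature-aligned vision and text features from a frozen CLIP backbone, the following is a minimal sketch of that general idea using the Hugging Face transformers CLIP implementation. A ViT-based CLIP checkpoint is used purely as a stand-in (the cited work uses a convolutional CLIP), and the downstream segmentation pipeline is not reproduced.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP model and keep it frozen.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in model.parameters():
    p.requires_grad_(False)

frames = [Image.new("RGB", (224, 224)) for _ in range(4)]  # placeholder video frames
expression = ["the person walking to the left"]            # example motion expression

inputs = processor(text=expression, images=frames, return_tensors="pt", padding=True)

with torch.no_grad():
    # Both embeddings live in CLIP's shared space, i.e. they are already
    # "feature-aligned" across modalities.
    image_feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])

# Toy cross-modal interaction: cosine similarity between each frame and the expression.
image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
print((image_feats @ text_feats.T).squeeze(-1))  # one similarity score per frame
```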
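
Similarly, the memory-space visual prompting entry above describes injecting visual prompts into the FFN weights of a language model. Below is a rough, self-contained PyTorch illustration of the "key-value memory" view of the FFN, where per-sample visual features are projected into extra key/value slots; it sketches the concept only and is not the cited paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNWithVisualMemory(nn.Module):
    """Toy FFN that treats its weights as key-value memory and appends
    per-sample visual features as extra key/value slots (a rough
    illustration of memory-space visual prompting, not the cited
    paper's exact method)."""
    def __init__(self, d_model: int, d_ffn: int, d_visual: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ffn)    # rows act as "keys"
        self.w_out = nn.Linear(d_ffn, d_model)   # columns act as "values"
        # Projections that turn visual features into extra keys/values.
        self.to_key = nn.Linear(d_visual, d_model)
        self.to_value = nn.Linear(d_visual, d_model)

    def forward(self, x: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); visual: (batch, n_vis, d_visual)
        text_part = self.w_out(F.gelu(self.w_in(x)))
        vis_keys = self.to_key(visual)            # (batch, n_vis, d_model)
        vis_values = self.to_value(visual)        # (batch, n_vis, d_model)
        # Each token state also reads from the injected visual "memory slots".
        scores = F.gelu(torch.einsum("bsd,bkd->bsk", x, vis_keys))
        vis_part = torch.einsum("bsk,bkd->bsd", scores, vis_values)
        return text_part + vis_part

ffn = FFNWithVisualMemory(d_model=512, d_ffn=2048, d_visual=768)
tokens = torch.randn(2, 10, 512)    # 2 sequences of 10 token states
patches = torch.randn(2, 49, 768)   # 49 visual patch features per sample
print(ffn(tokens, patches).shape)   # torch.Size([2, 10, 512])
```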
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.