LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval
- URL: http://arxiv.org/abs/2207.04858v1
- Date: Mon, 11 Jul 2022 13:37:32 GMT
- Title: LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval
- Authors: Jinbin Bai, Chunhui Liu, Feiyue Ni, Haofan Wang, Mengying Hu, Xiaofeng
Guo, Lele Cheng
- Abstract summary: Video-text retrieval is a class of cross-modal representation learning problems.
We present a novel mechanism for learning the translation relationship from a source modality space $\mathcal{S}$ to a target modality space $\mathcal{T}$ without the need for a joint latent space.
- Score: 3.6570455823407957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-text retrieval is a class of cross-modal representation learning
problems, where the goal is to select, from a pool of candidate videos, the video
that corresponds to a given text query. The contrastive paradigm of
vision-language pretraining has shown promising success with large-scale datasets
and unified transformer architectures, and has demonstrated the power of a joint
latent space. Despite this, the intrinsic divergence between the visual and
textual domains is still far from being eliminated, and projecting different
modalities into a joint latent space may distort the information within each
individual modality. To
overcome the above issue, we present a novel mechanism for learning the
translation relationship from a source modality space $\mathcal{S}$ to a target
modality space $\mathcal{T}$ without the need for a joint latent space, which
bridges the gap between visual and textual domains. Furthermore, to keep cycle
consistency between translations, we adopt a cycle loss involving both forward
translations from $\mathcal{S}$ to the predicted target space $\mathcal{T'}$,
and backward translations from $\mathcal{T'}$ back to $\mathcal{S}$. Extensive
experiments conducted on MSR-VTT, MSVD, and DiDeMo datasets demonstrate the
superiority and effectiveness of our LaT approach compared with vanilla
state-of-the-art methods.
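To make the translation-with-cycle-consistency idea concrete, the sketch below pairs a forward translator from $\mathcal{S}$ to the predicted target space $\mathcal{T'}$ with a backward translator from $\mathcal{T'}$ back to $\mathcal{S}$, combining an MSE cycle term with an InfoNCE-style retrieval term computed in the target space. This is a minimal illustration under assumed choices: the two-layer MLP translators, the MSE cycle loss, the temperature, and the feature dimensions are placeholders, not the paper's exact architecture.

```python
# Minimal sketch of latent translation with a cycle-consistency loss.
# Assumes pre-extracted video/text features; all module shapes and loss
# choices below are illustrative, not taken from the LaT paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Translator(nn.Module):
    """Maps embeddings from one modality space to another (no joint space)."""

    def __init__(self, dim_in: int, dim_out: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, hidden), nn.ReLU(), nn.Linear(hidden, dim_out)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def lat_losses(video_feat, text_feat, s2t, t2s, temperature=0.05):
    """Forward translation S -> T', backward translation T' -> S, plus an
    InfoNCE-style retrieval term computed in the target (text) space."""
    pred_text = s2t(video_feat)            # forward: S -> T'
    cycled_video = t2s(pred_text)          # backward: T' -> S
    cycle_loss = F.mse_loss(cycled_video, video_feat)

    # Contrastive retrieval loss between translated video features and text features.
    v = F.normalize(pred_text, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    logits = v @ t.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    retrieval_loss = (
        F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)
    ) / 2
    return retrieval_loss, cycle_loss


if __name__ == "__main__":
    # Example with random features: batch of 8, both modalities 512-dimensional.
    v = torch.randn(8, 512)
    t = torch.randn(8, 512)
    s2t, t2s = Translator(512, 512), Translator(512, 512)
    ret, cyc = lat_losses(v, t, s2t, t2s)
    total = ret + 1.0 * cyc                # cycle-loss weight is a hyperparameter
    total.backward()
```

The design point the sketch preserves is that video and text features are never forced into a single shared space; instead, video features are translated into the text space for retrieval, and the cycle term penalizes translations that cannot be mapped back to the source.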
Related papers
- VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought [51.43082554363725]
We introduce VLM-R$^3$ (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to decide when additional visual evidence is needed.
Experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R$^3$ sets a new state of the art.
arXiv Detail & Related papers (2025-05-22T03:50:13Z)
- Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval [40.83470534691711]
Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries.
One popular approach involves utilizing machine translation (MT) to create pseudo-parallel data pairs.
We propose LECCR, a novel solution that incorporates a multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations.
arXiv Detail & Related papers (2024-09-30T05:25:51Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S$^2$RM to achieve high-quality cross-modality fusion.
It follows a three-step working strategy: language feature distribution, spatial semantic recurrent co-parsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z) - Exploring the Necessity of Visual Modality in Multimodal Machine Translation using Authentic Datasets [3.54128607634285]
We study the impact of the visual modality on translation efficacy by leveraging real-world translation datasets.
We find that the visual modality proves advantageous for the majority of authentic translation datasets.
Our results suggest that visual information serves a supplementary role in multimodal translation and can be substituted.
arXiv Detail & Related papers (2024-04-09T08:19:10Z)
- Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment [33.96363443363547]
Sign language translation (SLT) aims to convert continuous sign language videos into textual sentences.
We propose a novel framework based on a Conditional Variational Autoencoder for SLT (CV-SLT).
CV-SLT consists of two paths with two Kullback-Leibler divergences to regularize the outputs of the encoder and decoder.
arXiv Detail & Related papers (2023-12-25T08:20:40Z)
- Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z)
- Neural Machine Translation with Contrastive Translation Memories [71.86990102704311]
Retrieval-augmented Neural Machine Translation models have been successful in many translation scenarios.
We propose a new retrieval-augmented NMT to model contrastively retrieved translation memories that are holistically similar to the source sentence.
In the training phase, a Multi-TM contrastive learning objective is introduced to learn the salient features of each TM with respect to the target sentence.
arXiv Detail & Related papers (2022-12-06T17:10:17Z)
- Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo-parallel data with translated source sentences, but is applied to natural source sentences at inference.
The source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach, which simultaneously uses pseudo-parallel data {natural source, translated target} to mimic the inference scenario.
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
- Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos [96.85840365678649]
We tackle the problem of referring expression comprehension in videos with an elegant one-stage framework.
We enhance the single-frame grounding accuracy by semantic attention learning and improve the cross-frame grounding consistency.
Our model is also applicable to referring expression comprehension in images, illustrated by the improved performance on the RefCOCO dataset.
arXiv Detail & Related papers (2021-03-23T06:42:49Z)