Expertized Caption Auto-Enhancement for Video-Text Retrieval
- URL: http://arxiv.org/abs/2502.02885v3
- Date: Tue, 08 Apr 2025 15:45:28 GMT
- Title: Expertized Caption Auto-Enhancement for Video-Text Retrieval
- Authors: Baoyao Yang, Junxiang Chen, Wanyun Li, Wenbin Yao, Yang Zhou
- Abstract summary: This paper proposes an automatic caption enhancement method that improves expression quality and mitigates empiricism in augmented captions through self-learning. The method is entirely data-driven, which not only dispenses with heavy data collection and computation workload but also improves self-adaptability. It is validated by state-of-the-art results on various benchmarks, specifically achieving Top-1 recall accuracy of 68.5% on MSR-VTT, 68.1% on MSVD, and 62.0% on DiDeMo.
- Score: 10.250004732070494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-text retrieval has been stuck in the information mismatch caused by personalized and inadequate textual descriptions of videos. The substantial information gap between the two modalities hinders effective cross-modal representation alignment, resulting in ambiguous retrieval results. Although text rewriting methods have been proposed to broaden text expressions, the modality gap remains significant, as the text representation space is hardly expanded without sufficient semantic enrichment. Instead, this paper turns to enhancing the visual presentation, bringing video expression closer to textual representation via caption generation and thereby facilitating video-text matching. While multimodal large language models (mLLM) have shown a powerful capability to convert video content into text, carefully crafted prompts are essential to ensure the reasonableness and completeness of the generated captions. Therefore, this paper proposes an automatic caption enhancement method that improves expression quality and mitigates empiricism in augmented captions through self-learning. Additionally, an expertized caption selection mechanism is designed and introduced to customize augmented captions for each video, further exploring the utilization potential of caption augmentation. Our method is entirely data-driven, which not only dispenses with heavy data collection and computation workload but also improves self-adaptability by circumventing lexicon dependence and introducing personalized matching. The superiority of our method is validated by state-of-the-art results on various benchmarks, specifically achieving Top-1 recall accuracy of 68.5% on MSR-VTT, 68.1% on MSVD, and 62.0% on DiDeMo. Our code is publicly available at https://github.com/CaryXiang/ECA4VTR.
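The expertized caption selection described in the abstract (choosing a customized augmented caption per video) could, in a minimal form, look like the sketch below. The cosine-similarity scoring, the function names, and the toy embeddings are illustrative assumptions, not the paper's actual implementation, which should be consulted in the linked repository.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_expert_caption(video_emb: np.ndarray, caption_embs: np.ndarray):
    """Pick the augmented caption whose embedding best matches the video.

    This stands in for the paper's expertized selection mechanism:
    each candidate caption is scored against the video representation,
    and the highest-scoring one is kept for retrieval.
    """
    scores = [cosine(video_emb, c) for c in caption_embs]
    return int(np.argmax(scores)), scores

# Toy embeddings standing in for encoder outputs (hypothetical values).
video = np.array([1.0, 0.0, 0.5])
captions = np.array([
    [0.9, 0.1, 0.4],  # semantically close to the video
    [0.0, 1.0, 0.0],  # unrelated caption
    [0.5, 0.5, 0.5],  # partially related caption
])
best, scores = select_expert_caption(video, captions)
print(best)  # index of the best-matching candidate caption
```

In practice the embeddings would come from the retrieval model's video and text encoders, and the selected caption would be fused with (or substituted for) the raw video representation before matching against queries.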
Related papers
- The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning [89.64905703368255]
We propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning.
Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences.
arXiv Detail & Related papers (2025-03-31T03:00:19Z) - Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach [56.610806615527885]
A key challenge in text-video retrieval (TVR) is the information asymmetry between video and text.
This paper introduces a data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content.
We propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy.
arXiv Detail & Related papers (2024-08-14T01:24:09Z) - Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The current plain and simple text descriptions, together with the visual-only focus of language-video tasks, limit capacity in real-world natural language video retrieval.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z) - Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval [87.69394953339238]
We propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics in video-text retrieval.
We show our framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvement on MSR-VTT and DiDeMo respectively.
arXiv Detail & Related papers (2024-02-26T17:36:50Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z) - Does Video Summarization Require Videos? Quantifying the Effectiveness of Language in Video Summarization [37.09662541127891]
Video summarization remains a huge challenge in computer vision due to the size of the input videos to be summarized.
We propose an efficient, language-only video summarizer that achieves competitive accuracy with high data efficiency.
arXiv Detail & Related papers (2023-09-18T00:08:49Z) - In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated and unpaired data, in which training uses only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z) - CelebV-Text: A Large-Scale Facial Text-Video Dataset [91.22496444328151]
CelebV-Text is a large-scale, diverse, and high-quality dataset of facial text-video pairs.
CelebV-Text comprises 70,000 in-the-wild face video clips with diverse visual content, each paired with 20 texts generated using the proposed semi-automatic text generation strategy.
The superiority of CelebV-Text over other datasets is demonstrated via comprehensive statistical analysis of the videos, texts, and text-video relevance.
arXiv Detail & Related papers (2023-03-26T13:06:35Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z) - Enriching Video Captions With Contextual Text [9.994985014558383]
We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input.
We do not preprocess the text further, and let the model directly learn to attend over it.
arXiv Detail & Related papers (2020-07-29T08:58:52Z)