Related papers: Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions

Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions

URL: http://arxiv.org/abs/2503.05186v4
Date: Wed, 26 Mar 2025 02:09:24 GMT
Title: Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions
Authors: Chan Hur, Jeong-hun Hong, Dong-hun Lee, Dabin Kang, Semin Myeong, Sang-hyo Park, Hyeyoung Park,
Abstract summary: We propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, the narration.<n>The proposed NarVid exploits narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, 3) dual-modal matching score by adding query-video similarity and query-narration similarity.
Score: 3.9633773442108873
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent text-video retrieval, the use of additional captions from vision-language models has shown promising effects on the performance. However, existing models using additional captions often have struggled to capture the rich semantics, including temporal changes, inherent in the video. In addition, incorrect information caused by generative models can lead to inaccurate retrieval. To address these issues, we propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, the narration. The proposed NarVid exploits narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, 3) dual-modal matching score by adding query-video similarity and query-narration similarity, and 4) hard-negative loss to learn discriminative features from multiple perspectives using the two similarities from different views. Experimental results demonstrate that NarVid achieves state-of-the-art performance on various benchmark datasets.

Related papers

Multi-Modal interpretable automatic video captioning [1.9874264019909988]
We introduce a novel video captioning method trained with multi-modal contrastive loss. Our approach is designed to capture the dependency between these modalities, resulting in more accurate, thus pertinent captions.
arXiv Detail & Related papers (2024-11-11T11:12:23Z)
Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions. Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text. In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative [19.79736018383692]
Existing video captioning benchmarks and models lack causal-temporal narrative.<n>This lack of narrative restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content.<n>We propose NarrativeBridge, an approach comprising of: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark generated using a large language model and few-shot prompting; and (2) a Cause-Effect Network (CEN) with separate encoders for capturing cause and effect dynamics.
arXiv Detail & Related papers (2024-06-10T17:34:24Z)
RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives [58.15403987979496]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework. Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. The proposed framework demonstrates impressive versatile capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z)
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant text/video higher than irrelevant ones. Recent contrastive learning methods have shown promising results for text-video retrieval. This paper improves contrastive learning using two novel techniques.
arXiv Detail & Related papers (2023-09-20T06:08:11Z)
Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion [6.096411752534632]
The application of video captioning models aims at translating content of videos by using accurate natural language. Existing methods often fail in generating sufficient feature representations of video content. We propose a video captioning model based on dual graphs and gated fusion.
arXiv Detail & Related papers (2023-08-13T05:18:08Z)
InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions. Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound. Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z)
Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately. We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens. Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z)
Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented in frame sequence, the visual elements are not sequential but rather hierarchical in semantic space. We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
Towards Diverse Paragraph Captioning for Untrimmed Videos [40.205433926432434]
Existing approaches mainly solve the problem in two steps: event detection and then event captioning. We propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos.
arXiv Detail & Related papers (2021-05-30T09:28:43Z)
Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations. For training and evaluation, we contribute a new dataset ApartmenTour' that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.