GPTSee: Enhancing Moment Retrieval and Highlight Detection via
Description-Based Similarity Features
- URL: http://arxiv.org/abs/2403.01437v2
- Date: Sun, 10 Mar 2024 09:56:22 GMT
- Title: GPTSee: Enhancing Moment Retrieval and Highlight Detection via
Description-Based Similarity Features
- Authors: Yunzhuo Sun, Yifang Xu, Zien Xie, Yukun Shu, and Sidan Du
- Abstract summary: Moment retrieval (MR) and highlight detection (HD) aim to identify relevant moments and highlights in a video from a natural language query.
Existing methods for MR&HD have not yet been integrated with large language models.
We propose a novel two-stage model that takes the output of LLMs as the input to the second-stage transformer encoder-decoder.
- Score: 1.614471032380076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Moment retrieval (MR) and highlight detection (HD) aim to identify relevant
moments and highlights in a video from a corresponding natural language query.
Large language models (LLMs) have demonstrated proficiency in various computer
vision tasks. However, existing methods for MR\&HD have not yet been integrated
with LLMs. In this letter, we propose a novel two-stage model that takes the
output of LLMs as the input to the second-stage transformer encoder-decoder.
First, MiniGPT-4 is employed to generate detailed descriptions of the video
frames and to rewrite the query statement, which are fed into the encoder as new
features. Then, semantic similarity is computed between the generated
descriptions and the rewritten queries. Finally, continuous high-similarity video frames are
converted into span anchors, serving as prior position information for the
decoder. Experiments demonstrate that our approach achieves state-of-the-art
results and that, even using only span anchors and similarity scores as outputs,
its positioning accuracy outperforms that of traditional methods such as Moment-DETR.
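The second-stage prior described above can be pictured concretely: each MiniGPT-4 frame description is scored against the rewritten query, and runs of consecutive high-similarity frames become span anchors. Below is a minimal sketch of this idea, not the authors' implementation; the sentence encoder (all-MiniLM-L6-v2), the cosine-similarity scoring, and the 0.6 threshold are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch (not the GPTSee authors' code): turn frame-description /
# rewritten-query similarities into span anchors, as outlined in the abstract.
# Assumes frame descriptions were already produced upstream (e.g. by MiniGPT-4).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def similarity_scores(frame_descriptions, rewritten_query):
    """Cosine similarity between each frame description and the rewritten query."""
    desc_emb = encoder.encode(frame_descriptions, convert_to_tensor=True)
    query_emb = encoder.encode([rewritten_query], convert_to_tensor=True)
    return util.cos_sim(desc_emb, query_emb).squeeze(1)  # shape: (num_frames,)

def spans_from_scores(scores, threshold=0.6):
    """Group consecutive frames whose similarity exceeds the (hypothetical)
    threshold into (start, end) span anchors; end index is exclusive."""
    spans, start = [], None
    for i, s in enumerate(scores.tolist()):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(scores)))
    return spans

# Toy example: span anchors for a clip of five described frames.
descs = ["a man dribbles a basketball", "he shoots at the hoop",
         "crowd cheering in the stands", "players walk off the court",
         "a man dunks the basketball"]
scores = similarity_scores(descs, "moments where someone scores a basket")
print(spans_from_scores(scores))
```

In the paper these span anchors and similarity scores then serve as positional priors for the second-stage transformer decoder, rather than as the final prediction.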
Related papers
- LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection [8.24662649122549]
Video Moment Retrieval and Highlight Detection aim to find corresponding content in the video based on a text query.
Existing models usually first use contrastive learning methods to align video and text features, then fuse and extract multimodal information, and finally use a Transformer Decoder to decode multimodal information.
We propose the LD-DETR model for Video Moment Retrieval and Highlight Detection tasks.
arXiv Detail & Related papers (2025-01-18T14:54:56Z) - VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval [8.908777234657046]
Large language and vision-language models (LLMs/LVLMs) have gained prominence across various domains.
Here we propose VideoLights, a novel HD/MR framework that addresses the limitations of existing approaches through components such as Convolutional Projection and Feature Refinement modules.
Comprehensive experiments on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2024-12-02T14:45:53Z) - When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding.
During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes.
Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z) - SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval [82.51117533271517]
Previous works typically only encode RGB videos to obtain high-level semantic features.
Existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training.
We propose a novel sign language representation framework called the Semantically Enhanced Dual-Stream Encoder (SEDS).
arXiv Detail & Related papers (2024-07-23T11:31:11Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (a minimal sketch of this idea appears after this list).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual
Grounding [27.568879624013576]
The multimodal transformer exhibits high capacity and flexibility in aligning image and text for visual grounding.
Existing encoder-only grounding frameworks suffer from heavy computation due to the self-attention operation, which has quadratic time complexity.
We present Dynamic Multimodal DETR (Dynamic MDETR), which decouples the whole grounding process into encoding and decoding phases.
arXiv Detail & Related papers (2022-09-28T09:43:02Z) - Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z) - Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor
Segmentation [90.74732705236336]
Language-queried video actor segmentation aims to predict the pixel-mask of the actor which performs the actions described by a natural language query in the target frames.
We propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors.
arXiv Detail & Related papers (2021-05-14T13:27:53Z) - Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines that adopt cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
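As referenced in the VaQuitA entry above, a minimal sketch of CLIP-score-ranked frame sampling follows. This is an assumption about the general idea, not the VaQuitA authors' code: each candidate frame is scored against the text query with an off-the-shelf CLIP model and only the top-k frames are kept, replacing uniform sampling. The checkpoint name and k are illustrative choices.

```python
# Minimal sketch (assumed, not from the VaQuitA paper): CLIP-score-guided frame
# sampling -- score every candidate frame against the query and keep the top-k.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def sample_frames_by_clip_score(frames, query, k=8):
    """frames: list of PIL.Image candidate frames extracted from the video.
    Returns indices (in temporal order) of the k frames whose CLIP image-text
    similarity to the query is highest."""
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(1)  # (num_frames,)
    topk = torch.topk(scores, k=min(k, len(frames))).indices
    return sorted(topk.tolist())
```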