GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient
Partially Relevant Video Retrieval
- URL: http://arxiv.org/abs/2310.05195v2
- Date: Wed, 3 Jan 2024 07:40:15 GMT
- Title: GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient
Partially Relevant Video Retrieval
- Authors: Yuting Wang, Jinpeng Wang, Bin Chen, Ziyun Zeng, Shu-Tao Xia
- Abstract summary: Given a text query, partially relevant video retrieval (PRVR) seeks to find videos containing pertinent moments in a database.
This paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models clip representations implicitly.
Experiments on three large-scale video datasets demonstrate the superiority and efficiency of GMMFormer.
- Score: 59.47258928867802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given a text query, partially relevant video retrieval (PRVR) seeks to find
untrimmed videos containing pertinent moments in a database. For PRVR, clip
modeling is essential to capture the partial relationship between texts and
videos. Current PRVR methods adopt scanning-based clip construction to achieve
explicit clip modeling, which is information-redundant and requires a large
storage overhead. To solve the efficiency problem of PRVR methods, this paper
proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models
clip representations implicitly. During frame interactions, we incorporate
Gaussian-Mixture-Model constraints to focus each frame on its adjacent frames
instead of the whole video. The generated representations then contain
multi-scale clip information, achieving implicit clip modeling. In addition,
current PRVR methods ignore the semantic differences between text queries
relevant to the same video, which leads to a sparse embedding space. We propose
a query diverse loss to distinguish these text queries, making the embedding
space denser and semantically richer. Extensive experiments on three
large-scale video datasets (i.e., TVR, ActivityNet Captions, and Charades-STA)
demonstrate the superiority and efficiency of GMMFormer. Code is available at
https://github.com/huangmozhi9527/GMMFormer.
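The core mechanism described in the abstract, constraining frame interactions with Gaussian windows so that each frame attends mainly to its temporal neighbors at several scales, can be sketched in PyTorch as below. This is a minimal illustration under assumptions, not the released implementation: the module name GaussianBlock, the sigma values, and the mean fusion of the multi-scale outputs are hypothetical, and details such as multi-head attention and block aggregation may differ in the repository linked above.

```python
import torch
import torch.nn as nn


class GaussianBlock(nn.Module):
    """Frame self-attention with a Gaussian prior on temporal distance (sketch).

    Each frame attends mainly to its neighbors; several Gaussian widths
    (sigmas) produce multi-scale, implicitly clip-level representations.
    """

    def __init__(self, dim: int, sigmas=(1.0, 4.0, 16.0)):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.sigmas = sigmas
        self.scale = dim ** -0.5

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim)
        T = frames.size(1)
        q, k, v = self.qkv(frames).chunk(3, dim=-1)
        logits = torch.einsum("btd,bsd->bts", q, k) * self.scale

        # Squared temporal distance between frame positions.
        pos = torch.arange(T, device=frames.device, dtype=frames.dtype)
        dist2 = (pos[:, None] - pos[None, :]) ** 2                   # (T, T)

        outs = []
        for sigma in self.sigmas:
            # Additive log-Gaussian bias: nearby frames get larger attention weights.
            attn = torch.softmax(logits - dist2 / (2 * sigma ** 2), dim=-1)
            outs.append(torch.einsum("bts,bsd->btd", attn, v))

        # Fuse the multi-scale outputs (simple mean here; the paper may aggregate differently).
        return self.proj(torch.stack(outs).mean(dim=0))


# Usage sketch: 32 frames of 256-d features for a batch of 2 videos.
video_frames = torch.randn(2, 32, 256)
clip_aware = GaussianBlock(dim=256)(video_frames)                    # (2, 32, 256)
```

The additive log-Gaussian bias inside the softmax is one natural way to realize "focus each frame on its adjacent frames instead of the whole video"; multiplying the attention map by Gaussian windows would be an alternative realization.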
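The query diverse loss can likewise be sketched as a penalty on pairs of text queries that describe the same video but sit too close in the embedding space. Again a hedged sketch: the hinge form, the margin value, and the function name are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def query_diverse_loss(text_emb: torch.Tensor, video_ids: torch.Tensor,
                       margin: float = 0.2) -> torch.Tensor:
    """Push apart text queries that are relevant to the same video (sketch).

    text_emb:  (num_queries, dim) L2-normalized query embeddings.
    video_ids: (num_queries,) id of the video each query describes.
    """
    sim = text_emb @ text_emb.t()                        # pairwise cosine similarity
    same_video = video_ids[:, None] == video_ids[None, :]
    same_video &= ~torch.eye(len(video_ids), dtype=torch.bool,
                             device=text_emb.device)     # drop self-pairs
    if not same_video.any():
        return text_emb.new_zeros(())
    # Hinge penalty on same-video query pairs that are more similar than the margin.
    return F.relu(sim[same_video] - margin).mean()


# Usage sketch: four queries, the first two describing the same video.
queries = F.normalize(torch.randn(4, 256), dim=-1)
ids = torch.tensor([0, 0, 1, 2])
loss = query_diverse_loss(queries, ids)
```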
Related papers
- VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding [44.382937324454254]
Existing Video Corpus Moment Retrieval (VCMR) is limited to coarse-grained understanding.
We propose a more challenging fine-grained VCMR setting that requires methods to localize the best-matched moment from the corpus.
With VERIFIED, we construct fine-grained benchmarks comprising Charades-FIG, DiDeMo-FIG, and ActivityNet-FIG.
arXiv Detail & Related papers (2024-10-11T07:42:36Z)
- VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding [15.959757105308238]
Video LMMs rely on either image or video encoders to process visual inputs, each of which has its own limitations.
We introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling).
Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and Zero-shot question-answering.
arXiv Detail & Related papers (2024-06-13T17:59:59Z)
- GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval [60.70901959953688]
We present GMMFormer v2, an uncertainty-aware framework for PRVR.
For clip modeling, we improve a strong baseline, GMMFormer, with a novel temporal consolidation module.
We propose a novel optimal matching loss for fine-grained text-clip alignment.
arXiv Detail & Related papers (2024-05-22T16:55:31Z)
- Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to around 7% in Recall@K=1.
arXiv Detail & Related papers (2024-03-25T17:59:03Z)
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
- Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM.
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z)
- Partially Relevant Video Retrieval [39.747235541498135]
We propose a novel text-to-video retrieval (T2VR) subtask termed Partially Relevant Video Retrieval (PRVR).
PRVR aims to retrieve partially relevant videos from a large collection of untrimmed videos.
We formulate PRVR as a multiple instance learning (MIL) problem, where a video is simultaneously viewed as a bag of video clips and a bag of video frames.
arXiv Detail & Related papers (2022-08-26T09:07:16Z)
- Multi-Attention Network for Compressed Video Referring Object Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression.
Existing works typically require compressed video bitstream to be decoded to RGB frames before being segmented.
This may hamper its application in real-world scenarios with limited computing resources, such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z)
- Smoothed Gaussian Mixture Models for Video Classification and Recommendation [10.119117405418868]
We propose a new cluster-and-aggregate method which we call smoothed Gaussian mixture model (SGMM).
We show, through extensive experiments on the YouTube-8M classification task, that SGMM/DSGMM is consistently better than VLAD/NetVLAD by a small but statistically significant margin.
arXiv Detail & Related papers (2020-12-17T06:52:41Z)
- Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.