SCANet: Scene Complexity Aware Network for Weakly-Supervised Video
Moment Retrieval
- URL: http://arxiv.org/abs/2310.05241v1
- Date: Sun, 8 Oct 2023 17:19:58 GMT
- Title: SCANet: Scene Complexity Aware Network for Weakly-Supervised Video
Moment Retrieval
- Authors: Sunjae Yoon, Gwanhyeong Koo, Dahyun Kim, Chang D. Yoo
- Abstract summary: Video moment retrieval aims to localize moments in video corresponding to a given language query.
We present a novel concept of a retrieval system referred to as Scene Complexity Aware Network (SCANet).
SCANet measures the `scene complexity' of the multiple scenes in each video and generates adaptive proposals that respond to the varying complexity of each video.
- Score: 27.68871220534595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video moment retrieval aims to localize moments in video corresponding to a
given language query. To avoid the expensive cost of annotating the temporal
moments, weakly-supervised VMR (wsVMR) systems have been studied. For such
systems, generating a number of proposals as moment candidates and then
selecting the most appropriate proposal has been a popular approach. These
proposals are assumed to contain many distinguishable scenes in a video as
candidates. However, the proposals of existing wsVMR systems do not respect the
varying number of scenes in each video; they are heuristically determined
irrespective of the video. We argue that the retrieval system should
be able to counter the complexities caused by varying numbers of scenes in each
video. To this end, we present a novel concept of a retrieval system referred
to as Scene Complexity Aware Network (SCANet), which measures the `scene
complexity' of multiple scenes in each video and generates adaptive proposals
responding to the variable complexities of scenes in each video. Experimental
results on three retrieval benchmarks (i.e., Charades-STA, ActivityNet, TVR)
show state-of-the-art performance and demonstrate the effectiveness of
incorporating scene complexity.
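To make the idea concrete, the following is a minimal sketch of complexity-adaptive proposal generation, assuming scene complexity is approximated by feature diversity among clips; both the complexity measure and the random proposal scheme are illustrative stand-ins, not the authors' implementation.

import numpy as np

def scene_complexity(clip_feats):
    # Proxy: 1 - mean pairwise cosine similarity of clip features.
    # Diverse clips (many distinguishable scenes) -> high complexity.
    f = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    sim = f @ f.T
    n = len(f)
    mean_off_diag = (sim.sum() - np.trace(sim)) / (n * (n - 1))
    return float(np.clip(1.0 - mean_off_diag, 0.0, 1.0))

def adaptive_proposals(clip_feats, min_k=8, max_k=64, seed=0):
    # More complex videos receive a larger proposal budget.
    n = len(clip_feats)
    k = int(round(min_k + scene_complexity(clip_feats) * (max_k - min_k)))
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, n - 1, size=k)
    lengths = rng.integers(1, n, size=k)
    return [(int(s), int(min(n, s + l))) for s, l in zip(starts, lengths)]

feats = np.random.default_rng(1).normal(size=(20, 512))   # 20 toy clips
print(len(adaptive_proposals(feats)), "proposals")

The only point illustrated is that the proposal budget grows with the estimated number of distinguishable scenes, which is the behavior the abstract describes.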
Related papers
- A Flexible and Scalable Framework for Video Moment Search [51.47907684209207]
This paper introduces a flexible framework for retrieving a ranked list of moments from a collection of videos of any length to match a text query.
Our framework, called Segment-Proposal-Ranking (SPR), simplifies the search process into three independent stages: segment retrieval, proposal generation, and moment refinement with re-ranking.
Evaluations on the TVR-Ranking dataset demonstrate that our framework achieves state-of-the-art performance with significant reductions in computational cost and processing time.
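Read as a sketch, the three independent stages compose into a simple search pipeline; the stage functions below are hypothetical placeholders rather than the SPR implementation.

from typing import Callable, List, Tuple

Moment = Tuple[str, float, float, float]   # (video_id, start, end, score)

def spr_search(query: str,
               retrieve_segments: Callable[[str], List[Tuple[str, float, float]]],
               propose: Callable[[str, str, float, float], List[Moment]],
               refine_and_rerank: Callable[[str, List[Moment]], List[Moment]],
               top_k: int = 10) -> List[Moment]:
    candidates: List[Moment] = []
    # Stage 1: coarse segment retrieval over the whole corpus.
    for vid, seg_start, seg_end in retrieve_segments(query):
        # Stage 2: moment proposals inside each retrieved segment.
        candidates.extend(propose(query, vid, seg_start, seg_end))
    # Stage 3: boundary refinement and re-ranking of pooled candidates.
    return refine_and_rerank(query, candidates)[:top_k]

Keeping the stages independent is what allows each to be swapped or scaled separately, which is the efficiency argument the summary makes.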
arXiv Detail & Related papers (2025-01-09T08:54:19Z)
- Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning [56.873534081386]
A new benchmark, HIREST, is presented, comprising video retrieval, moment retrieval, moment segmentation, and step-captioning.
We propose a query-centric audio-visual cognition network to construct a reliable multi-modal representation for the three tasks.
This network identifies user-preferred content and thus attains a query-centric audio-visual representation for those tasks.
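A minimal sketch of query-conditioned fusion under simple assumptions: dot-product attention pools each modality around the query, and the pooled summaries are reweighted by query agreement. This is an illustrative stand-in, not the paper's cognition module.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def query_centric_fusion(q, V, A):
    # Query-attended summaries of each modality.
    v_ctx = softmax(V @ q) @ V          # V: [Tv, d] visual tokens
    a_ctx = softmax(A @ q) @ A          # A: [Ta, d] audio tokens
    # Weight each modality by its agreement with the query.
    w = softmax(np.array([v_ctx @ q, a_ctx @ q]))
    return w[0] * v_ctx + w[1] * a_ctx  # fused [d] representation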
arXiv Detail & Related papers (2024-12-18T06:43:06Z)
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
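A minimal sketch of the gating idea, assuming one scalar sigmoid gate per modality computed from the query; the closed-form gates and names are hypothetical, whereas the paper's localizer is a learned network.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def focus_then_fuse(query_vec, modality_feats, gate_weights):
    # 'Focus': a query-conditioned gate per modality suppresses or
    # keeps that modality's features; 'fuse': sum the gated features.
    fused = np.zeros_like(next(iter(modality_feats.values())))
    for name, feats in modality_feats.items():
        gate = sigmoid(query_vec @ gate_weights[name])   # scalar in (0,1)
        fused += gate * feats
    return fused

# toy usage with hypothetical visual/subtitle features
d = 16
rng = np.random.default_rng(0)
q = rng.normal(size=d)
feats = {"visual": rng.normal(size=d), "subtitle": rng.normal(size=d)}
gates = {"visual": rng.normal(size=d), "subtitle": rng.normal(size=d)}
print(focus_then_fuse(q, feats, gates).shape)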
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
- CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval [24.649068267308913]
Video retrieval applications should enable users to retrieve a precise moment from a large video corpus.
We propose a novel model for effective moment localization and ranking.
We conduct studies on two datasets, TVR for closed-world TV episodes and DiDeMo for open-world user-generated videos.
arXiv Detail & Related papers (2021-09-21T08:07:27Z)
- DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS.
DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence.
We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
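As an illustration, the reinforcement-learning reward can be read as a weighted sum of the four terms the summary names; the term functions and weights below are placeholders, not the paper's definitions.

def composite_reward(summary, query, video_pool,
                     rep_fn, div_fn, query_fn, coh_fn,
                     weights=(0.25, 0.25, 0.25, 0.25)):
    # Each *_fn maps the candidate summary to a scalar score.
    terms = (rep_fn(summary, video_pool),   # representativeness
             div_fn(summary),               # diversity
             query_fn(summary, query),      # query-adaptability
             coh_fn(summary))               # temporal coherence
    return sum(w * t for w, t in zip(weights, terms))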
arXiv Detail & Related papers (2021-05-13T17:33:26Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) is the task of retrieving a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Although ReLoCLNet encodes text and video separately for efficiency, experimental results show that its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
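A generic symmetric InfoNCE objective in the spirit of separate-encoder contrastive training, sketched here for a batch of paired text/video embeddings (not the paper's exact loss):

import numpy as np

def log_softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def info_nce(text_emb, video_emb, tau=0.07):
    # Normalize, then contrast matched pairs against in-batch negatives.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / tau            # [B, B] similarity matrix
    diag = np.arange(len(t))
    loss_t2v = -log_softmax(logits)[diag, diag].mean()
    loss_v2t = -log_softmax(logits.T)[diag, diag].mean()
    return 0.5 * (loss_t2v + loss_v2t)

Because the two encoders never attend to each other, video embeddings can be pre-computed offline, which is where the efficiency claim comes from.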
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER), which encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
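A minimal sketch of the two-granularity encoding, assuming coarse clip features are mean-pools of fine frame features; HAMMER itself learns encoders at both levels.

import numpy as np

def hierarchical_encode(frame_feats, clip_len=16):
    # frame_feats: [n, d] fine-grained frame features.
    # Coarse level: pool consecutive frames into clip features.
    n = len(frame_feats)
    clips = [frame_feats[i:i + clip_len].mean(axis=0)
             for i in range(0, n, clip_len)]
    return np.stack(clips), frame_feats   # (coarse clips, fine frames)

The coarse level supports fast corpus-scale matching while the fine level supports precise boundary localization.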
arXiv Detail & Related papers (2020-11-18T02:42:36Z)
- VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval [21.189093631175425]
Video Moment Retrieval (VMR) is the task of localizing the temporal moment in an untrimmed video specified by a natural language query.
This paper explores methods for performing VMR in a weakly-supervised manner (wVMR).
The experiments show that the method achieves state-of-the-art performance on Charades-STA and DiDeMo datasets.
arXiv Detail & Related papers (2020-08-24T07:54:59Z)
- A Local-to-Global Approach to Multi-modal Movie Scene Segmentation [95.34033481442353]
We build a large-scale video dataset MovieScenes, which contains 21K annotated scene segments from 150 movies.
We propose a local-to-global scene segmentation framework, which integrates multi-modal information across three levels, i.e. clip, segment, and movie.
Our experiments show that the proposed network is able to segment a movie into scenes with high accuracy, consistently outperforming previous methods.
arXiv Detail & Related papers (2020-04-06T13:58:08Z)
- Deep Multimodal Feature Encoding for Video Ordering [34.27175264084648]
We present a way to learn a compact multimodal feature representation that encodes a video's constituent modalities.
Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline.
We analyze and evaluate the individual and joint modalities on two challenging tasks: (i) inferring the temporal ordering of a set of videos; and (ii) action recognition.
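A sketch of the ordering proxy objective, assuming a hypothetical learned scorer over candidate orderings; the factorial enumeration is for clarity only and is avoided in practice.

import numpy as np
from itertools import permutations

def ordering_loss(clip_embs, true_order, score_fn):
    # Score every permutation of the shuffled clips, then apply
    # cross-entropy against the true ordering. Keep the number of
    # clips small; real models avoid this enumeration.
    orders = list(permutations(range(len(clip_embs))))
    scores = np.array([score_fn([clip_embs[i] for i in o]) for o in orders])
    scores -= scores.max()                        # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[orders.index(tuple(true_order))])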
arXiv Detail & Related papers (2020-04-05T14:02:23Z)