Cross-modal Contrastive Learning with Asymmetric Co-attention Network
for Video Moment Retrieval
- URL: http://arxiv.org/abs/2312.07435v1
- Date: Tue, 12 Dec 2023 17:00:46 GMT
- Title: Cross-modal Contrastive Learning with Asymmetric Co-attention Network
for Video Moment Retrieval
- Authors: Love Panta, Prashant Shrestha, Brabeem Sapkota, Amrita Bhattarai,
Suresh Manandhar, Anand Kumar Sah
- Abstract summary: Video moment retrieval is a challenging task requiring fine-grained interactions between video and text modalities.
Recent work in image-text pretraining has demonstrated that most existing pretrained models suffer from information asymmetry due to the difference in length between visual and textual sequences.
We question whether the same problem also exists in the video-text domain with an auxiliary need to preserve both spatial and temporal information.
- Score: 0.17590081165362778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video moment retrieval is a challenging task requiring fine-grained
interactions between video and text modalities. Recent work in image-text
pretraining has demonstrated that most existing pretrained models suffer from
information asymmetry due to the difference in length between visual and
textual sequences. We question whether the same problem also exists in the
video-text domain with an auxiliary need to preserve both spatial and temporal
information. Thus, we evaluate a recently proposed solution involving the
addition of an asymmetric co-attention network for video grounding tasks.
Additionally, we incorporate momentum contrastive loss for robust,
discriminative representation learning in both modalities. We note that the
integration of these supplementary modules yields better performance compared
to state-of-the-art models on the TACoS dataset and comparable results on
ActivityNet Captions, all while utilizing significantly fewer parameters with
respect to the baseline.
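To make the momentum contrastive component above concrete, below is a minimal sketch of a MoCo-style cross-modal InfoNCE objective between video and text features. It assumes PyTorch; the function names, feature shapes, queue handling, temperature, and momentum coefficient are illustrative choices, not the authors' released implementation.

```python
# Sketch of a momentum (MoCo-style) cross-modal contrastive objective
# between video and text features. All names and shapes are illustrative
# assumptions, not the paper's released code.
import torch
import torch.nn.functional as F


def momentum_update(online_enc, momentum_enc, m=0.999):
    """EMA update of the momentum (key) encoder from the online (query) encoder."""
    with torch.no_grad():
        for p_o, p_m in zip(online_enc.parameters(), momentum_enc.parameters()):
            p_m.data.mul_(m).add_(p_o.data, alpha=1.0 - m)


def cross_modal_nce(video_q, text_k, text_queue, tau=0.07):
    """InfoNCE loss: video queries scored against their momentum text key
    (positive) and a queue of text features from past batches (negatives)."""
    video_q = F.normalize(video_q, dim=-1)        # (B, D) online video features
    text_k = F.normalize(text_k, dim=-1)          # (B, D) momentum text features
    text_queue = F.normalize(text_queue, dim=-1)  # (K, D) negative text features

    l_pos = (video_q * text_k).sum(dim=-1, keepdim=True)  # (B, 1)
    l_neg = video_q @ text_queue.t()                       # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```

In practice the loss would typically be applied symmetrically (text queries against momentum video keys as well), and the negative queue would be refreshed first-in-first-out with the momentum encoder's outputs after each training step; those details are omitted here.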
Related papers
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z)
- Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from arbitrary VLMs to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z)
- Surgical Skill Assessment via Video Semantic Aggregation [20.396898001950156]
We propose a skill assessment framework, Video Semantic Aggregation (ViSA), which discovers different semantic parts and aggregates them across spatiotemporal dimensions.
The explicit discovery of semantic parts provides an explanatory visualization that helps understand the neural network's decisions.
The experiments on two datasets show the competitiveness of ViSA compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-08-04T12:24:01Z)
- You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos [19.711703590063976]
We propose a novel Multi-Granularity Perception Network (MGPN) that perceives intra-modality and inter-modality information at a multi-granularity level.
Specifically, we formulate moment retrieval as a multi-choice reading comprehension task and integrate human reading strategies into our framework.
arXiv Detail & Related papers (2022-05-25T16:15:46Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame of the current time-stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- SSAN: Separable Self-Attention Network for Video Representation Learning [11.542048296046524]
We propose a separable self-attention (SSA) module, which models spatial and temporal correlations sequentially (see the illustrative sketch after this list).
By adding SSA module into 2D CNN, we build a SSA network (SSAN) for video representation learning.
Our approach outperforms state-of-the-art methods on Something-Something and Kinetics-400 datasets.
arXiv Detail & Related papers (2021-05-27T10:02:04Z)
- Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos [96.85840365678649]
We tackle the problem of referring expression comprehension in videos with an elegant one-stage framework.
We enhance the single-frame grounding accuracy by semantic attention learning and improve the cross-frame grounding consistency.
Our model is also applicable to referring expression comprehension in images, illustrated by the improved performance on the RefCOCO dataset.
arXiv Detail & Related papers (2021-03-23T06:42:49Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
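As referenced in the SSAN entry above, the following is a rough sketch of a separable self-attention block that factorizes attention into a spatial pass within each frame followed by a temporal pass across frames. The module name, head count, and (batch, frames, tokens, channels) tensor layout are assumptions for illustration, not the SSAN paper's code.

```python
# Illustrative separable self-attention: spatial attention within each frame,
# then temporal attention across frames. Shapes and layer sizes are assumptions.
import torch
import torch.nn as nn


class SeparableSelfAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, D) -- batch, frames, spatial tokens per frame, channels
        b, t, n, d = x.shape

        # Spatial pass: attend over the N tokens of each frame independently.
        xs = x.reshape(b * t, n, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = xs.reshape(b, t, n, d)

        # Temporal pass: attend over the T frames at each spatial location.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        return xt.reshape(b, n, t, d).permute(0, 2, 1, 3)


if __name__ == "__main__":
    feats = torch.randn(2, 8, 49, 256)  # 2 clips, 8 frames, 7x7 tokens, 256-dim
    out = SeparableSelfAttention(256)(feats)
    print(out.shape)  # torch.Size([2, 8, 49, 256])
```

Factorizing the two passes keeps the attention cost roughly O(T·N² + N·T²) instead of O((T·N)²) for joint spatio-temporal attention, which is typically the motivation for such separable designs.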
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences of its use.