Multi-Pair Temporal Sentence Grounding via Multi-Thread Knowledge Transfer Network
- URL: http://arxiv.org/abs/2412.15678v1
- Date: Fri, 20 Dec 2024 08:50:11 GMT
- Title: Multi-Pair Temporal Sentence Grounding via Multi-Thread Knowledge Transfer Network
- Authors: Xiang Fang, Wanlong Fang, Changshuo Wang, Daizong Liu, Keke Tang, Jianfeng Dong, Pan Zhou, Beibei Li
- Abstract summary: Temporal sentence grounding (TSG) aims to locate query-relevant segments in videos.
Previous methods follow a single-thread framework that cannot co-train different pairs.
We propose Multi-Pair TSG, which aims to co-train these pairs.
- Score: 57.72095897427665
- License:
- Abstract: Given video-query pairs consisting of untrimmed videos and sentence queries, temporal sentence grounding (TSG) aims to locate query-relevant segments in these videos. Although previous TSG methods have achieved remarkable success, they train each video-query pair separately and ignore the relationships between different pairs. We observe that similar video/query content not only helps the TSG model better understand and generalize cross-modal representations but also assists the model in locating complex video-query pairs. Previous methods follow a single-thread framework that cannot co-train different pairs and usually spend much time re-acquiring redundant knowledge, limiting their real-world applicability. To this end, in this paper, we pose a brand-new setting: Multi-Pair TSG, which aims to co-train these pairs. In particular, we propose a novel video-query co-training approach, the Multi-Thread Knowledge Transfer Network, to locate a variety of video-query pairs effectively and efficiently. First, we mine the spatial and temporal semantics across different queries so that they cooperate with each other. To learn intra- and inter-modal representations simultaneously, we design a cross-modal contrast module that explores semantic consistency via a self-supervised strategy. To fully align visual and textual representations across different pairs, we design a prototype alignment strategy that 1) matches object prototypes and phrase prototypes for spatial alignment, and 2) aligns activity prototypes and sentence prototypes for temporal alignment. Finally, we develop an adaptive negative selection module that adaptively generates a threshold for cross-modal matching. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method.
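As a rough illustration of the kind of cross-modal contrast the abstract describes, the sketch below implements a standard symmetric InfoNCE-style loss between pooled video and query embeddings. The function name, tensor shapes, and temperature are assumptions made for illustration only; the paper's actual cross-modal contrast module, prototype alignment, and adaptive negative selection are not specified at this level of detail in the abstract.

```python
# Hypothetical sketch (not the authors' implementation) of a cross-modal
# contrastive objective between video and query embeddings, in the spirit of
# the "cross-modal contrast module" named in the abstract.
import torch
import torch.nn.functional as F


def cross_modal_contrastive_loss(video_emb: torch.Tensor,
                                 query_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """video_emb, query_emb: (batch, dim) pooled features of matched pairs."""
    v = F.normalize(video_emb, dim=-1)
    q = F.normalize(query_emb, dim=-1)
    logits = v @ q.t() / temperature  # (batch, batch) pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric InfoNCE: each video should match its own query and vice versa,
    # treating the other pairs in the batch as negatives.
    loss_v2q = F.cross_entropy(logits, targets)
    loss_q2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2q + loss_q2v)


if __name__ == "__main__":
    # Example usage with random features for a batch of 8 video-query pairs.
    video = torch.randn(8, 256)
    query = torch.randn(8, 256)
    print(cross_modal_contrastive_loss(video, query).item())
```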
Related papers
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and VGG network.
The main advantage of our approach is that no training is required, which makes it applicable for any new type of action without any need to collect training samples for it.
arXiv Detail & Related papers (2023-04-13T22:20:54Z) - HyRSM++: Hybrid Relation Guided Temporal Set Matching for Few-shot Action Recognition [51.2715005161475]
We propose a novel Hybrid Relation guided temporal Set Matching approach for few-shot action recognition.
The core idea of HyRSM++ is to integrate all videos within the task to learn discriminative representations.
We show that our method achieves state-of-the-art performance under various few-shot settings.
arXiv Detail & Related papers (2023-01-09T13:32:50Z) - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z) - Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z) - Learning to Associate Every Segment for Video Panoptic Segmentation [123.03617367709303]
We learn coarse segment-level matching and fine pixel-level matching together.
We show that our per-frame computation model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets.
arXiv Detail & Related papers (2021-06-17T13:06:24Z) - Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model that predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z)