Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal
Grounding
- URL: http://arxiv.org/abs/2204.01450v1
- Date: Mon, 4 Apr 2022 13:07:05 GMT
- Title: Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal
Grounding
- Authors: Ziyue Wu, Junyu Gao, Shucheng Huang, Changsheng Xu
- Abstract summary: Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
- Score: 78.71529237748018
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounding temporal video segments described in natural language queries
effectively and efficiently is a crucial capability needed in
vision-and-language fields. In this paper, we deal with the fast video temporal
grounding (FVTG) task, aiming at localizing the target segment with high speed
and favorable accuracy. Most existing approaches adopt elaborately designed
cross-modal interaction modules to improve grounding performance, but these
modules become a bottleneck at test time. Although several common-space-based
methods are fast at inference, they can hardly capture comprehensive and
explicit relations between the visual and textual modalities.
In this paper, to tackle this speed-accuracy tradeoff, we propose a
commonsense-aware cross-modal alignment (CCA) framework, which incorporates
commonsense-guided visual and text representations into a complementary common
space for fast video temporal grounding. Specifically, the commonsense concepts
are explored and exploited by extracting the structural semantic information
from a language corpus. Then, a commonsense-aware interaction module is
designed to obtain bridged visual and text features by utilizing the learned
commonsense concepts. Finally, to maintain the original semantic information of
textual queries, a cross-modal complementary common space is optimized to
obtain matching scores for performing FVTG. Extensive results on two
challenging benchmarks show that our CCA method performs favorably against
state-of-the-art methods while running at high speed. Our code is available at
https://github.com/ZiyueWu59/CCA.
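As a rough illustration of the common-space idea described in the abstract, the sketch below (not the authors' released implementation; all module names, feature dimensions, and the concept-attention scheme are assumptions for illustration) shows how candidate moment features and a query feature could be bridged through a shared bank of concept embeddings and then ranked by similarity in a common space, so that test-time scoring needs no heavy cross-modal interaction:
```python
# Hypothetical sketch of commonsense-bridged common-space matching.
# Names, dimensions, and the attention scheme are illustrative assumptions,
# not the CCA paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CommonsenseBridgedScorer(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=768, concept_num=300, common_dim=256):
        super().__init__()
        # Learned concept embeddings standing in for commonsense concepts
        # mined from a language corpus (assumed to be trained jointly here).
        self.concepts = nn.Parameter(torch.randn(concept_num, common_dim))
        self.vis_proj = nn.Linear(vis_dim, common_dim)
        self.txt_proj = nn.Linear(txt_dim, common_dim)

    def bridge(self, feats):
        # Attend each feature over the concept bank and mix the attended
        # concept summary back in, giving "commonsense-bridged" features.
        attn = F.softmax(feats @ self.concepts.t(), dim=-1)       # (N, K)
        return F.normalize(feats + attn @ self.concepts, dim=-1)  # (N, D)

    def forward(self, moment_feats, query_feat):
        # moment_feats: (num_moments, vis_dim) candidate moment features
        # query_feat:   (txt_dim,) sentence-level query feature
        v = self.bridge(self.vis_proj(moment_feats))
        q = self.bridge(self.txt_proj(query_feat.unsqueeze(0)))
        # Matching scores are cosine similarities in the common space,
        # so ranking candidate moments is a single matrix product.
        return (v @ q.t()).squeeze(-1)                             # (num_moments,)


if __name__ == "__main__":
    scorer = CommonsenseBridgedScorer()
    moments = torch.randn(32, 1024)   # e.g. 32 candidate moments of one video
    query = torch.randn(768)
    print(scorer(moments, query).topk(3).indices)  # indices of top-3 moments
```
In such a setup, the video-side projections and bridged moment features can be cached per video, so answering a new query reduces to one matrix product; this is the usual source of the speed advantage of common-space methods over cross-modal interaction modules.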
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z) - Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which
capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to state-of-the-art approaches while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - Video sentence grounding with temporally global textual knowledge [8.470363694067386]
Temporal sentence grounding involves the retrieval of a video moment with a natural language query.
We propose a Pseudo-query Intermediary Network (PIN) to achieve an improved alignment of visual and comprehensive pseudo-query features.
arXiv Detail & Related papers (2024-04-21T10:41:04Z) - SOC: Semantic-Assisted Object Cluster for Referring Video Object
Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z) - Deeply-Coupled Convolution-Transformer with Spatial-temporal
Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Visual Spatio-temporal Relation-enhanced Network for Cross-modal
Text-Video Retrieval [17.443195531553474]
Cross-modal retrieval of texts and videos aims to understand the correspondence between vision and language.
We propose a Visual Spatio-temporal Relation-enhanced semantic network (CNN-SRNet), a cross-modal retrieval framework.
Experiments are conducted on both MSR-VTT and MSVD datasets.
arXiv Detail & Related papers (2021-10-29T08:23:40Z) - BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded
Dialogues [95.8297116307127]
We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos.
Specifically, our approach exploits both spatial- and temporal-level information, and learns dynamic information diffusion between the two feature spaces.
BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
arXiv Detail & Related papers (2020-10-20T07:43:00Z)