Related papers: Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning

Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning

URL: http://arxiv.org/abs/2412.13543v1
Date: Wed, 18 Dec 2024 06:43:06 GMT
Title: Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
Authors: Yunbin Tu, Liang Li, Li Su, Qingming Huang,
Abstract summary: A new topic HIREST is presented, including video retrieval, moment retrieval, moment segmentation, and step-captioning.<n>We propose a query-centric audio-visual cognition network to construct a reliable multi-modal representation for three tasks.<n>This can cognize user-preferred content and thus attain a query-centric audio-visual representation for three tasks.
Score: 56.873534081386
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video has emerged as a favored multimedia format on the internet. To better gain video contents, a new topic HIREST is presented, including video retrieval, moment retrieval, moment segmentation, and step-captioning. The pioneering work chooses the pre-trained CLIP-based model for video retrieval, and leverages it as a feature extractor for other three challenging tasks solved in a multi-task learning paradigm. Nevertheless, this work struggles to learn the comprehensive cognition of user-preferred content, due to disregarding the hierarchies and association relations across modalities. In this paper, guided by the shallow-to-deep principle, we propose a query-centric audio-visual cognition (QUAG) network to construct a reliable multi-modal representation for moment retrieval, segmentation and step-captioning. Specifically, we first design the modality-synergistic perception to obtain rich audio-visual content, by modeling global contrastive alignment and local fine-grained interaction between visual and audio modalities. Then, we devise the query-centric cognition that uses the deep-level query to perform the temporal-channel filtration on the shallow-level audio-visual representation. This can cognize user-preferred content and thus attain a query-centric audio-visual representation for three tasks. Extensive experiments show QUAG achieves the SOTA results on HIREST. Further, we test QUAG on the query-based video summarization task and verify its good generalization.

Related papers

Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search [61.88597038104749]
We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning.<n>We preserve semantic consistency by integrating entity-level representations across visual and auditory streams.<n>We employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers.
arXiv Detail & Related papers (2026-01-20T08:23:29Z)
Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric [1.9774761182870912]
We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitles-based video segmentation approach. We conduct experiments on the YouCook2 benchmark, showing promising retrieval performance.
arXiv Detail & Related papers (2025-04-06T18:18:09Z)
Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query. We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task. For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities. For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information. At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings. At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
Zero-shot Audio Topic Reranking using Large Language Models [42.774019015099704]
Multimodal Video Search by Examples (MVSE) investigates using video clips as the query term for information retrieval. This work aims to compensate for any performance loss from this rapid archive search by examining reranking approaches. Performance is evaluated for topic-based retrieval on a publicly available video archive, the BBC Rewind corpus.
arXiv Detail & Related papers (2023-09-14T11:13:36Z)
You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos [19.711703590063976]
We propose a novel Multi-Granularity Perception Network (MGPN) that perceives intra-modality and inter-modality information at a multi-granularity level. Specifically, we formulate moment retrieval as a multi-choice reading comprehension task and integrate human reading strategies into our framework.
arXiv Detail & Related papers (2022-05-25T16:15:46Z)
DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS. DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence. We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
arXiv Detail & Related papers (2021-05-13T17:33:26Z)
A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query. To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-trimmed frame level. We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z)
Fine-grained Iterative Attention Network for TemporalLanguage Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query. We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video in-formation extraction. We evaluate the proposed method on three challenging public benchmarks: Ac-tivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes user's query and a long video as inputs. We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: feature encoding network and query-relevance computing module. In the encoding network, we employ a convolutional network with local self-attention mechanism and query-aware global attention mechanism to learns visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.