Semantic Role Aware Correlation Transformer for Text to Video Retrieval
- URL: http://arxiv.org/abs/2206.12849v1
- Date: Sun, 26 Jun 2022 11:28:03 GMT
- Title: Semantic Role Aware Correlation Transformer for Text to Video Retrieval
- Authors: Burak Satar, Hongyuan Zhu, Xavier Bresson, Joo Hwee Lim
- Abstract summary: This paper proposes a novel transformer that explicitly disentangles the text and video into semantic roles of objects, spatial contexts and temporal contexts.
Preliminary results on the popular YouCook2 benchmark indicate that our approach surpasses a current state-of-the-art method by a large margin on all metrics.
- Score: 23.183653281610866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the emergence of social media, voluminous video clips are uploaded every day, and retrieving the most relevant visual content with a language query becomes critical. Most approaches aim to learn a joint embedding space for plain textual and visual contents without adequately exploiting their intra-modality structures and inter-modality correlations. This paper proposes a novel transformer that explicitly disentangles the text and video into semantic roles of objects, spatial contexts and temporal contexts, with an attention scheme that learns the intra- and inter-role correlations among the three roles to discover discriminative features for matching at different levels. Preliminary results on the popular YouCook2 benchmark indicate that our approach surpasses a current state-of-the-art method by a large margin on all metrics, and outperforms two further SOTA methods on two metrics.
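A minimal PyTorch sketch of the role-aware correlation idea described in the abstract; this is not the authors' implementation, and the module layout, dimensions, and mean pooling are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RoleCorrelation(nn.Module):
    """Toy intra-/inter-role attention over three semantic roles."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obj, spa, tem):
        # obj, spa, tem: (batch, tokens, dim) embeddings for the objects,
        # spatial-context and temporal-context roles of one modality.
        roles = []
        for x in (obj, spa, tem):
            h, _ = self.intra(x, x, x)      # correlations within one role
            roles.append(h)
        pooled = []
        for i, q in enumerate(roles):
            kv = torch.cat([r for j, r in enumerate(roles) if j != i], dim=1)
            h, _ = self.inter(q, kv, kv)    # correlations across roles
            pooled.append(h.mean(dim=1))    # one vector per role
        return torch.stack(pooled, dim=1)   # (batch, 3, dim)
```

Matching would then run such a module on both the text side and the video side and score a text-video pair role by role, e.g. as a sum of per-role cosine similarities.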
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
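One way to picture the "latent shared semantics" named in the GLSCL entry above is a small bank of learnable concept queries that both modalities attend to; the sketch below is a guess at the general shape of such a mechanism, not GLSCL itself, and every name in it is an assumption:

```python
import torch
import torch.nn as nn

class SharedConcepts(nn.Module):
    """Toy shared-latent alignment: both modalities pass through one bank."""

    def __init__(self, dim=512, n_concepts=16, heads=8):
        super().__init__()
        self.concepts = nn.Parameter(torch.randn(n_concepts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, n_tokens, dim) features of either modality.
        q = self.concepts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        aligned, _ = self.attn(q, tokens, tokens)
        return aligned                      # (batch, n_concepts, dim)
```

Comparing the resulting fixed-size concept features would keep the cross-modal interaction cheap, which is consistent with the large reported speed-up.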
- Multi-Granularity Cross-Modality Representation Learning for Named Entity Recognition on Social Media [11.235498285650142]
Named Entity Recognition (NER) on social media refers to discovering and classifying entities from unstructured free-form content.
This work introduces multi-granularity cross-modality representation learning.
Experiments show that our proposed approach achieves SOTA or near-SOTA performance on two benchmark tweet datasets.
arXiv Detail & Related papers (2022-10-19T15:14:55Z)
- Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval [8.855547063009828]
We propose a Cross-modal Semantic Enhanced Interaction method, termed CMSEI for image-sentence retrieval.
We first design intra- and inter-modal spatial and semantic graph-based reasoning to enhance the semantic representations of objects.
To correlate the context of objects with the textual context, we further refine the visual semantic representation via cross-level object-sentence and word-image interactive attention.
arXiv Detail & Related papers (2022-10-17T10:01:16Z)
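To make the cross-level interactive attention in the CMSEI entry above concrete, here is a hedged sketch of just one direction (sentence words attending to detected object regions); the shapes, scaling, and residual fusion are assumptions rather than the paper's code:

```python
import torch
import torch.nn.functional as F

def word_to_object_attention(words, objects):
    # words:   (batch, n_words, dim)   sentence token features
    # objects: (batch, n_objects, dim) detector region features
    scale = words.size(-1) ** 0.5
    attn = torch.softmax(words @ objects.transpose(1, 2) / scale, dim=-1)
    visual_ctx = attn @ objects          # (batch, n_words, dim)
    # Residual fusion: enhance each word with its attended visual context.
    return F.normalize(words + visual_ctx, dim=-1)
```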
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval [66.2075707179047]
We propose RoME, a novel mixture-of-expert transformer that disentangles the text and the video into three levels.
We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels.
Our method outperforms the state-of-the-art methods on the YouCook2 and MSR-VTT datasets.
arXiv Detail & Related papers (2022-06-26T11:12:49Z)
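For the RoME entry above, a minimal sketch of how a mixture-of-expert fusion over disentangled levels could look: one text-video similarity per level, combined by text-conditioned gate weights. The three-way split, gating input, and names are assumptions:

```python
import torch
import torch.nn as nn

class MoEFusion(nn.Module):
    """Toy gate weighting per-level text-video similarity scores."""

    def __init__(self, text_dim=512, n_levels=3):
        super().__init__()
        self.gate = nn.Linear(text_dim, n_levels)

    def forward(self, text_global, level_sims):
        # text_global: (batch, text_dim) pooled text embedding
        # level_sims:  (batch, n_levels) per-level similarities
        w = torch.softmax(self.gate(text_global), dim=-1)
        return (w * level_sims).sum(dim=-1)  # final retrieval score
```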
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
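A small sketch of the "pairwise modality interactions" named in the entry above, modeled here with a shared bilinear layer per modality pair; the bilinear form and the mean fusion are illustrative choices, not the paper's design:

```python
import itertools
import torch
import torch.nn as nn

class PairwiseInteraction(nn.Module):
    """Toy model: one interaction vector per pair of modalities."""

    def __init__(self, dim=512):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, dim)

    def forward(self, modalities):
        # modalities: list of (batch, dim) pooled features,
        # e.g. [appearance, motion, audio] -> C(3, 2) = 3 pairs.
        pairs = [torch.relu(self.bilinear(a, b))
                 for a, b in itertools.combinations(modalities, 2)]
        return torch.stack(pairs, dim=1).mean(dim=1)  # fused representation
```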
- Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
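Finally, for the temporal grounding entry above, a hedged sketch of a regression-style head: phrase features attend over frame features, the grounded context is pooled globally, and a small MLP predicts a normalized interval. The pooling and head design are assumptions, not the paper's model:

```python
import torch
import torch.nn as nn

class IntervalRegressor(nn.Module):
    """Toy text-conditioned regression of a (start, end) interval."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 2), nn.Sigmoid())

    def forward(self, phrases, frames):
        # phrases: (batch, n_phrases, dim) mid-level phrase features
        # frames:  (batch, n_frames, dim)  video frame features
        ctx, _ = self.attn(phrases, frames, frames)  # local grounding
        pooled = ctx.mean(dim=1)                     # global summary
        # Normalized start/end in (0, 1); ordering is not enforced here.
        return self.head(pooled)
```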