Rethinking Video Salient Object Ranking
- URL: http://arxiv.org/abs/2203.17257v1
- Date: Thu, 31 Mar 2022 17:55:54 GMT
- Title: Rethinking Video Salient Object Ranking
- Authors: Jiaying Lin and Huankang Guan and Rynson W.H. Lau
- Abstract summary: Salient Object Ranking (SOR) involves ranking the degree of saliency of multiple salient objects in an input image.
Most recently, a method was proposed for ranking salient objects in an input video based on a predicted fixation map.
We propose an end-to-end method for video salient object ranking (VSOR), with two novel modules.
- Score: 39.091162729266294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Salient Object Ranking (SOR) involves ranking the degree of saliency of
multiple salient objects in an input image. Most recently, a method was proposed
for ranking salient objects in an input video based on a predicted fixation
map. It relies solely on the density of the fixations within the salient
objects to infer their saliency ranks, which is incompatible with human
perception of saliency ranking. In this work, we propose to explicitly learn
the spatial and temporal relations between different salient objects to produce
the saliency ranks. To this end, we propose an end-to-end method for video
salient object ranking (VSOR), with two novel modules: an intra-frame adaptive
relation (IAR) module to learn the spatial relation among the salient objects
in the same frame locally and globally, and an inter-frame dynamic relation
(IDR) module to model the temporal relation of saliency across different
frames. In addition, to address the limited video types (just sports and
movies) and scene diversity in the existing VSOR dataset, we propose a new
dataset that covers different video types and diverse scenes on a large scale.
Experimental results demonstrate that our method outperforms state-of-the-art
methods in relevant fields. We will make the source code and our proposed
dataset available.
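The abstract sketches the architecture at a high level: per-frame object features are first related spatially within each frame (the IAR module) and then related temporally across frames (the IDR module) before a per-object saliency-rank score is predicted. Below is a minimal, hypothetical sketch of that two-module idea using standard multi-head attention in PyTorch; the class names, feature shapes, and attention-based design are assumptions made for illustration and are not the authors' released implementation.

```python
# Illustrative sketch only: an intra-frame relation step (objects in one frame
# attend to each other) followed by an inter-frame relation step (objects attend
# to objects from neighbouring frames), then a scalar rank score per object.
# All names and design choices here are hypothetical, not taken from the paper.
import torch
import torch.nn as nn


class IntraFrameRelation(nn.Module):
    """Relates object features within a single frame via self-attention (assumed)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (num_objects, dim) for one frame
        x = obj_feats.unsqueeze(0)                     # (1, N, dim)
        out, _ = self.attn(x, x, x)                    # objects attend to each other
        return self.norm(out + x).squeeze(0)           # residual + norm


class InterFrameRelation(nn.Module):
    """Relates current-frame objects to objects from other frames (assumed design)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cur: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # cur: (N, dim) objects of the current frame
        # context: (M, dim) objects pooled from neighbouring frames
        q, kv = cur.unsqueeze(0), context.unsqueeze(0)
        out, _ = self.attn(q, kv, kv)                  # cross-frame attention
        return self.norm(out + q).squeeze(0)


class ToyVSORHead(nn.Module):
    """Predicts a scalar saliency-rank score per detected object."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.intra = IntraFrameRelation(dim)
        self.inter = InterFrameRelation(dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, frames: list[torch.Tensor]) -> list[torch.Tensor]:
        # frames: list of (N_t, dim) object feature tensors, one per frame
        spatial = [self.intra(f) for f in frames]
        scores = []
        for t, cur in enumerate(spatial):
            others = [s for i, s in enumerate(spatial) if i != t]
            context = torch.cat(others, dim=0) if others else cur
            fused = self.inter(cur, context)
            scores.append(self.score(fused).squeeze(-1))   # (N_t,) rank scores
        return scores


if __name__ == "__main__":
    head = ToyVSORHead(dim=256)
    clip = [torch.randn(3, 256), torch.randn(4, 256)]      # 2 frames, 3 and 4 objects
    per_frame_scores = head(clip)
    print([s.shape for s in per_frame_scores])              # [(3,), (4,)]
```

In such a pipeline, the per-frame saliency ranks would simply be obtained by sorting each frame's scores in descending order.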
Related papers
- ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding [42.10086029931937]
Visual grounding aims to localize the object referred to in an image based on a natural language query.
Existing methods demonstrate a significant performance drop when there are multiple distractions in an image.
We propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue.
arXiv Detail & Related papers (2024-08-29T07:32:01Z) - VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation categories are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z) - Matching Anything by Segmenting Anything [109.2507425045143]
We propose MASA, a novel method for robust instance association learning.
MASA learns instance-level correspondence through exhaustive data transformations.
We show that MASA achieves even better performance than state-of-the-art methods trained with fully annotated in-domain video sequences.
arXiv Detail & Related papers (2024-06-06T16:20:07Z) - Video Referring Expression Comprehension via Transformer with Content-conditioned Query [68.06199031102526]
Video Referring Expression Comprehension (REC) aims to localize a target object in videos based on the queried natural language.
Recent improvements in video REC have been made using Transformer-based methods with learnable queries.
arXiv Detail & Related papers (2023-10-25T06:38:42Z) - Simplifying Open-Set Video Domain Adaptation with Contrastive Learning [16.72734794723157]
Unsupervised video domain adaptation methods have been proposed to adapt a predictive model from a labelled dataset to an unlabelled dataset.
We address a more realistic scenario, called open-set unsupervised video domain adaptation (OUVDA), where the target dataset contains "unknown" semantic categories that are not shared with the source.
We propose a video-oriented temporal contrastive loss that enables our method to better cluster the feature space by exploiting the freely available temporal information in video data (a generic sketch of such a loss appears after this list).
arXiv Detail & Related papers (2023-01-09T13:16:50Z) - Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed as DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z) - Temporal Saliency Query Network for Efficient Video Recognition [82.52760040577864]
Video recognition is a hot-spot research topic with the explosive growth of multimedia data on the Internet and mobile devices.
Most existing methods select the salient frames without awareness of the class-specific saliency scores.
We propose a novel Temporal Saliency Query (TSQ) mechanism, which introduces class-specific information to provide fine-grained cues for saliency measurement.
arXiv Detail & Related papers (2022-07-21T09:23:34Z) - What and When to Look?: Temporal Span Proposal Network for Video Visual Relation Detection [4.726777092009554]
Existing approaches to Video Visual Relation Detection (VidVRD) fall into two categories: segment-based and window-based.
We first point out the limitations these two methods have and propose Temporal Span Proposal Network (TSPN), a novel method with two advantages in terms of efficiency and effectiveness.
arXiv Detail & Related papers (2021-07-15T07:01:26Z) - Salient Object Ranking with Position-Preserved Attention [44.94722064885407]
We study the Salient Object Ranking (SOR) task, which aims to assign a ranking order to each detected object according to its visual saliency.
We propose the first end-to-end framework for the SOR task and solve it in a multi-task learning fashion.
We also introduce a Position-Preserved Attention (PPA) module tailored for the SOR branch.
arXiv Detail & Related papers (2021-06-09T13:00:05Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass the more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
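As referenced in the open-set video domain adaptation entry above, that paper's key ingredient is a temporal contrastive loss that exploits the free temporal structure of video. The snippet below is a generic InfoNCE-style sketch of such a loss, pairing two clips sampled from the same video as positives and treating clips from other videos in the batch as negatives; the function name, sampling scheme, and temperature are assumptions, and this is not the paper's exact formulation.

```python
# Generic temporal contrastive loss sketch (InfoNCE over clip embeddings).
# Assumption: two clips per video are sampled and embedded; clips from other
# videos in the batch serve as negatives. Not the OUVDA paper's exact loss.
import torch
import torch.nn.functional as F


def temporal_contrastive_loss(anchor: torch.Tensor,
                              positive: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    # anchor, positive: (B, D) embeddings of two clips drawn from the same video
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.t() / temperature           # (B, B) cosine-similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)    # diagonal entries are the positives


if __name__ == "__main__":
    a = torch.randn(8, 128)                    # clip 1 embeddings for 8 videos
    p = torch.randn(8, 128)                    # clip 2 embeddings (same videos)
    print(temporal_contrastive_loss(a, p).item())
```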
This list is automatically generated from the titles and abstracts of the papers on this site.